
What's a Data Lake and Why Businesses Are Moving Towards It

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.


Our Discovery Sessions are the most productive! We try to understand your pain points!

Questions for Clients: Data Generation 

  • What is the data source? Is it a database, a data warehouse, an IoT swarm, or a legacy system?
  • How is the data stored on the source system? Is it stored forever, or does it have a time-to-live?
  • What’s the frequency and volume of data generated? Events per second? GBs per hour?
  • How often do data inconsistencies (nulls, unformatted data) occur? Is the source reliable?
  • Is the source system error-prone? How often do errors occur?
  • Does the generated data contain duplicates?
  • Can some part of the data arrive late? If yes, how late can it be?
  • What’s the schema? Do we need to join data from multiple source systems, or just one?
  • If the schema changes, how is that communicated to the consumers? What’s the current process?
  • How often is new data generated, and how often should it be pulled?
  • Will reading the data for ingestion impact the source system’s performance?
  • Are there any existing data quality checks?
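Several of the questions above (nulls, duplicates, late-arriving data) can be answered by profiling a sample from the source. The following is a minimal sketch, not a definitive implementation: the function `profile_records`, the field names (`id`, `event_time`, `arrival_time`), and the sample rows are all hypothetical and made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def profile_records(records, key_field, event_time_field, arrival_time_field):
    """Summarise nulls, duplicate keys, and lateness for a list of dict records."""
    null_counts = Counter()
    key_counts = Counter()
    max_lateness = timedelta(0)

    for record in records:
        # Count missing values per field (None or empty string).
        for field, value in record.items():
            if value is None or value == "":
                null_counts[field] += 1

        # Count how often each key appears; records without a key are skipped.
        key = record.get(key_field)
        if key is not None:
            key_counts[key] += 1

        # Lateness = gap between when the event happened and when it arrived.
        event_time = record.get(event_time_field)
        arrival_time = record.get(arrival_time_field)
        if event_time is not None and arrival_time is not None:
            max_lateness = max(max_lateness, arrival_time - event_time)

    duplicate_rows = sum(count - 1 for count in key_counts.values() if count > 1)
    return {
        "rows": len(records),
        "null_counts": dict(null_counts),
        "duplicate_rows": duplicate_rows,
        "max_lateness": max_lateness,
    }


# Hypothetical sample pulled from a source system.
sample = [
    {"id": 1, "value": 10,
     "event_time": datetime(2023, 1, 1, 12, 0), "arrival_time": datetime(2023, 1, 1, 12, 1)},
    {"id": 2, "value": None,
     "event_time": datetime(2023, 1, 1, 12, 5), "arrival_time": datetime(2023, 1, 1, 12, 6)},
    {"id": 2, "value": None,
     "event_time": datetime(2023, 1, 1, 12, 5), "arrival_time": datetime(2023, 1, 1, 12, 6)},
    {"id": 3, "value": 7,
     "event_time": datetime(2023, 1, 1, 9, 0), "arrival_time": datetime(2023, 1, 1, 12, 10)},
]

report = profile_records(sample, "id", "event_time", "arrival_time")
print(report)
```

Running a profile like this on a representative sample gives concrete numbers to bring to the discovery session instead of estimates.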
 

Questions for Clients: Data Ingestion

  • What’s the data destination after ingestion?
  • What’s the access pattern? How frequently will the data be accessed?
  • What’s the volume of arriving data?
  • What’s the format of incoming data? JSON, CSV, TEXT? 
  • Do we need any in-flight transformations?
  • Is the data arriving in its purest form? Can we move it to the serving layer without processing?
  • Will it be batch or streaming?
  • Is it going to be pulled into the new data system, or pushed?
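To make the questions about incoming formats and in-flight transformations more concrete, here is a small sketch that reads newline-delimited JSON and CSV into a common record shape and applies one light transformation on the way in. The helper names (`parse_json_lines`, `parse_csv`, `normalize`) and the payloads are assumptions for illustration only; a real pipeline would likely use a dedicated ingestion tool.

```python
import csv
import io
import json


def parse_json_lines(text):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def parse_csv(text):
    """Yield one dict per CSV row, using the header row for field names."""
    yield from csv.DictReader(io.StringIO(text))


def normalize(record):
    """Hypothetical in-flight transformation: lower-case keys, trim string values."""
    return {
        key.strip().lower(): (value.strip() if isinstance(value, str) else value)
        for key, value in record.items()
    }


# Made-up payloads in two different incoming formats.
json_payload = '{"ID": 1, "Name": " sensor-a "}\n{"ID": 2, "Name": "sensor-b"}\n'
csv_payload = "ID,Name\n3, sensor-c \n"

records = [normalize(r) for r in parse_json_lines(json_payload)]
records += [normalize(r) for r in parse_csv(csv_payload)]

# Note: CSV values arrive as strings, so the CSV record has id "3", not 3.
print(records)
```

The type difference between the JSON and CSV records is exactly the kind of detail the format question is meant to surface before the pipeline is built.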

Questions for Clients: Data Storage

  • What’s the expected read and write speed for the data?
  • Is the storage compatible with the required read/write speed?
  • Will the selected storage solutions create a bottleneck for the consumers?
  • Are you using the right storage for the workload, or, for example, using an object store like S3 for frequently updated objects?
  • Is the storage scalable and future-proof?
  • Is the storage capable of meeting the business SLAs (service level agreements)?
  • Are you capturing metadata, data flow, and data lineage?
  • Does it need to support complex queries, like a data warehouse?
  • Is there a compliance requirement to meet? Is the storage compliant?
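Whether the storage can keep up with the expected write speed is often a back-of-the-envelope calculation. The sketch below shows one way to do it; every number in it (event rate, event size, peak factor, storage limit) is an assumption chosen for illustration and should be replaced with figures from the client and the storage vendor.

```python
def required_write_mb_per_s(events_per_second, avg_event_bytes, peak_factor=1.0):
    """Sustained write throughput needed, in MB/s (1 MB = 1,000,000 bytes)."""
    return events_per_second * avg_event_bytes * peak_factor / 1_000_000


def gb_per_hour(mb_per_s):
    """Convert a sustained MB/s rate into GB per hour (1 GB = 1,000 MB)."""
    return mb_per_s * 3600 / 1000


# Assumed workload: 5,000 events/s, 2,000 bytes each, with peaks at 3x the average.
needed_mb_per_s = required_write_mb_per_s(5000, 2000, peak_factor=3.0)
hourly_gb = gb_per_hour(needed_mb_per_s)

# Assumed sustained write limit of the candidate storage (hypothetical figure).
storage_limit_mb_per_s = 50.0
meets_requirement = needed_mb_per_s <= storage_limit_mb_per_s

print(f"needed: {needed_mb_per_s} MB/s, about {hourly_gb} GB/hour")
print(f"storage can keep up: {meets_requirement}")
```

With these assumed inputs the workload needs 30 MB/s at peak, roughly 108 GB per hour, which fits under the assumed 50 MB/s limit.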

Questions for Clients: Data Transformation

  • Can we minimize the migration of data between different zones?
  • Are the transformations simple or complex? Will we be using Pandas? PySpark?

Questions for Clients: Serving Data

  • Does the data properly represent the ground truth? Is the data biased?
  • Is the data in a form that can go through Feature Engineering?
  • Is it easily discoverable? Can stakeholders easily find the relevant dataset?
  • What are the boundaries? Permissions for different roles and users?
  • Do we need to build a serving layer for the data lake?
  • Do we need a frontend, a mobile app, or APIs for serving the data?
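The questions about boundaries and discoverability can be prototyped as a simple role-to-dataset mapping before any platform is chosen. This is only a sketch under stated assumptions: the roles, dataset names, and the helpers `is_allowed` and `discoverable_datasets` are hypothetical, and a real deployment would typically rely on the storage platform's own access controls rather than application code.

```python
# Hypothetical mapping: role -> dataset -> allowed actions.
ROLE_PERMISSIONS = {
    "analyst": {
        "curated/sales": {"read"},
    },
    "data_engineer": {
        "raw/sales": {"read", "write"},
        "curated/sales": {"read", "write"},
    },
}


def is_allowed(role, dataset, action):
    """Return True if the role may perform the action on the dataset."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(dataset, set())


def discoverable_datasets(role):
    """List the datasets a role can read, i.e. what it should be able to find."""
    datasets = ROLE_PERMISSIONS.get(role, {})
    return sorted(name for name, actions in datasets.items() if "read" in actions)


print(is_allowed("analyst", "curated/sales", "read"))   # True
print(is_allowed("analyst", "raw/sales", "read"))       # False
print(discoverable_datasets("data_engineer"))           # ['curated/sales', 'raw/sales']
```

Writing the mapping down this way, even on paper, makes it easier to agree with stakeholders on who should see which zone of the data lake.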

Author

Muhammad Hamza Javed, Founder & CEO of Numpy Labs, is an AWS Certified Solutions Architect (with ~50 certificates under his belt) and a 4x course author with a decade of experience working with Fortune 500 companies. He has delivered hundreds of projects, all revolving around AWS Cloud and data lakes. He leads a team of designers, developers, and data engineers and enables them to achieve their professional goals.

Muhammad Hamza Javed

Founder & CEO, Numpy Labs.


Leverage DATA as your competitive advantage.

Copyright © Numpy Labs LLC – All Rights Reserved 2023
