When You Need a Data Lake
Posted On July 29, 2019
If you work with data, you have probably encountered the nebulous term “data lake.” What is a data lake? What does it mean for my business? Are we ready for one? Let’s talk about when you need a data lake and what you would use it for. At some point in most organizations’ lives, the question of when a data lake is needed comes up.
What is a Data Lake?
Is it a fad? At its barest, a data lake is a location where data assets can be stored in any form, in any data store or structure. Some data lakes combine database technologies with file system storage. Amazon S3 is a great option for storing data assets: storage is relatively cheap compared to other mechanisms, and files can be stored in multiple formats.
A survey conducted in 2015 showed that only 1.12% of respondents felt that the data lake concept is well defined and understood. Of course, 2015 is an eternity ago in technology years, but since many companies are typically years behind the technology curve, perhaps this number is not all that crazy.
It is More Than Just Storing Data
Of course, storing data is only half the battle. The true value of a data lake is the ability to pull out the data that is needed. Some data lakes become a rat’s nest of data assets thrown in with no organization, which makes retrieving this valuable data a daunting task. It is worse to spend the time building a data lake that no one uses than to have no data lake at all.
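One way to keep a lake from turning into a rat’s nest is to record what each data set is and where it lives at the moment it is stored. The following is only a minimal sketch of that idea, not a real catalog product. The CatalogEntry and Catalog names, the fields, and the s3://example-bucket locations are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Describes one data set in the lake (all fields are illustrative)."""
    name: str
    location: str          # hypothetical storage path
    file_format: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """A tiny in-memory catalog: register data sets, then search them."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, keyword: str) -> list[str]:
        """Return names of data sets whose description or tags match keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.description.lower() or kw in {t.lower() for t in e.tags}
        )


catalog = Catalog()
catalog.register(CatalogEntry(
    name="sensor_readings",
    location="s3://example-bucket/raw/sensors/",   # hypothetical bucket
    file_format="json",
    description="Raw device telemetry",
    tags={"iot", "raw"},
))
catalog.register(CatalogEntry(
    name="web_logs",
    location="s3://example-bucket/raw/weblogs/",   # hypothetical bucket
    file_format="csv",
    description="Web server access logs",
    tags={"logs", "raw"},
))

print(catalog.find("iot"))
print(catalog.find("raw"))
```

Even a small amount of metadata like this gives people a way to find data without knowing the storage layout in advance.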
Questions to Answer When You Know You Need a Data Lake
One of the most important decisions in a data lake project is choosing the right file format for persisting data. To answer that question, you need to think about a few things. How will your data most frequently be accessed? How will your data be stored (structured versus unstructured)? What kind of partitioning do you need to factor into your data sets?
Choosing the proper file format contributes to the overall success of your big data project. There is a powerful connection between the file format and the query execution engine that makes sense for your teams. Not every file format makes sense for every team, its skills, and its use cases for the data.
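To make the partitioning question concrete, here is a small sketch of one common convention: building storage keys out of key=value path segments so that data can later be filtered by source and date. The partition_key function, the choice of partition columns, and the file name are assumptions for illustration; the right partitioning depends on how your data is actually queried.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a key such as source=.../year=.../month=.../day=.../file.

    The event time is converted to UTC so that all writers agree on
    which day a record belongs to.
    """
    t = event_time.astimezone(timezone.utc)
    return str(
        PurePosixPath(
            f"source={source}",
            f"year={t.year:04d}",
            f"month={t.month:02d}",
            f"day={t.day:02d}",
            filename,
        )
    )


key = partition_key(
    "weblogs",
    datetime(2019, 7, 29, 14, 5, tzinfo=timezone.utc),
    "events-0001.json",   # hypothetical file name
)
print(key)  # source=weblogs/year=2019/month=07/day=29/events-0001.json
```

With a layout like this, a query for one day of web logs only has to touch the files under a single prefix instead of scanning the whole lake.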
The Architecture of a Data Lake
A common misconception I have seen is that a data lake is synonymous with a specific data store or technology. It is okay to mix technologies inside of a data lake. Part of how you know you need a data lake is that your data lives in all types of formats and storage mechanisms. Putting data into your data lake in a raw format is not only okay, it is optimal. Think of a data lake as store now, analyze later. It is presumptuous to assume that we know today every way this data will be used.
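The store now, analyze later idea can be sketched in a few lines: records are written exactly as they arrive, and a shape is imposed only when someone reads them. This is an illustrative sketch under simple assumptions, with a local temporary directory standing in for the lake and JSON lines as the raw format. The land_raw and read_with_schema functions and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, name: str, raw_lines: list[str]) -> Path:
    """Store records exactly as received; no parsing or validation on write."""
    target = lake_root / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(raw_lines) + "\n", encoding="utf-8")
    return target


def read_with_schema(path: Path, fields: list[str]) -> list[dict]:
    """Apply a schema at read time: keep requested fields, None when absent."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


# A temporary directory stands in for the lake in this sketch.
lake = Path(tempfile.mkdtemp())
path = land_raw(lake, "devices.jsonl", [
    '{"device": "a1", "temp": 21.5, "fw": "1.0"}',
    '{"device": "b2", "humidity": 40}',
])

# Later, an analyst decides which fields matter for a given question.
rows = read_with_schema(path, ["device", "temp"])
print(rows)
```

Because nothing was discarded on write, a different question tomorrow (say, about humidity or firmware versions) can be answered from the same raw file.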
High Velocity Data
Data lakes are typically used to store data generated by high velocity sources in a constant stream. Think of how much data can be generated by things like IoT devices, web logs, or product logs. Valuable data is being written constantly. Some of this data is structured; some is unstructured.
In addition to the velocity of data, we want to look at how the data is being generated. If data is written in small bursts, the approach is different from the one for data written constantly at all times of the day. For data being written in bursts, a data lake is the optimal solution. If you are dealing with many reads and writes to data in a tabular format, a data warehouse would be your optimal solution. Our data warehouse consulting services are another option that may make sense for you.
Also, a data lake and a data warehouse are not mutually exclusive. Most large organizations use both a lake and a warehouse, each for the appropriate data sources. An RDBMS can be a valuable asset for an organization in need of a data warehouse.
How Do You Know When You Need a Data Lake?
We now know what a data lake is and how a lake differs from a warehouse. Two of the key factors in needing a data lake are the size of the data you are working with and the amount of data you need to stream. If you are not in the big data realm with streaming data needs, a data lake does not make a ton of sense for you.
Before deciding that you do not need a data lake, look at exactly what you are planning to do with the data. A data lake gives you a storage location for all of this valuable data, but you still need to know how you will access it and what you plan to access it for. One of the great benefits of a data lake is the flexibility it provides in how data will eventually be stored.
Of course, there is always the possibility that you have no idea what you want to do with the data yet. That makes a strong case for needing a data lake. If set up the right way, storage is cheap. If we are ready to store now and analyze later, a data lake makes much more sense than a data warehouse.
Data at Rest
I am a big fan of storing data at rest in many of its forms. Your data science team can perhaps use data in its rawest format, but other business stakeholders want to see data at each of its intermediate steps. Keeping those intermediate forms can be useful before the data reaches the final layer in your data warehouse.
Data Management and Data Governance
Both data lakes and data warehouses pose challenges when it comes to data governance. In the data warehouse, the challenge is the need to constantly maintain and manage all the data that comes in, and to make sure the data is added according to consistent business logic and a consistent data model. Data lakes, on the other hand, are often criticized as chaotic and impossible to govern effectively. You need to ensure you have a good way to address these challenges.
When You Need a Data Lake
At the end of the day, there are guidelines for knowing when you need a data lake. But the reality is that, whenever possible, you should not try to reinvent the wheel in this industry. Make the decisions on your data with data. Having the in-house expertise and the product roadmap to handle such a project will dictate whether or not it makes sense to create your lake and warehouse in house. You may want to supplement your teams. Or, you can go with a team that has been creating effective data lakes since the inception of data lakes. A mistake in your data lake architecture is extremely expensive, so choose wisely.