Everything, Including the Kitchen Sink
A data lake is a persistent raw archive of any potentially actionable data. The philosophy really is “everything, including the kitchen sink.” This means that a data lake will archive data from many different business systems and non-traditional sources, including sensor data, logs, image data, streaming data, and audio or video data.
An ambitious data lake may also include information from external sources, such as weather, traffic, or stock market data. But that’s not all! A data lake won’t just store the current version of a record or file; it will also retain every revision it can get. By capturing everything, undiluted, a business will be able to answer the questions of today and the new questions of tomorrow.
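To make "retain every revision" concrete, here is a minimal sketch of an append-only archive in Python. The `RevisionArchive` class and its method names are hypothetical, invented for illustration; a real data lake would sit on distributed storage rather than an in-memory dictionary. The sketch also assumes revisions arrive in chronological order.

```python
from datetime import datetime, timezone


class RevisionArchive:
    """Hypothetical append-only store: revisions are added, never overwritten."""

    def __init__(self):
        # record_id -> list of (captured_at, record) in arrival order
        self._revisions = {}

    def capture(self, record_id, record, captured_at=None):
        """Append a new revision and return how many revisions now exist."""
        captured_at = captured_at or datetime.now(timezone.utc)
        versions = self._revisions.setdefault(record_id, [])
        versions.append((captured_at, dict(record)))  # copy so later edits don't leak in
        return len(versions)

    def latest(self, record_id):
        """Return the most recently captured revision, or None if unknown."""
        versions = self._revisions.get(record_id, [])
        return versions[-1][1] if versions else None

    def as_of(self, record_id, moment):
        """Return the revision that was current at `moment`, or None."""
        matches = [rec for ts, rec in self._revisions.get(record_id, []) if ts <= moment]
        return matches[-1] if matches else None

    def history(self, record_id):
        """Return every captured revision for the record."""
        return list(self._revisions.get(record_id, []))
```

Because nothing is overwritten, a question such as "what did this customer record look like last quarter?" stays answerable through `as_of`, even if nobody thought to ask it when the data was captured.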
That raw material, though, is a whole mess of data. It requires gobs of storage, gobs of processing power, and gobs of connectivity to continually archive all this data as it is generated. A traditional relational database does not scale easily into the petabyte range for this whole-hog approach, nor does it store or consume unstructured data effectively. At this scale, you’re looking for a platform with near-infinite scalability paired with elastic storage and processing power.
An Organized Mess
Despite the name, the mess of data in a data lake need not be total chaos. In fact, it should be an organized mess. Metadata captured during ingestion does not transform the data, but it does give the data additional context. At the very least, data can then be categorized by its source and the date it was captured. This enriches the data and gives it some level of organization without contaminating its raw nature.
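As a rough illustration of capturing metadata without touching the data, the sketch below writes a payload byte-for-byte and records its source and capture date in a sidecar file. The function name `ingest_raw`, the `source=.../date=...` folder layout, and the `.meta.json` sidecar are assumptions chosen for this example, not a prescribed standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def ingest_raw(payload: bytes, source: str, lake_root: Path,
               captured_at: Optional[datetime] = None) -> Path:
    """Store the payload unchanged and describe it in a sidecar metadata file."""
    captured_at = captured_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()

    # Hypothetical layout: <root>/source=<source>/date=<YYYY-MM-DD>/<digest>.raw
    folder = lake_root / f"source={source}" / f"date={captured_at:%Y-%m-%d}"
    folder.mkdir(parents=True, exist_ok=True)

    data_path = folder / f"{digest}.raw"
    data_path.write_bytes(payload)  # the raw bytes are written exactly as received

    metadata = {
        "source": source,
        "captured_at": captured_at.isoformat(),
        "sha256": digest,
        "size_bytes": len(payload),
    }
    (folder / f"{digest}.meta.json").write_text(json.dumps(metadata, indent=2))
    return data_path
```

The context lives in the folder names and the sidecar file, so the archived bytes stay raw while still being findable by source and date.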
Multiple Tools Have Many Uses
With a variety of data sources and types, there is an array of tools to get the job done. For ingesting relational data alone, there are at least half a dozen tools. Broadly speaking, there are three types of data to be ingested: batched data, streaming data, and binary data.
Much to my dismay, there is no one tool that is ideally suited for all three types. A true data lake at an enterprise may wind up using two or three (or more!) ingestion tools for dozens of data sources. Orchestrating that aspect alone is a significant task, but that giant mess of data is the raw material for insights now and into the future.
What to Do with All that Mess
Here’s the problem with a data lake (stop me if you’ve heard this already): It’s raw data. It’s a mess. To get insight out of it, you need to make sense of that mess and integrate it into something coherent, which might sound like a data warehouse. And for some enterprises, the data lake can be little more than a massive primary staging layer. For others, it can be a data science playground.
With a data lake feeding a data warehouse, adding new items to the warehouse is merely a matter of sourcing the required information from the data lake. The data will already be available and ready to go. In fact, it may even be possible to make a virtual data warehouse as a layer of views on the data lake itself. Either way, the data lake makes the data warehouse far more agile.
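The "layer of views" idea can be sketched with Python's built-in `sqlite3` module standing in for a real lake query engine. The table `raw_orders`, the view `orders_current`, and the sample rows are all made up for illustration. The raw table keeps every captured revision as text, and the view presents only the latest revision of each order with a typed amount.

```python
import sqlite3


def build_virtual_warehouse() -> sqlite3.Connection:
    """Create a raw table plus a warehouse-style view (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Raw zone: every revision is kept, values stored as received
        CREATE TABLE raw_orders (
            order_id TEXT, customer TEXT, amount TEXT,
            source TEXT, captured_at TEXT
        );

        -- Virtual warehouse layer: latest revision per order, typed amount
        CREATE VIEW orders_current AS
        SELECT r.order_id, r.customer, CAST(r.amount AS REAL) AS amount
        FROM raw_orders AS r
        WHERE r.captured_at = (
            SELECT MAX(i.captured_at)
            FROM raw_orders AS i
            WHERE i.order_id = r.order_id
        );
    """)
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)",
        [
            ("A1", "Acme", "100.00", "erp", "2024-01-01"),
            ("A1", "Acme", "120.50", "erp", "2024-01-02"),  # later revision of A1
            ("B2", "Globex", "75.25", "web", "2024-01-01"),
        ],
    )
    return conn


conn = build_virtual_warehouse()
current = conn.execute(
    "SELECT order_id, customer, amount FROM orders_current ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Adding a new warehouse item in this setup means defining another view over data that is already in the lake; the raw rows themselves are never modified.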
Additionally, a data lake is not only for feeding data warehouses. It can become a one-stop shop for data science efforts too. Because it captures everything, a data lake may hold a treasure trove of hidden insights. Machine learning, text analysis, image recognition, and other processes will have the gobs of data they need. That can reveal new insights into the workings of a business and audit conventional wisdom about your business processes.
In the data-driven world of today and tomorrow, having all your business data available to gain a competitive edge is a must, not an option. For more information on data lakes and data warehouses, check out this blog post to learn about the differences. If you are planning your data lake and need help getting started, contact BlueGranite today for insights into the right solution for your firm.