When you jump into a data lake, you’ll find that, if properly designed, it will be split into designated zones. Each zone has a mission to fulfill that justifies its existence. In this article, I'll focus on the curated zone and speak to how we strive to create a happy zone that's easy to navigate, broad in scope, and flexible in structure, and that provides a single source of truth for a data warehouse or sandbox.
Let’s start at the bottom: the base of the data lake has always been the raw zone, but it can be accompanied by a curated zone, a sandbox, or even a data warehouse zone. The data lake’s raw zone has always made sense, as it archives unfiltered data from all source systems, with all variations of that data over time. Data warehousing, too, has been well-defined for decades: it contains structured fact and dimension tables oriented toward the analysis of business needs. A sandbox is a working area for data science and one-off analysis. All of these zones have a clear, intuitive, or well-established purpose.
The Curated Zone
On the other hand, the definition of the curated zone can be vague. Often, it is weakly defined as something between the raw zone and a data warehouse. Without a clear mission, it’s hard to justify the zone’s existence, so the curated zone is effectively on trial. Let’s see if we can come up with a more useful definition.
Let’s start with our vague definition and see if we can refine it: the curated zone sits between the raw zone and a data warehouse. By comparing those two other zones, we might get a better picture. For our purposes, we will keep characteristics common to both the raw zone and a data warehouse, but where they differ, we want to see if a useful middle ground exists.
The major commonality between the raw zone and a data warehouse is that they combine data from multiple sources into a single repository. Since we’re including common features in our definition, our first characteristic is that it is a consolidated data store. It provides a sole place to access data from a variety of different sources and serves as a single source of truth. This setup also allows you to analyze and compare the data using only a single interface and tool.
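To make the "single interface and tool" idea concrete, here is a minimal sketch in plain Python using SQLite as a stand-in for that single query engine. The table names, columns, and rows are made up for illustration; in a real lake this role would be played by something like Spark SQL over curated-zone tables.

```python
import sqlite3

# Two query-ready tables standing in for data that originated in two
# separate source systems (a CRM and an ERP, both hypothetical).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE crm_customers (id INTEGER, name TEXT)")
con.execute("CREATE TABLE erp_orders (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO crm_customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
con.executemany("INSERT INTO erp_orders VALUES (?, ?)",
                [(1, 120.0), (1, 80.0), (2, 50.0)])

# One interface, one tool: join and compare data from different sources
# without hopping between the systems that produced it.
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM crm_customers c
    JOIN erp_orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
# rows -> [('Ada', 200.0), ('Grace', 50.0)]
```

The payoff is exactly the consolidation described above: the analyst writes one query in one tool, even though the customer and order data came from different systems.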
A second commonality between the raw zone and a data warehouse is that they are automated. Both zones are populated by processes on a recurring or continuous basis. With these automated processes come stability and reliability. In contrast, consider the sandbox - as mentioned previously, the sandbox is a home for ad-hoc, developmental, or one-off analysis. The data there may have been placed manually and could be unorganized, incomplete, or in a half-finished state. In short, the sandbox is not a reliable store of information. So, our second common characteristic is that, like the raw zone and the data warehouse, the curated zone is reliable because it is populated by automated processes.
Let’s now look at how the raw zone and data warehouse are different. The major difference lies in the data transformation process. As previously mentioned, the raw zone contains an exact copy of everything, warts and all, with historical data available as well. It is extremely broad and difficult to navigate, and extracting useful data from it is cumbersome. Unstructured or semi-structured data ultimately creates barriers to access. On the flip side, the data warehouse contains a highly filtered and transformed set of facts and dimensions. It is relatively narrow, comparatively rigid in structure, and designed to be easy to navigate and analyze. With the curated zone, we’re looking for a happy medium between these two extremes.
A Happy Medium
What is that happy medium? Ultimately, it's a judgment call, but I think we can propose a baseline for discussion. Let's target what we really need from the curated zone: it must be easier to navigate and extract data than the raw zone, broader in scope and less rigid in structure than the data warehouse, and it must also be a valid source for a data warehouse or a sandbox. That helps us close in on a working definition:
- The data must be organized into query-ready tables. For database sources, it would look like a current mirror of the source system. For unstructured or semi-structured sources, it would require imposing structure or extracting structured data from them. Those tables don’t need to be as rigid as relational tables, but they should fit within a Hive table definition.
- It should be automatically populated from the raw zone. Once the raw zone has the newest version of the data, the curated zone gets updated with the current information.
- It must also contain the breadth of the operational data store. This gives analysts and data scientists one location and one tool where they can query and join data from all of their sources. Note that, unlike the data warehouse, it is not pared down to known analytic vectors.
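The first two bullets can be sketched in a few lines of plain Python. The raw zone keeps every historical version of a record; the automated refresh keeps only the newest version of each one, producing the "current mirror of the source system" the curated zone calls for. The record shape and field names here are hypothetical, chosen purely for illustration.

```python
from datetime import date

# Toy raw-zone feed: every version of each customer record ever ingested.
raw_customers = [
    {"id": 1, "name": "Ada",   "city": "Boston",  "ingested": date(2023, 1, 5)},
    {"id": 1, "name": "Ada",   "city": "Chicago", "ingested": date(2023, 3, 9)},
    {"id": 2, "name": "Grace", "city": "Denver",  "ingested": date(2023, 2, 1)},
]

def curate_current(raw_rows, key="id", version_col="ingested"):
    """Keep only the newest version of each record, yielding a
    current mirror of the source system for the curated zone."""
    latest = {}
    for row in raw_rows:
        k = row[key]
        if k not in latest or row[version_col] > latest[k][version_col]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

curated = curate_current(raw_customers)
# Ada's current city is Chicago; the older Boston version stays
# behind in the raw zone, where full history is preserved.
```

In a real lake this step would typically run as a scheduled Spark or pipeline job writing query-ready tables, but the logic is the same: history lives in the raw zone, and the curated zone reflects the present.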
This picture of a unified data store seems to be the happy medium we’re looking for. It can be queried with common data tools, like notebooks in Azure Synapse or Databricks. It doesn't have the confusion of multiple revisions, but it is not the cultivated garden of a data warehouse either. It has the full extent of the operational data needed for exploratory queries and analysis. In short, it’s a happy medium!
If you get the curated zone right, it will expand the insights available to you from a traditional data warehouse structure, while eliminating the tedium of working in the raw zone. If you need help planning your cloud solutions, BlueGranite has the experts to help you organize and curate a data lake to fit your needs. Contact us today!
If you're interested in my first post covering the data lake raw zone, you can read it here!