We recently explored how powerful visualization is in Business Intelligence (BI). In this next BI Basics post, we’re investigating the impact of the “T” in “ETL” – the Extract, Transform, Load data metamorphosis that is central to BI.
Data transformation turns numbers into insight. In the process of reporting data, even the most routine operational reports do basic transformations – joining different data tables together, filtering rows and returning requested fields. The information returned to the user is almost always distinct from the information as stored. That transformed data is what gives an organization leverage to compete and succeed.
Data Transformation is Vital
What’s really happening to an organization’s data between its at-rest storage format and the form served to a report? It’s being transformed from a format that’s handy for storage and processing into one that can change the fate of a business. Reporting systems make these transformations by applying rules that give a specific interpretation of the data.
For instance, a typical sales report gathers the date of sale, sales price, item(s) purchased, discounts applied, store or channel, and other relevant information. While those bits of information might be stored in 27 different tables in the source system, the report can present all those pieces of data as a single block of information to the user. By combining these numbers and applying rules we get a logical or a representative look at the data, bringing what’s pertinent to the forefront and removing much of the behind-the-scenes housekeeping miscellany.
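As a sketch of that idea, here is a tiny, self-contained example using Python’s built-in sqlite3 module. The table names, columns, and values are all invented for illustration; a real source system would spread this data across far more tables.

```python
import sqlite3

# Hypothetical miniature source schema: in a real OLTP system these
# pieces might live in dozens of tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (id INTEGER, item_id INTEGER, sale_date TEXT,
                    price REAL, discount REAL, channel TEXT);
CREATE TABLE items (id INTEGER, name TEXT);
INSERT INTO sales VALUES (1, 10, '2023-01-05', 100.0, 10.0, 'web'),
                         (2, 11, '2023-01-06', 250.0, 0.0, 'store');
INSERT INTO items VALUES (10, 'Widget'), (11, 'Gadget');
""")

# The "T" in miniature: join the tables, apply a rule (net price),
# and return only the fields the report needs as one block of data.
rows = con.execute("""
SELECT s.sale_date, i.name, s.price - s.discount AS net_price, s.channel
FROM sales s JOIN items i ON i.id = s.item_id
ORDER BY s.sale_date
""").fetchall()

for row in rows:
    print(row)
```

The user sees dates, item names, and net prices in one place, and never needs to know how many tables the source system used to store them.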
Stored Transformations Offer Stable View
Transformation is essential to BI. It offers a logical interpretation of the data that drives business. But it’s just as important to keep a record of those alterations to maintain a consistent, logical data view. Why? Because different people might have different interpretations of the same data. Let that sink in.
One sales manager might look at the numbers and see $15 million in sales, while another might see only $14 million. That’s a $1 million difference! One might have included discounts while the other didn’t. Or one might have deducted returns while the other didn’t. Stored transformations equip us with the necessary tools to enforce one version of the truth.
These stored transformations can help define that sales totals include deductions for returns and discounts. They can also define a concept of raw sales that does not include those deductions. Stockpiling transformations helps ensure consistency of interpretation.
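To make the two-managers scenario concrete, here is a minimal Python sketch. The rule names, fields, and dollar figures are invented, but the point holds: both totals come from the same rows, and storing the definitions as named transformations makes the difference explicit and enforceable.

```python
# Hypothetical order rows: same underlying data for both managers.
orders = [
    {"price": 10_000_000, "discount": 300_000, "returned": False},
    {"price": 4_500_000,  "discount": 200_000, "returned": False},
    {"price": 500_000,    "discount": 0,       "returned": True},
]

def raw_sales(rows):
    """Stored definition: gross sales, no deductions at all."""
    return sum(r["price"] for r in rows)

def net_sales(rows):
    """Stored definition: deduct discounts and exclude returns."""
    return sum(r["price"] - r["discount"] for r in rows if not r["returned"])

print(raw_sales(orders))  # the $15 million manager
print(net_sales(orders))  # the $14 million manager
```

With both definitions named and stored, the organization can report either figure deliberately instead of discovering the discrepancy in a meeting.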
You might notice that we refer to storing the transformation rather than the transformed data. It is not strictly necessary to store the transformed data. If the reporting data goes through the transformation, you can achieve the same results for current data, whether that transformation is done on the fly or as part of a traditional ETL process.
That being said, there are some real benefits to also storing the transformed data. Remember, this data has been converted to a format that is aligned with the organization’s use and its logical understanding of the data. When you store this transformed data in a secondary data store, it becomes a data mart or a data warehouse. You’ve arrived at the classical Extract, Transform, Load process, and storing that transformed data in a new reporting database has some real benefits.
Converted Data Eases Reporting
The transformed data is more efficient for reporting. When the data is converted as part of an ETL process, it can drastically reduce the processing needed to generate a report. Systems designed for day-to-day transactions are called online transaction processing (OLTP) databases. They separate the data into a multitude of tables to eliminate duplication, so queries against them often require many joins to arrive at the final dataset for a report. A data mart or data warehouse, by contrast, uses fewer tables, and those tables are aligned with business definitions. A report generated from a data warehouse structure, then, requires less processing to arrive at the final dataset.
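The contrast can be illustrated with a toy example, again using sqlite3. The three normalized tables below stand in for a (much larger) OLTP schema; the ETL step pre-joins them once into a single fact table, so the report query itself needs no joins at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for a normalized OLTP schema.
con.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, item_id INTEGER, amount REAL);
CREATE TABLE customers (id INTEGER, region TEXT);
CREATE TABLE items (id INTEGER, category TEXT);
INSERT INTO orders VALUES (1, 1, 1, 100.0), (2, 2, 1, 40.0);
INSERT INTO customers VALUES (1, 'East'), (2, 'West');
INSERT INTO items VALUES (1, 'Hardware');
""")

# The ETL "T" step: pay the join cost once, producing a
# reporting-friendly table aligned with business terms.
con.execute("""
CREATE TABLE fact_sales AS
SELECT o.id AS order_id, c.region, i.category, o.amount
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN items i ON i.id = o.item_id
""")

# The report itself is now a simple, join-free aggregate.
report = con.execute(
    "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
).fetchall()
print(report)  # [('East', 100.0), ('West', 40.0)]
```

In a real warehouse the fact table would be refreshed on a schedule, but the principle is the same: the joins happen at load time, not at report time.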
An independent reporting database improves availability. Reports often comb large swaths of data as they build their datasets. An OLTP system can be brought to its knees if a large report is run against its database during peak hours. It can cause that system to become unavailable for the normal day-to-day operations for minutes or even hours. For mission-critical systems, that kind of downtime is unacceptable. Any downtime might be unacceptable. If the reporting data is on a separate system, loaded periodically, each system can be optimized for its own needs.
A data mart or data warehouse is not limited by the source system. As a source system is upgraded or replaced, the reporting database can continue to feed carefully crafted reports with less interruption. Because it reports the logical business view rather than the nuts-and-bolts of the source system, a new source system can be integrated seamlessly with the existing data sourced from the older system. Likewise, it can maintain historical data that the source system may throw away. By being stored separately, it can become a more powerful source of information than the source system alone.
Flexibility in Hybrid Approach
There is a downside to the traditional data warehouse. What if the business definition changes? What if the business wants to capture a new aspect of the data? At those junctures, a data warehouse can only begin capturing the revisions after they’ve been discovered, coded, tested and deployed. With the advent of big data and data lakes, that need not be the case. Big data technologies allow for nearly unlimited storage and processing. A data lake leverages those technologies to store versioned copies of the raw source system’s data. When the source system is new or updated, those structural changes are incorporated into that period’s revision.
Don’t we lose all the benefits of a transformed data source with this data lake? Not necessarily. As mentioned previously, we keep all the revisions of the source data, so history is preserved. And, there’s nothing preventing a data lake from co-existing with a data warehouse or data mart. In fact, it’s entirely possible to build a data warehouse as a virtual layer on top of a data lake. Or, alternatively, have a data warehouse supplied by the data lake. What does that achieve? Well, at the expense of storage efficiency, it allows for the logical view, the business definitions, to change even retroactively. Since all the raw data is collected and stored with historical revisions, the data warehouse can be rebuilt from scratch without losing anything. Such an architecture offers the ultimate in flexibility and features.
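The retroactive rebuild described above can be sketched in a few lines of Python. Everything here is hypothetical (the periods, the rows, the rule functions), but it shows the key property: because the lake keeps the raw snapshots, changing the business rule and replaying history regenerates the entire warehouse under the new definition.

```python
# Toy data lake: versioned snapshots of raw source data, kept forever.
lake = {
    "2023-01": [{"price": 100.0, "discount": 10.0}],
    "2023-02": [{"price": 200.0, "discount": 0.0}],
}

def build_warehouse(rule):
    """Rebuild the whole warehouse by replaying every raw snapshot
    through the current business rule."""
    return {period: sum(rule(r) for r in rows) for period, rows in lake.items()}

# Original business definition: gross sales.
v1 = build_warehouse(lambda r: r["price"])

# Revised definition, applied retroactively to all history:
# sales net of discounts.
v2 = build_warehouse(lambda r: r["price"] - r["discount"])

print(v1)  # {'2023-01': 100.0, '2023-02': 200.0}
print(v2)  # {'2023-01': 90.0, '2023-02': 200.0}
```

The storage cost is real, since every raw revision is retained, but the payoff is that no definition change ever orphans historical data.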
Wow. Let’s take a step back for a moment. We just described the evolution of BI data provisioning. We started with transformations and arrived at architecture. In the beginning, there was the source system report. It was good but flawed. That evolved into mirrored reporting copies and quickly transformed into the data warehouse. That enabled more intense, efficient and business-oriented analysis, but technology isn’t content to stand still. The data lake looms to make the data warehouse flexible and rebuildable.
I hear one final whisper, “What of OLAP, then? Where does that fit?” Indeed, I would do well to address OLAP. On-line analytical processing databases are highly indexed databases designed to serve analytic queries involving lots of aggregate calculations. They’re often called cubes or tabular models. If a properly designed data warehouse is the source, then there is actually very little transformation in structure between the data warehouse and the OLAP database. What they do offer, however, besides query performance, are measures and key performance indicators (KPIs). In some sense, this is the capstone of business-oriented transformation. Measures are the final encoding of a business definition, and KPIs measure the goals of the business. With that, you have the business view of the data fully encapsulated in an analytic system.
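To ground the last step, here is a hypothetical sketch of a measure and a KPI in plain Python. This mimics the spirit of an OLAP semantic layer rather than any particular engine’s API: the measure encodes the business definition once, and the KPI compares the measure to a stated goal.

```python
# Invented sample data for illustration.
sales = [{"amount": 120.0}, {"amount": 80.0}]

def total_sales(rows):
    """Measure: the final, encoded business definition of sales."""
    return sum(r["amount"] for r in rows)

def sales_kpi(rows, target):
    """KPI: the measure evaluated against a business goal."""
    actual = total_sales(rows)
    return {
        "actual": actual,
        "target": target,
        "status": "on track" if actual >= target else "behind",
    }

print(sales_kpi(sales, target=150.0))
```

In a real cube or tabular model these definitions would live in the model itself, so every client tool that queries it gets the same answer.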
In the end, data transformation gives an organization the tools it needs to see what’s working and what isn’t. The ability to make decisions based on statistics, rather than educated guesses, is the first step in becoming a true data-driven business. Stay tuned for our next BI Basics post by subscribing to our blog.
If you need help transforming your organization, contact BlueGranite. Our expert team is eager to help.