Organizing Big Data

Emerging technologies like Hadoop are getting a lot of buzz in 2013.  Many of our clients are asking themselves (and our consultants) whether Hadoop and similar “Big Data” technologies have a place in their EDW/BI roadmap.

While no two organizations are exactly alike, there are some consistent evaluation points that can be considered.  Four near the top of the list are:

  1. Do analysts need access to more historical data than is now available in the EDW?
  2. Are there important information subject areas that cannot be adequately modeled in existing relational technologies?
  3. Do existing BI solutions provide only aggregate analysis, while users would benefit from the ability to analyze data at a fine level of detail?
  4. Would the business benefit if it could perform advanced analyses such as text mining and cluster analysis at large scale rather than using statistical sampling?

Let’s consider each of these factors in turn.

#1 More Historical Data

When planning a traditional EDW/BI strategy, one of the earliest requirements explored is “how much history does the database need to store?”  The reasons for this key consideration are probably obvious:

  • Cost: high-performance, enterprise SAN storage is a significant cost component for an EDW.
  • Performance: in most EDW and OLAP technologies a key driver for query response time is the volume of data loaded in the database.

Of course, some EDW subject areas would not benefit from extensive historical data.  Detailed data about marketing promotions that occurred ten years ago probably has little bearing on decisions made in the next few months.

Yet some analyses do benefit from detailed data collected over long timeframes.  Quality and failure analysis for products with long service lifetimes certainly benefits from extensive historical data; industrial products, medical devices and financial transactions are relevant examples.

When the business can benefit from keeping extensive history on-line and query-able, Hadoop is a compelling solution.  Hadoop’s distributed file system can expand to virtually unlimited size, and the storage cost per terabyte is a fraction of a typical enterprise SAN architecture.
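As a concrete sketch of what this looks like in practice, the classic pattern is to land the detailed history in HDFS and aggregate it with MapReduce.  The Hadoop Streaming job below, written in Python, counts failures per product across however many years of history are kept on-line; the tab-separated field layout and the specific fields are illustrative assumptions, not a prescription.

    #!/usr/bin/env python
    # ---- mapper.py ----
    # Emits (product_id, 1) for every failure record.
    # Assumes tab-separated rows: product_id, event_date, failure_flag (illustrative layout).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "1":
            print("%s\t1" % fields[0])

    #!/usr/bin/env python
    # ---- reducer.py ----
    # Sums failure counts per product_id; Hadoop Streaming delivers the mapper
    # output sorted by key, so consecutive lines with the same key form one group.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

The job would be launched with the standard hadoop-streaming jar, with -input pointing at the directories holding the historical detail; the same scripts run unchanged whether that history spans one year or ten.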

#2 Difficult to Model Subject Areas

The initial use case for data warehousing was structured, transactional data.  Virtually all the best practices, design patterns and technologies designed for the EDW/BI market assume that the information stored in a data warehouse will fit nicely into a disciplined entity relationship diagram.

Over the last 30 years, virtually all data warehouses have been designed to hold only tabular, relational data.  Why have we ignored “less structured” data sources for so long?  There are two primary factors:

  1. Organizations haven’t collected and stored unstructured data because it was too difficult and costly to do so.
  2. Most traditional data warehouse technologies provide no means to store and analyze unstructured data, so these data types are often not included in the design scope.

During the last decade, Internet companies like Google, Yahoo! and Facebook demonstrated to those of us in the data analysis business that there is, in fact, significant value in storing and analyzing unstructured data.  Not only that, they created the technology needed to store and analyze these workloads at any scale.  The combination of creating the technology and demonstrating its value has shone a light on the possibilities for organizations of every type.
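To make the idea concrete, the sketch below shows the “schema on read” style these platforms popularized: raw events are kept as JSON lines in the file system, and structure is applied only at analysis time.  The record layout and field names are invented for illustration.

    #!/usr/bin/env python
    # Schema-on-read sketch: flatten semi-structured clickstream events at read time.
    # Field names and nesting are illustrative assumptions, not a real schema.
    import json
    import sys

    for line in sys.stdin:
        event = json.loads(line)
        # Nested, optional attributes like these resist a fixed relational design.
        customer = event.get("customer", {}).get("id", "unknown")
        action = event.get("action", "unknown")
        terms = " ".join(event.get("search", {}).get("terms", []))
        print("\t".join([customer, event.get("timestamp", ""), action, terms]))

Because nothing about the JSON has to be declared up front, new attributes can appear in the source events without forcing a schema change; the cost is paid at read time instead.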

With a light pointed in the right direction in terms of use cases and technologies, it’s merely a matter of time before what are now called “Big Data” technologies become mainstream.

#3 Analysis at Leaf Level

Often we provide users with the ability to analyze aggregate information, but not the ability to “drill down” to the lowest level of detail.  Sometimes analysis “at leaf level” is excluded because it’s not necessary to support the decision-making process.  But not always.  Sometimes excluding leaf-level granularity is, again, a decision made for reasons of storage cost, manageability, or because the compute infrastructure lacks the processing power to support queries over such large volumes.

While aggregating data helps time series analysis and the identification of macro trends, it necessarily removes context from the information.  We lose the insight that comes from the series of interactions an individual customer had with us before selecting a set of products or services.

By following a “store everything” approach and cross-referencing data in multiple subject areas at fine grain, that context can be completely restored.
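As a small sketch of what restoring that context might look like, the script below walks leaf-level interaction events and reconstructs the path each customer followed before a purchase.  The tab-separated layout, the event names and the assumption that input arrives sorted by customer and timestamp are all illustrative.

    #!/usr/bin/env python
    # Rebuild per-customer context from leaf-level events.
    # Assumes tab-separated rows of (customer_id, timestamp, event_type, detail),
    # already sorted by customer and timestamp -- the field layout is illustrative.
    import sys
    from itertools import groupby

    def parse(line):
        customer_id, timestamp, event_type, detail = line.rstrip("\n").split("\t")
        return customer_id, timestamp, event_type, detail

    rows = (parse(line) for line in sys.stdin)
    for customer_id, events in groupby(rows, key=lambda r: r[0]):
        path = []
        for _, timestamp, event_type, detail in events:
            path.append(event_type)
            if event_type == "purchase":
                # The click path preceding this purchase is exactly the context
                # that aggregation would have discarded.
                print("%s\t%s\t%s" % (customer_id, detail, " > ".join(path)))
                path = []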

#4 Advanced Analysis

The fourth factor to consider is whether the organization would benefit from incorporating more advanced types of analysis into its Business Intelligence capabilities.

Typically, when an organization matures from having no BI capabilities to having sophisticated ones, it starts from a “we know the data is out there but can’t analyze it” beginning.  Later, as data is consolidated and properly indexed, it can easily be “sliced and diced”, but users begin to feel there are still insights in the data that cannot be gleaned through aggregated statistical measurements.

More advanced forms of analysis, such as predictive analytics and machine learning, invert that process: rather than relying on an analyst to ask the right question, they apply computational power to large volumes of information to detect patterns in the data.

Often the volume of data needed for effective machine learning is vast, and the types of information needed span both structured and unstructured sources.  Most modern BI systems have some predictive analytics capabilities, but they usually can’t address the volume and variety of data that can be processed by Big Data systems like Hadoop and its related technologies.
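As a small-scale sketch of the kind of analysis being described, the snippet below combines text mining (TF-IDF) with cluster analysis (k-means) over a file of free-text notes using scikit-learn.  At Big Data volumes the same idea would run on the cluster itself, for example with Apache Mahout in the Hadoop ecosystem; the file name, its contents and the choice of eight clusters are assumptions for illustration.

    #!/usr/bin/env python
    # Text mining + cluster analysis sketch: vectorize free-text notes with TF-IDF,
    # then group them with k-means.  File name and cluster count are illustrative.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    with open("warranty_claim_notes.txt") as f:     # one free-text note per line
        notes = [line.strip() for line in f if line.strip()]

    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(notes)

    model = KMeans(n_clusters=8, random_state=0).fit(features)
    for cluster_id, note in zip(model.labels_[:10], notes[:10]):
        print(cluster_id, note[:60])

Run over the full corpus rather than a statistical sample, an analysis like this can surface recurring failure themes that aggregate reporting would never expose.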

Summary

Big Data technologies are transitioning from their emerging, experimental roots and are now being adopted by early-adopter commercial enterprises.  The most important question on everyone’s mind: “Are Big Data technologies like Hadoop relevant to my organization?”

While the four factors outlined in this article are not an exhaustive survey of everything that goes into answering this important question, they are at the core of the decision-making process.

By weighing questions like these, benchmarking against organizations in similar industries and assessing specific organizational needs, an overall Big Data strategy and roadmap can be systematically designed for any organization.