Aug 18, 2015

Demo Day: Simplify Analysis of Big Data with Spark on Azure HDInsight

Posted by David Eldersveld

Some of the key tasks in data science involve basic exploration of new or existing data. Raw data is given structure, datasets are joined to one another, features are selected for later analysis, and much more. Depending on the questions you want to answer and your other requirements, the process repeats until the data is suitable for further, more advanced analytics.


With Apache Spark on Azure HDInsight, these core tasks are made simpler by the inclusion of both Apache Zeppelin and Jupyter notebooks. In this Demo Day video, I walk through basic exploration of a city's traffic crash history using Zeppelin with both Spark DataFrames and Spark SQL. I discuss some of the advantages of using Zeppelin and Spark for data of any volume. Working with a new text file, I take an initial look at what features are available, see what cleansing may need to take place, and get a basic feel for the dataset through querying and visualization. At this stage, I compute summary statistics and develop a repeatable process that can be used later. This is descriptive analysis, but it also raises the question of how the data can be prepared for other applications such as predictive analytics.
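To make that first look concrete, here is a minimal sketch of the same steps: list the columns, count missing values, compute summary statistics, and run a grouped count. It is written in plain Python with only the standard library, not in Spark or Zeppelin, so it is a small stand-in for the workflow rather than the code used in the video. The column names and sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Hypothetical sample standing in for the crash text file. The column
# names and values are invented for illustration only.
SAMPLE = """crash_id,road_type,weather,injuries
1,freeway,snow,0
2,two_lane,clear,1
3,freeway,clear,
4,two_lane,snow,2
5,freeway,snow,1
"""


def load_rows(text):
    """Parse delimited text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty values per column to see what cleansing is needed."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] += 1
    return dict(counts)


def summarize(rows, column):
    """Basic summary statistics for a numeric column, skipping blanks."""
    values = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def count_by(rows, column):
    """Similar in spirit to SELECT column, COUNT(*) ... GROUP BY column."""
    return dict(Counter(r[column] for r in rows))


rows = load_rows(SAMPLE)
columns = list(rows[0].keys())          # which features are available
missing = missing_counts(rows)          # where cleansing may be needed
injury_stats = summarize(rows, "injuries")
by_weather = count_by(rows, "weather")

print(columns)
print(missing)
print(injury_stats)
print(by_weather)
```

Keeping each step in its own small function is one way to get the repeatable process mentioned above: the same functions can be rerun unchanged when a new extract of the file arrives.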

Overall, I can use the data to help bring me closer to answering my initial questions as well as prompt new questions.  For example:

  • Weather impacts road conditions. During a snowstorm, am I usually safer taking a two-lane road or a freeway? Freeways may have more accidents overall, but they also have a much higher traffic volume. Factoring in a road's average daily traffic, do accidents during snow increase at similar rates for all road types, or do they increase at all?
  • College football home games increase traffic congestion.  Is there an increase in accidents that correlates with that congestion?  Do accidents on game days take place along main corridors to the stadium, or are they dispersed throughout the city?
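The snow question above depends on comparing rates rather than raw counts. As a minimal sketch of that normalization, the Python below divides crash counts by exposure (average daily traffic multiplied by the number of days) and then compares the snow rate to the clear-weather rate for each road type. All figures and names are invented for illustration, and the sketch assumes traffic volume is the same on snow days as on clear days, which real data may not support.

```python
def crashes_per_million_vehicles(crashes, avg_daily_traffic, days):
    """Crash rate normalized by exposure (vehicles per day x days)."""
    exposure = avg_daily_traffic * days
    return crashes / exposure * 1_000_000


# Hypothetical figures: average daily traffic (adt) plus
# (crash count, number of days) under each weather condition.
ROADS = {
    "freeway": {"adt": 80_000, "clear": (400, 300), "snow": (60, 20)},
    "two_lane": {"adt": 8_000, "clear": (60, 300), "snow": (8, 20)},
}


def rates(road):
    """Return the (clear, snow) crash rates for one road type."""
    info = ROADS[road]
    clear = crashes_per_million_vehicles(info["clear"][0], info["adt"], info["clear"][1])
    snow = crashes_per_million_vehicles(info["snow"][0], info["adt"], info["snow"][1])
    return clear, snow


def snow_rate_ratio(road):
    """How many times higher the crash rate is in snow than in clear weather."""
    clear, snow = rates(road)
    return snow / clear


for road in ROADS:
    clear, snow = rates(road)
    print(f"{road}: clear={clear:.2f} snow={snow:.2f} ratio={snow_rate_ratio(road):.2f}")
```

In this invented data the freeway has far more crashes in total, yet its rate per million vehicles is lower than the two-lane road's in both conditions, which is exactly why the raw counts alone cannot answer the question.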

View the video below to see how the Zeppelin notebook on a Spark on Azure HDInsight cluster can help me get answers.

Want to learn more about how data science in Azure can help your business?  Contact us for a consultation.


About The Author

David Eldersveld

David is a former BlueGranite Solution Architect and current Microsoft MVP who has employed skills in technology development, data integration, data analysis, and systems analysis for over ten years. David enjoys building BI and advanced analytics solutions with technologies in Microsoft Azure and the Power Platform. He is active in various technical communities. In addition to blogging for BlueGranite, he also writes at
