
Research Organization Scales Collaborative Research with the Creation of a Genomics Data Lake



When a $30M research endeavor plans to create over 400TB of multi-omics data, the cloud is the obvious option for scale and performance. A large research organization in the Southeastern U.S. partnered with BlueGranite to provision a secure environment to house its genetic data. Built using Azure Data Lake and Azure Data Factory, the environment collects data from constituent research groups and allows for secure management and control of the data assets in the data lake. Future enhancements will include scalable analyses using Azure Databricks and Azure Machine Learning to gain insights from this massive amount of human health information.

The Challenge

Because parts of this research were being performed by individual research institutions, collecting all of the data in one place and collaborating on it was important. Specifically, separate institutions were responsible for individual -omics analyses and needed a place to house these results for future analysis. Because this data would amass a 400TB footprint over the next couple of years, a scalable solution was paramount.

This research project contained genomic data from military personnel, making the data controlled unclassified information (CUI). Security was therefore an important factor in our architectural design. One specific requirement was that the overall data platform architecture in Azure be compliant with NIST 800-171 standards.

The Solution

Because this solution involved sensitive data with strict security compliance requirements, we used the Azure Government cloud. This specialized version of Azure comes with built-in security, protection, and compliance capabilities, which allowed us to ensure that the solution met NIST 800-171 standards.

BlueGranite designed a genomics data lake-centric architecture using Azure Data Lake and Azure Data Factory. This architecture used Data Factory to copy or move data from external source systems into the client’s Azure tenant. In addition, Azure Data Lake provided a scalable data repository for housing the heterogeneous -omics data sources from the individual research groups.

Finally, role-based access control (RBAC) in Azure Data Lake allowed fine-grained permissions to be set on directories in the data lake, so individuals can see and contribute to only specific sections of the data.
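To make the idea of directory-scoped permissions concrete, here is a minimal Python sketch. It is a simplified model only, not the client's configuration and not how Azure evaluates access: the directory names, group names, and the "closest granted ancestor directory wins" rule are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical directory-level grants: directory -> {group: permissions}.
# "r" = read, "w" = write. Every name below is illustrative only.
GRANTS = {
    "/omics/genomics": {"genomics-lab": "rw", "analysts": "r"},
    "/omics/proteomics": {"proteomics-lab": "rw", "analysts": "r"},
}


def is_allowed(group: str, path: str, permission: str) -> bool:
    """Check `permission` for `group` on the closest granted ancestor of `path`.

    Simplified assumption: the nearest directory that has any grants decides
    the outcome, and paths outside every granted directory are denied.
    """
    target = PurePosixPath(path)
    for directory in [target, *target.parents]:
        grants = GRANTS.get(str(directory))
        if grants is not None:
            return permission in grants.get(group, "")
    return False
```

In this sketch, a call such as is_allowed("analysts", "/omics/genomics/run1/sample.vcf", "w") is denied because the hypothetical analysts group holds read-only access on the genomics directory, while the genomics-lab group can both read and write there.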


The Results

In this first engagement, BlueGranite successfully created a data platform architecture that supported the collaboration of individual research groups and provided a scalable solution for housing large amounts of data. In future phases, additions to the Azure architecture will include the use of Azure Databricks and Azure Machine Learning for performing secondary and tertiary analyses on the -omics data along with machine learning and biostatistical workloads.

If your organization is ready to delve into the wide world of modern data platform and analytics solutions for genomics, we would love to be your guide! Contact us today.


