When a $30M research endeavor plans to create over 400 TB of multi-omics data, the cloud is the obvious option for scale and performance. A large research organization based in the Southeastern U.S. partnered with BlueGranite to provision a secure environment to house its genetic data. Built using Azure Data Lake and Azure Data Factory, the solution collects data from constituent research groups and allows for secure management and control of the data assets in the data lake. Future enhancements will include scalable analyses using Azure Databricks to gain insights from this massive amount of human health information.
Collaborative Framework: Working with academic research groups and enterprise IT architecture, we created a solution for uploading and using data while retaining security.
Solution Design and Security: The team worked within the constraints of the Azure Government Cloud to create a scalable genomics data lake while ensuring NIST 800-171 compliance for data security.
Data Lake-Centric: For this solution, exome sequences along with phenotypes, proteomics, methylomics, and more needed to be logically organized for future cohort-based analyses. By using Azure Data Lake, the heterogeneous data was organized and cataloged at scale.
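As a rough illustration of what "logically organized for future cohort-based analyses" can look like, the sketch below builds folder paths that group files by data type, then cohort, then sample. The zone name, the modality list, and the `cohort=`/`sample=` folder convention are assumptions made for this example, not the layout used in the project described above.

```python
from pathlib import PurePosixPath

# Hypothetical modality names, chosen only for illustration.
MODALITIES = {"exome", "phenotype", "proteomics", "methylomics"}


def lake_path(zone, modality, cohort, sample_id, filename):
    """Build a data lake path grouped by modality, then cohort, then sample."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    return str(
        PurePosixPath(
            zone,
            modality,
            f"cohort={cohort}",
            f"sample={sample_id}",
            filename,
        )
    )


# Example values are made up.
print(lake_path("raw", "exome", "cohort-a", "S001", "S001.vcf.gz"))
```

Keeping the cohort and sample identifiers in the folder names means a later cohort-based analysis can select its inputs by path prefix instead of opening every file.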
Explore the BlueGranite team’s insightful blog posts on bioinformatics, genomics, and life science topics below:
Building a Genomics Data Lake in Azure
Scalability in genomics starts with a performant, secure, and collaborative space to store your data. In this eBook, we cover the ideas around building a data lake for your genomics data, including organization, security, and automation of analyses.
Easily copy your data from your BaseSpace account over to your Genomics Data Lake in Azure. This automated approach for retrieving your project samples, analysis outputs, and other datasets unlocks the ability to take advantage of the Azure cloud for secondary and tertiary analyses, machine learning, and more.
|Data|File types|
|Analysis Results|.bam, .vcf|
|Other Datasets|.csv, logs, etc.|
Illumina® and BaseSpace® are registered trademarks of Illumina, Inc.
Neither BlueGranite nor this data connector is affiliated with or endorsed by Illumina.
Massively scalable, fast, and collaborative Apache Spark™-based analytics service.
Scalable workspaces for machine learning and bioinformatics experiments.
Serve scalable compute resources in Docker containers of virtually any application.
Prebuilt virtual machine image with pre-installed software for bioinformatics and ML.
Automated GATK-compliant pipeline for sequence alignment and annotation.
Power BI for Bioinformatics
Create interactive dashboards and reports of your genomics data with Power BI. By using our expertise coupled with some Power Query magic, we can read and visualize all sorts of files that are common in bioinformatics.
This enables users to take advantage of information in files such as .FASTQ, .BAM, .VCF, and .GFF. Also, you can now import data from virtually any site such as the Protein Data Bank, NCBI, PlasmoDB, and more.
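To give a sense of what reading one of these formats involves, here is a minimal Python sketch that turns the fixed columns of a VCF file into table-like rows. It is not the Power Query approach mentioned above. It assumes uncompressed, tab-delimited VCF text and ignores per-sample genotype columns, and the function name and sample record are made up for the example.

```python
import io

# The eight fixed columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf_records(handle):
    """Return one dict per variant line, skipping header and blank lines."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # '##' meta lines and the '#CHROM' header row
        fields = line.split("\t")
        row = dict(zip(VCF_COLUMNS, fields[:8]))
        row["POS"] = int(row["POS"])  # positions are 1-based integers
        records.append(row)
    return records


# Made-up sample content for illustration only.
sample = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\t.\tG\tA\t29\tPASS\tDP=14\n"
)
rows = read_vcf_records(io.StringIO(sample))
print(rows[0]["CHROM"], rows[0]["POS"], rows[0]["REF"], rows[0]["ALT"])
```

Rows in this shape can be loaded into a reporting tool as an ordinary table, one variant per row.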
Colby is BlueGranite’s Principal of Life Sciences. He helps clients in this space envision solutions that advance data management and production, speed insight delivery, and improve business outcomes. Colby's specialties include AI in genomics, phylogenetics, protein structure modeling, and the design of scalable bioinformatics pipelines in Azure.