Reading Bioinformatics and Genomics Files in Power BI

For many users in the finance, insurance, retail, manufacturing, and even healthcare industries, Power BI has become a staple in their business intelligence plan. From interactive visualizations to advanced data wrangling, Power BI offers a one-stop shop for gaining insights from your data.

However, for any of us that are researchers in the bioinformatics and genomics space, we know that our files can be a bit difficult to work with. From FASTA to BAM, working with files in bioinformatics add a layer of uniqueness that requires some special care.

Today, if you take a look at Power BI Desktop’s options for getting data, you’ll see a ton of sources to which you can easily connect. One problem: none of these uniquely help us bioinformaticians.

Putting the “Power” in Power Query

In bioinformatics, there are a plethora of file types for every occasion. Among these are very popular ones such as FASTA (or FASTQ) and BAM and, more recently, GFF3 and BGEN. We can break these data sources down into three main types:

Text-Based Files

Files that are human readable and can be open using virtually any text editor.

Binary Files

Files that are serialized and must be read by machines.

Online Sources

Databases, webpages, or FTP sites on the internet.

In Power BI, we can take advantage of Power Query to read in data and parse it appropriately. You’ll notice that, while Power BI has tons of connects to everything from CSV files to Spark clusters, there are no built-in connectors for our beloved genomics file types (yet?). So, we’ll have to use the Blank Query editor.

Text Files

In the query editor, we can write a custom M script to parse our files. For example, to parse a SAM file:

// Read SAM Files

let
    // Read in file
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:UsersColbyDocumentsGitHubbioPowerBIbam_and_samsample.sam"), null, null, 65001)}),
    // Skip @ lines
    #"Filtered Rows" = Table.SelectRows(Source, each not Text.StartsWith([Column1], "@")),
    // Split into columns by t character and assign names
    // Note: This removes and values past the 11 standard columns
    #"Split Column by Delimiter" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByDelimiter("#(tab)", QuoteStyle.Csv), {"QNAME","FLAG","RNAME","POS","MAPQ","CIGAR","RNEXT","PNEXT","TLEN","SEQ","QUAL"}),
    // Change data types
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"QNAME", type text}, {"FLAG", Int64.Type}, {"RNAME", type text}, {"POS", Int64.Type}, {"MAPQ", Int64.Type}, {"CIGAR", type text}, {"RNEXT", type text}, {"PNEXT", Int64.Type}, {"TLEN", Int64.Type}, {"SEQ", type text}, {"QUAL", type text}})

in
    #"Changed Type"

Binary Files

We can use R or Python to read in those pesky binary files as well. (This sometimes make even parsing text files simpler, too.) For example, to parse a BGEN file:

// Read BGEN Files

let
    // Use rbgen in an R Script to get the data from the .bgen file
    Source = R.Execute("library(rbgen)#(lf)file <- ""C:\Users\Colby Ford\Desktop\bioPowerBI\bgen\sample.bgen""#(lf)dataset <- as.data.frame(bgen.load(file))"),
    sample = Source{[Name="dataset"]}[Value]
in
    sample

…which easily makes the file available as a table.

Online Sources

Lastly, Power BI makes it easy to grab data from the web. For example, let’s say that I wanted to get a list of all annotated genes of SARS-CoV-2 from NCBI.

I can quickly grab the URL from my browser and paste it into Power BI, which will then search the page for any tables of information.

This enables users to take advantage of data from virtually any site. Try it out on the Protein Data Bank, NCBI, PlasmoDB, and more!

Takeaway Messages

Be mindful of memory. Bioinformatics files can be large and, if you’re running on a machine with limited resources, you might bog it down.
Check the defined specifications of any file format you’re looking to parse.
R or Python can be your BFF, especially for binary or really complex file types.

Demo Video

Resources

All code used in above demos and additional examples are available at: www.github.com/BlueGranite/bioPowerBI

If you’d like to learn more about Power BI and how it can help you, contact 3Cloud today.

Industries

Reading Bioinformatics and Genomics Files in Power BI

Putting the “Power” in Power Query

Text Files

Binary Files

Online Sources

Takeaway Messages

Demo Video

Resources

Related Articles

Data & AI Roadmap

Healthcare AI Use Cases

AI Strategy Roadmap: Navigating the Stages of AI Value Creation

Your Cloud Transformation Journey Starts Here