Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets

Alyssa Kramer Morrow, George Zhixuan He, Frank Austin Nothaft, Eric Tongching Tu, Justin Paschall, Nir Yosef, Anthony Douglas Joseph

Research output: Contribution to journal › Article › peer-review

Abstract

The decreasing cost of DNA sequencing over the past decade has led to an explosion of sequencing datasets, leaving us with petabytes of data to analyze. However, current sequencing visualization tools are designed to run on single machines, which limits their scalability and interactivity on modern genomic datasets. Here, we present Mango, which consists of a Jupyter notebook and a genome browser built on Apache Spark. Mango removes these scalability and interactivity constraints by using multi-node compute clusters to allow interactive analysis over terabytes of sequencing data. We demonstrate the scalability of the Mango tools by performing quality control analyses on 10 terabytes of data from 100 high-coverage sequencing samples from the Simons Genome Diversity Project, enabling interactive genomic exploration of multi-sample datasets that exceed the computational limits of single-node visualization tools. Mango is freely available for download with full documentation at https://bdg-mango.readthedocs.io/en/latest/.

The decreasing cost of DNA sequencing has produced petabytes of sequencing data from which analysts in research and clinical settings can develop data-driven hypotheses. Mango is a sequence visualization tool that leverages multi-node compute clusters to allow interactive analysis over large sequencing datasets. It provides a genome browser graphical user interface and a Python notebook form factor so that users of varying analytical experience can explore large sequencing datasets.

Original language: English
Pages (from-to): 609-613.e3
Journal: Cell Systems
Volume: 9
Issue number: 6
State: Published - 18 Dec 2019
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Pathology and Forensic Medicine
  • Cell Biology
  • Histology
