Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes and do not scale well to larger reference genome collections. To address this issue, the current project developed FLINT, a metagenomics profiling pipeline that is built on top of the Apache Spark framework and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. FLINT takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. FLINT runs on Amazon's Elastic MapReduce service and is able to profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. (publisher abstract modified)
Large Scale Microbiome Profiling in the Cloud
NCJ Number
254079
Journal
Bioinformatics Volume: 35 Issue: 14 Dated: 2019 Pages: 113-122
Date Published
2019
Length
10 pages
Annotation
Since large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset, this article reports on a project that developed a scalable, efficient, and affordable approach that brings big data solutions within the reach of laboratories with modest resources.
Abstract