PASHA - Parallelized short read assembly

Introduction

PASHA is a parallel short read assembler for large genomes using de Bruijn graphs. Taking advantage of both shared-memory multi-core CPUs and distributed-memory compute clusters, PASHA has demonstrated its potential to perform high-quality de novo assembly of large genomes in reasonable time with modest computing resources. Our evaluation using three small real paired-end datasets shows that PASHA produces better assemblies, with comparable genome coverage and mis-assembly rates, than three leading assemblers: Velvet, ABySS and SOAPdenovo. Moreover, PASHA achieves the fastest speed for all three datasets on a single CPU. For the human genome, PASHA achieves assembly quality competitive with ABySS and completes the assembly in about 21 hours, which is about 2.38 times faster than ABySS on the same hardware configuration.

Downloads

Latest source code (v1.0.10)
More details about the changes in this version are available in the Change Log.

Citation

Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "Parallelized short read assembly of large genomes using de Bruijn graphs". BMC Bioinformatics, 2011, 12:354

Other related papers

Tony Pan, Patrick Flick, Chirag Jain, Yongchao Liu and Srinivas Aluru: "Kmerind: A flexible parallel library for k-mer indexing of biological sequences on distributed memory systems". 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2016), 2016, pp. 422-433

Parameters

Installation and Usage

Preparation

PASHA comprises three executable binaries: pasha-kmergen, pasha-pregraph and pasha-graph. The three programs must be executed in this order to perform an assembly, and the assembly results are stored in DIRECTORY. Before compiling PASHA, you need to install Intel Threading Building Blocks (TBB) on your system. The paths to the header files for Intel TBB and MPI can be modified in the Makefile.

NOTE: Newer GCC versions (possibly from GCC 4.4 onward) have changed the relative path of the header file "<ext/hash_fun.h>" to "<backward/hash_fun.h>". To stay compatible with these versions, the macro "#define HASH_FUN_H <ext/hash_fun.h>" in the file "google/sparsehash/sparseconfig.h" has been changed to "#define HASH_FUN_H <backward/hash_fun.h>". Users of older GCC versions can manually change the macro back.
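For users on an older GCC, the manual change described in the note can be sketched as a small Python helper. This is only an illustration: the function name use_legacy_hash_header is hypothetical, and the exact whitespace of the macro line in sparseconfig.h is an assumption, which is why the pattern tolerates any spacing.

```python
import re
from pathlib import Path

# Matches the macro as described in the note, tolerating variable spacing.
_NEW_MACRO = re.compile(r"(#define\s+HASH_FUN_H\s+)<backward/hash_fun\.h>")


def use_legacy_hash_header(config_text):
    """Point HASH_FUN_H back at <ext/hash_fun.h> for older GCC versions."""
    return _NEW_MACRO.sub(r"\g<1><ext/hash_fun.h>", config_text)


if __name__ == "__main__":
    # Path taken from the note; run from the PASHA source directory.
    config = Path("google/sparsehash/sparseconfig.h")
    if config.exists():
        config.write_text(use_legacy_hash_header(config.read_text()))
```

Text that does not contain the macro is returned unchanged, so running the helper twice is harmless.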

Synopsis

pasha-kmergen DIRECTORY -fasta in.fasta
pasha-pregraph DIRECTORY -fasta in.fasta
pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
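The three-stage order in the synopsis can be wrapped in a small driver script. The following Python sketch is not part of PASHA: the function names build_pipeline and run_pipeline are hypothetical, and it uses only the options shown in this document. By default it prints the commands without executing them.

```python
import shlex
import subprocess


def build_pipeline(directory, reads, paired, kmer_size=None):
    """Return the three PASHA commands, in execution order, as argument lists."""
    kmergen = ["pasha-kmergen", directory, "-fasta", reads]
    if kmer_size is not None:
        kmergen += ["-k", str(kmer_size)]
    pregraph = ["pasha-pregraph", directory, "-fasta", reads]
    graph = ["pasha-graph", directory, "-fastaPaired", paired[0], paired[1]]
    return [kmergen, pregraph, graph]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one stage fails,
            # so a later stage never runs on incomplete results.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "out_dir" is a placeholder for DIRECTORY.
    run_pipeline(build_pipeline("out_dir", "in.fasta", ("in_1.fasta", "in_2.fasta")))
```

Passing dry_run=False runs the stages for real, which requires the three binaries to be on the PATH.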

Typical Usage for PASHA-KMERGEN

This program accepts FASTA and FASTQ input formats to generate k-mers. The following are the typical commands:

pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE
pasha-kmergen DIRECTORY -fasta in.fasta -fasta in2.fasta -k KMER_SIZE
pasha-kmergen DIRECTORY -fasta in.fasta -fastq in2.fastq -k KMER_SIZE
mpirun -np 8 pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE

Note: Since each pasha-kmergen process uses two threads, we recommend that the number of processes launched on a single node be no more than half the number of CPU cores in that node.
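The rule in the note above reduces to simple arithmetic, sketched here in Python. The helper name recommended_processes is hypothetical. Note also that os.cpu_count() reports logical CPUs, which may exceed the number of physical cores on machines with hyper-threading, so treat the printed value as an upper bound.

```python
import os


def recommended_processes(cores_per_node):
    """Largest per-node process count that keeps two threads per process within the cores."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    # Each pasha-kmergen process runs two threads, so use at most half the cores,
    # but never fewer than one process.
    return max(1, cores_per_node // 2)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    processes = recommended_processes(cores)
    print(f"mpirun -np {processes} pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE")
```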

Typical Usage for PASHA-PREGRAPH

This program uses the k-mers generated by pasha-kmergen to build a preliminary de Bruijn graph. This preliminary graph is then simplified by removing tips and low-coverage paths. The following are the typical commands:

pasha-pregraph DIRECTORY -fasta in.fasta
pasha-pregraph DIRECTORY -fasta in.fasta -fasta in2.fasta
pasha-pregraph DIRECTORY -fasta in.fasta -fastq in2.fastq
mpirun -np 8 pasha-pregraph DIRECTORY -fasta in.fasta

Note: When using multiple MPI processes, pasha-pregraph must be run with the same number of processes as pasha-kmergen.
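One way to avoid a mismatch is to derive both commands from a single process count. The sketch below does this in Python; the function name mpi_stage_commands is hypothetical, and the commands are only built, not executed.

```python
def mpi_stage_commands(directory, reads, num_processes, kmer_size):
    """Build the pasha-kmergen and pasha-pregraph commands with one shared -np value."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # Both stages reuse the same launcher prefix, so the counts cannot diverge.
    launcher = ["mpirun", "-np", str(num_processes)]
    kmergen = launcher + ["pasha-kmergen", directory, "-fasta", reads, "-k", str(kmer_size)]
    pregraph = launcher + ["pasha-pregraph", directory, "-fasta", reads]
    return kmergen, pregraph


if __name__ == "__main__":
    # The k-mer size 21 and "out_dir" are placeholder values for illustration.
    for command in mpi_stage_commands("out_dir", "in.fasta", 8, 21):
        print(" ".join(command))
```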

Typical Usage for PASHA-GRAPH

This program merges bubbles, generates contigs and performs scaffolding. Users who do not want to perform scaffolding must use the "-fasta" or "-fastq" options to specify the input reads. With these two options, the input reads are treated as single-end reads, even if they are paired-end reads interleaved in a single file. The following are typical commands without scaffolding:

pasha-graph DIRECTORY -fasta in.fasta
pasha-graph DIRECTORY -fasta in.fasta -fasta in2.fasta
pasha-graph DIRECTORY -fasta in.fasta -fastq in2.fastq
pasha-graph DIRECTORY -fasta in.fasta -numthreads 4

Users who want to perform scaffolding must use the "-fastaPaired", "-fastqPaired", "-fastaPairedFile", or "-fastqPairedFile" options to specify the input reads. For the "-fastaPaired" and "-fastqPaired" options, the paired-end reads are stored in two separate files; for the "-fastaPairedFile" and "-fastqPairedFile" options, the paired-end reads are interleaved in a single file. Please note that each option specifies one library, meaning that the paired-end reads from the same library must either be stored in two separate files or be interleaved in a single file. The following are typical commands with scaffolding:

pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta -fastaPaired in2_1.fasta in2_2.fasta
pasha-graph DIRECTORY -fastqPaired in_1.fastq in_2.fastq -numthreads 4
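The mapping from a read library to the matching pasha-graph option can be sketched as follows. The function names library_options and graph_command are hypothetical, and the sketch assumes that the "-fastaPairedFile" and "-fastqPairedFile" options take the interleaved file path as their single argument, which the text implies but does not show.

```python
def library_options(fmt, files, scaffold=True):
    """Return the pasha-graph options for one read library.

    fmt is "fasta" or "fastq"; files holds one or two paths.
    """
    if fmt not in ("fasta", "fastq"):
        raise ValueError("fmt must be 'fasta' or 'fastq'")
    files = list(files)
    if not files or len(files) > 2:
        raise ValueError("a library is one interleaved file or two separate files")
    if not scaffold:
        # Single-end mode: every file gets its own -fasta / -fastq option.
        options = []
        for path in files:
            options += ["-" + fmt, path]
        return options
    if len(files) == 2:
        # Two separate files holding the two ends of each pair.
        return ["-" + fmt + "Paired", files[0], files[1]]
    # One file with the pairs interleaved (argument form is an assumption).
    return ["-" + fmt + "PairedFile", files[0]]


def graph_command(directory, libraries, num_threads=None, scaffold=True):
    """Build a pasha-graph command from (fmt, files) library descriptions."""
    command = ["pasha-graph", directory]
    for fmt, files in libraries:
        command += library_options(fmt, files, scaffold)
    if num_threads is not None:
        command += ["-numthreads", str(num_threads)]
    return command


if __name__ == "__main__":
    libraries = [("fastq", ["in_1.fastq", "in_2.fastq"])]
    print(" ".join(graph_command("DIRECTORY", libraries, num_threads=4)))
```

Keeping one (fmt, files) entry per library mirrors the rule that each option specifies exactly one library.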

Change Log

October 11, 2013 (v1.0.10)
1. Fixed a bug in scaffolding that resulted in 4 instances of array-out-of-bounds accesses, caused by overly large limits in for-loops. We thank Lech Nieroda from the University of Cologne for providing a patch for our program.
June 20, 2013 (v1.0.9)
1. Fixed a bug in contig scaffolding and made some changes to reduce the mis-assembly rate of scaffolds.
September 01, 2012 (v1.0.6)
1. Fixed a bug when using multiple paired-end libraries with different insert sizes for scaffolding.
August 29, 2012 (v1.0.5)
1. Fixed a bug when estimating insert sizes for paired-end reads. This bug can cause the program to crash at times.
September 22, 2011 (v1.0.3)
1. Added support for compressed FASTA and FASTQ formats, using zlib to open and read the input short reads.

Contact

If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.
