Introduction
PASHA is a parallel short read assembler for large genomes using de Bruijn graphs. Taking advantage of both shared-memory multi-core CPUs and distributed-memory compute clusters, PASHA has demonstrated its potential to perform high-quality de novo assembly of large genomes in reasonable time with modest computing resources. Our evaluation using three small real paired-end datasets shows that PASHA produces better assemblies, with comparable genome coverage and mis-assembly rates, than three leading assemblers: Velvet, ABySS and SOAPdenovo. Moreover, PASHA achieves the fastest speed for all three datasets on a single CPU. For the human genome, PASHA achieves assembly quality competitive with ABySS and completes the assembly in about 21 hours, which is about 2.38 times faster than ABySS on the same hardware configuration.
Downloads
- Latest source code (v1.0.10)
More details about the changes in this version are available in the changelog.
Citation
- Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "Parallelized short read assembly of large genomes using de Bruijn graphs". BMC Bioinformatics, 2011, 12:354
Other related papers
- Tony Pan, Patrick Flick, Chirag Jain, Yongchao Liu and Srinivas Aluru: "Kmerind: A flexible parallel library for k-mer indexing of biological sequences on distributed memory systems". 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2016), 2016, pp. 422-433
Installation and Usage
Preparation
PASHA comprises three executable binaries: pasha-kmergen, pasha-pregraph and pasha-graph. These three programs are executed in this order to perform the assembly, and the assembly results are stored in DIRECTORY. Before compiling PASHA, you need to install Intel Threading Building Blocks (TBB) on your system. The paths to the header files for Intel TBB and MPI can be modified in the Makefile.
NOTE: Newer GCC versions (from GCC 4.4?) have changed the relative path of the header file "<ext/hash_fun.h>" to "<backward/hash_fun.h>". I have changed the macro "#define HASH_FUN_H <ext/hash_fun.h>" in the file "google/sparsehash/sparseconfig.h" to "#define HASH_FUN_H <backward/hash_fun.h>" to remain compatible with the newer GCC versions. With older versions, users can manually change the macro back.
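For older GCC versions, changing the macro back can be done with a single substitution. The following is a minimal sketch, assuming the macro line in sparseconfig.h is written exactly as quoted in the note above; the function name use_ext_hash_fun is hypothetical and not part of PASHA.

```shell
# Hypothetical helper: point HASH_FUN_H back at <ext/hash_fun.h> for old GCC.
# Assumes the macro line is written exactly as quoted in the note above.
use_ext_hash_fun() {
    config="${1:-google/sparsehash/sparseconfig.h}"
    # Write to a temporary file first so the edit works with any POSIX sed.
    sed 's|^#define HASH_FUN_H <backward/hash_fun.h>|#define HASH_FUN_H <ext/hash_fun.h>|' \
        "$config" > "$config.tmp" && mv "$config.tmp" "$config"
}
```

Called without arguments from the PASHA source directory, the helper edits google/sparsehash/sparseconfig.h in place; a different path can be passed as the first argument.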
Synopsis
- pasha-kmergen DIRECTORY -fasta in.fasta
- pasha-pregraph DIRECTORY -fasta in.fasta
- pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
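The three stages in the synopsis can be chained in a small wrapper so that a failure in one stage stops the run. This is only a sketch: the function names run and pasha_assemble and the DRY_RUN variable are made up for this example, and passing both mate files to the first two stages as separate -fasta inputs is an assumption based on the typical commands shown further below.

```shell
# Run one command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: execute the three PASHA stages in order.
# Arguments: output directory, first mate file, second mate file, k-mer size.
pasha_assemble() {
    out_dir=$1 reads_1=$2 reads_2=$3 kmer_size=$4
    # "&&" stops the pipeline as soon as one stage fails.
    run pasha-kmergen  "$out_dir" -fasta "$reads_1" -fasta "$reads_2" -k "$kmer_size" &&
    run pasha-pregraph "$out_dir" -fasta "$reads_1" -fasta "$reads_2" &&
    run pasha-graph    "$out_dir" -fastaPaired "$reads_1" "$reads_2"
}
```

Setting DRY_RUN=1 before calling pasha_assemble DIRECTORY in_1.fasta in_2.fasta KMER_SIZE prints the three commands without running them, which is a cheap way to check the arguments first.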
Typical Usage for PASHA-KMERGEN
This program accepts FASTA and FASTQ input formats to generate k-mers. The following are the typical commands:
- pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE
- pasha-kmergen DIRECTORY -fasta in.fasta -fasta in2.fasta -k KMER_SIZE
- pasha-kmergen DIRECTORY -fasta in.fasta -fastq in2.fastq -k KMER_SIZE
- mpirun -np 8 pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE
Note: Since each pasha-kmergen process uses two threads, we recommend running, on a single node, no more processes than half the number of CPU cores in that node.
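The recommendation in this note can be turned into a small calculation. The sketch below is illustrative only; the function name kmergen_procs is hypothetical, and the lower bound of one process is an assumption added so that a single-core node still yields a usable value.

```shell
# Hypothetical helper: suggest an MPI process count for pasha-kmergen on
# one node. Each process uses two threads, so use at most half the cores,
# but never fewer than one process.
kmergen_procs() {
    cores=$1
    procs=$((cores / 2))
    if [ "$procs" -lt 1 ]; then
        procs=1
    fi
    echo "$procs"
}
```

On a 16-core node, kmergen_procs 16 prints 8, which matches the -np 8 used in the command above. The core count has to be supplied by the user; on many Linux systems the nproc utility from GNU coreutils reports it.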
Typical Usage for PASHA-PREGRAPH
This program uses the k-mers generated by pasha-kmergen to build a preliminary de Bruijn graph. This preliminary graph is then simplified by removing tips and low-coverage paths. The following are the typical commands:
- pasha-pregraph DIRECTORY -fasta in.fasta
- pasha-pregraph DIRECTORY -fasta in.fasta -fasta in2.fasta
- pasha-pregraph DIRECTORY -fasta in.fasta -fastq in2.fastq
- mpirun -np 8 pasha-pregraph DIRECTORY -fasta in.fasta
Note: When using multiple MPI processes, the number of processes used by pasha-pregraph must be the same as the number used by pasha-kmergen.
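One way to keep the two process counts in step is to generate both MPI commands from a single value. The following sketch only prints the commands; the function name mpi_stage_cmds is hypothetical, and the arguments are placeholders.

```shell
# Hypothetical helper: print the pasha-kmergen and pasha-pregraph commands
# with one shared process count, so the two stages cannot disagree.
# Arguments: process count, output directory, reads file, k-mer size.
mpi_stage_cmds() {
    np=$1 out_dir=$2 reads=$3 kmer_size=$4
    echo "mpirun -np $np pasha-kmergen $out_dir -fasta $reads -k $kmer_size"
    echo "mpirun -np $np pasha-pregraph $out_dir -fasta $reads"
}
```

For example, mpi_stage_cmds 8 DIRECTORY in.fasta KMER_SIZE prints the two commands with -np 8 in both, ready to be reviewed and run one after the other.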
Typical Usage for PASHA-GRAPH
This program merges bubbles, generates contigs and performs scaffolding. Users who do not want to perform scaffolding must use the "-fasta" or "-fastq" options to specify the input reads. With these two options, the input reads are treated as single-end reads, even if they are paired-end reads interleaved in single files. The following are the typical commands without scaffolding:
- pasha-graph DIRECTORY -fasta in.fasta
- pasha-graph DIRECTORY -fasta in.fasta -fasta in2.fasta
- pasha-graph DIRECTORY -fasta in.fasta -fastq in2.fastq
- pasha-graph DIRECTORY -fasta in.fasta -numthreads 4
Users who want to perform scaffolding must use the "-fastaPaired", "-fastqPaired", "-fastaPairedFile", or "-fastqPairedFile" options to specify the input reads. For the "-fastaPaired" and "-fastqPaired" options, the paired-end reads are stored in two separate files; for the "-fastaPairedFile" and "-fastqPairedFile" options, the paired-end reads are interleaved in a single file. Please note that each option specifies one library, meaning that paired-end reads from the same library must be stored in two separate files or be interleaved in a single file. The following are the typical commands with scaffolding:
- pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
- pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta -fastaPaired in2_1.fasta in2_2.fasta
- pasha-graph DIRECTORY -fastqPaired in_1.fastq in_2.fastq -numthreads 4
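Because each paired option describes exactly one library, a command for several libraries is built by repeating the option once per library. The sketch below assembles such a command line as text; the function name graph_scaffold_cmd and the PREFIX_1.fasta / PREFIX_2.fasta file naming are assumptions made for this example.

```shell
# Hypothetical helper: build a pasha-graph command with one -fastaPaired
# option per library. Arguments: output directory, then library prefixes.
# Files are assumed to be named PREFIX_1.fasta and PREFIX_2.fasta.
graph_scaffold_cmd() {
    out_dir=$1
    shift
    cmd="pasha-graph $out_dir"
    for lib in "$@"; do
        cmd="$cmd -fastaPaired ${lib}_1.fasta ${lib}_2.fasta"
    done
    echo "$cmd"
}
```

Calling graph_scaffold_cmd DIRECTORY in in2 reproduces the two-library command listed above.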
Change Log
- October 11, 2013 (v1.0.10)
- Fixed a bug in scaffolding that resulted in 4 instances of out-of-bounds array accesses, caused by loop limits that were too large. We thank Lech Nieroda from the University of Cologne for providing a patch for our program.
- June 20, 2013 (v1.0.9)
- Fixed a bug in contig scaffolding and made some changes to reduce the mis-assembly rate of scaffolds.
- September 01, 2012 (v1.0.6)
- Fixed a bug when using multiple paired-end libraries with different insert sizes for scaffolding.
- August 29, 2012 (v1.0.5)
- Fixed a bug when estimating insert sizes for paired-end reads. This bug can cause the program to crash at times.
- September 22, 2011 (v1.0.3)
- Added support for compressed FASTA and FASTQ formats, using zlib to open and read the input short reads.
Contact
If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.