Introduction

PASHA is a parallel short read assembler for large genomes using de Bruijn graphs. Taking advantage of both shared-memory multi-core CPUs and distributed-memory compute clusters, PASHA has demonstrated its potential to perform high-quality de novo assembly of large genomes in reasonable time with modest computing resources. Our evaluation using three small real paired-end datasets shows that PASHA produces better assemblies than three leading assemblers (Velvet, ABySS and SOAPdenovo), with comparable genome coverage and mis-assembly rates. Moreover, PASHA achieves the fastest speed for all three datasets on a single CPU. For the human genome, PASHA achieves assembly quality competitive with ABySS and completes the assembly in about 21 hours, which is about 2.38 times faster than ABySS on the same hardware configuration.


Downloads


Citation

Other related papers


Parameters


Installation and Usage

Preparation

PASHA comprises three executable binaries: pasha-kmergen, pasha-pregraph and pasha-graph. These three programs must be run in this order to perform the assembly. The assembly results are stored in DIRECTORY. Before compiling PASHA, you need to install Intel Threading Building Blocks (TBB) on your system. The paths to the header files for Intel TBB and MPI can be modified in the Makefile.

NOTE: Newer GCC versions (from GCC 4.4?) have changed the relative path of the header file "<ext/hash_fun.h>" to "<backward/hash_fun.h>". I have modified the macro "#define HASH_FUN_H <ext/hash_fun.h>" in the file "google/sparsehash/sparseconfig.h" to "#define HASH_FUN_H <backward/hash_fun.h>" to remain compatible with the newer GCC versions. Users of older GCC versions can manually change the macro back.
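
For older GCC versions, the manual change described in the note can be scripted. The following is a minimal sketch, not part of PASHA: the function name revert_hash_fun_macro is hypothetical, and the default path assumes the script is run from the top of the PASHA source tree.

```shell
# Sketch: point HASH_FUN_H back at <ext/hash_fun.h> for older GCC versions.
# revert_hash_fun_macro is a hypothetical helper, not shipped with PASHA.
# The default path is the file named in the note, relative to the source tree.
revert_hash_fun_macro() {
    config="${1:-google/sparsehash/sparseconfig.h}"
    tmp="${config}.tmp"
    # Rewrite only the HASH_FUN_H definition; every other line is copied as-is.
    # A temporary file is used because "sed -i" differs between GNU and BSD sed.
    sed 's|^\(#define[[:space:]][[:space:]]*HASH_FUN_H[[:space:]][[:space:]]*\)<backward/hash_fun.h>|\1<ext/hash_fun.h>|' \
        "$config" > "$tmp" && mv "$tmp" "$config"
}
```

Calling revert_hash_fun_macro with no argument edits the default file; pass a different path as the first argument if your layout differs.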

Synopsis

  1. pasha-kmergen DIRECTORY -fasta in.fasta
  2. pasha-pregraph DIRECTORY -fasta in.fasta
  3. pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
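
The three steps of the synopsis can be chained so that a later stage runs only if the earlier one succeeded. This is a sketch under stated assumptions: pasha_pipeline and the PASHA_RUNNER variable are hypothetical names introduced here, not part of PASHA. By default the commands are only printed.

```shell
# Sketch: run the three PASHA stages in order, stopping at the first failure.
# pasha_pipeline and PASHA_RUNNER are hypothetical, not part of PASHA.
# PASHA_RUNNER defaults to "echo" (dry run: commands are printed only);
# set PASHA_RUNNER="" to execute the binaries for real.
pasha_pipeline() {
    dir="$1"      # output DIRECTORY shared by all three stages
    reads="$2"    # reads for k-mer generation and the preliminary graph
    pair1="$3"    # first file of the paired-end library
    pair2="$4"    # second file of the paired-end library
    runner="${PASHA_RUNNER-echo}"
    $runner pasha-kmergen "$dir" -fasta "$reads" &&
    $runner pasha-pregraph "$dir" -fasta "$reads" &&
    $runner pasha-graph "$dir" -fastaPaired "$pair1" "$pair2"
}
```

For example, pasha_pipeline DIRECTORY in.fasta in_1.fasta in_2.fasta prints the three commands of the synopsis in order.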

Typical Usage for PASHA-KMERGEN

This program accepts FASTA and FASTQ input formats to generate k-mers. The following are the typical commands:

  1. pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE
  2. pasha-kmergen DIRECTORY -fasta in.fasta -fasta in2.fasta -k KMER_SIZE
  3. pasha-kmergen DIRECTORY -fasta in.fasta -fastq in2.fastq -k KMER_SIZE
  4. mpirun -np 8 pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE

Note: Since each pasha-kmergen process uses two threads, we recommend that the number of processes on a single node be at most half the number of CPU cores in that node.
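
The recommendation in the note amounts to halving the core count of the node. A minimal sketch follows; pasha_procs_per_node is a hypothetical helper name, and getconf reports online logical processors, which can exceed the number of physical cores when hyper-threading is enabled.

```shell
# Sketch: upper bound on pasha-kmergen processes for one node.
# pasha_procs_per_node is a hypothetical helper, not part of PASHA.
# Argument: number of CPU cores (optional; defaults to the online
# logical processor count, which may overcount physical cores).
pasha_procs_per_node() {
    cores="${1:-$(getconf _NPROCESSORS_ONLN)}"
    # Two threads per process, so allow at most cores / 2 processes.
    procs=$(( cores / 2 ))
    # Always allow at least one process, even on a single-core node.
    if [ "$procs" -lt 1 ]; then
        procs=1
    fi
    echo "$procs"
}
```

The result can be passed to mpirun, for example: mpirun -np "$(pasha_procs_per_node)" pasha-kmergen DIRECTORY -fasta in.fasta -k KMER_SIZE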

Typical Usage for PASHA-PREGRAPH

This program uses the k-mers generated by pasha-kmergen to build a preliminary de Bruijn graph. This preliminary graph is then simplified by removing tips and low-coverage paths. The following are the typical commands:

  1. pasha-pregraph DIRECTORY -fasta in.fasta
  2. pasha-pregraph DIRECTORY -fasta in.fasta -fasta in2.fasta
  3. pasha-pregraph DIRECTORY -fasta in.fasta -fastq in2.fastq
  4. mpirun -np 8 pasha-pregraph DIRECTORY -fasta in.fasta

Note: When using multiple MPI processes, the number of processes used by pasha-pregraph must be the same as the number used by pasha-kmergen.
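
One way to keep the two process counts identical is to pass a single value to both stages. The sketch below assumes that approach; pasha_mpi_stages and PASHA_RUNNER are hypothetical names, and the commands are only printed unless PASHA_RUNNER is set to an empty string.

```shell
# Sketch: launch both MPI stages with one shared process count.
# pasha_mpi_stages and PASHA_RUNNER are hypothetical, not part of PASHA.
# PASHA_RUNNER defaults to "echo" (dry run); set PASHA_RUNNER="" to execute.
pasha_mpi_stages() {
    np="$1"       # number of MPI processes, used for BOTH stages
    dir="$2"      # output DIRECTORY
    reads="$3"    # input reads in FASTA format
    k="$4"        # k-mer size passed to pasha-kmergen
    runner="${PASHA_RUNNER-echo}"
    $runner mpirun -np "$np" pasha-kmergen "$dir" -fasta "$reads" -k "$k" &&
    $runner mpirun -np "$np" pasha-pregraph "$dir" -fasta "$reads"
}
```

For example, pasha_mpi_stages 8 DIRECTORY in.fasta KMER_SIZE prints the two mpirun commands, both with -np 8.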

Typical Usage for PASHA-GRAPH

This program merges bubbles, generates contigs and performs scaffolding. Users who do not want to perform scaffolding must use the "-fasta" or "-fastq" options to specify the input reads. With these two options, the input reads are treated as single-end reads, even if they are paired-end reads interleaved in a single file. The following are the typical commands without scaffolding:

  1. pasha-graph DIRECTORY -fasta in.fasta
  2. pasha-graph DIRECTORY -fasta in.fasta -fasta in2.fasta
  3. pasha-graph DIRECTORY -fasta in.fasta -fastq in2.fastq
  4. pasha-graph DIRECTORY -fasta in.fasta -numthreads 4

Users who want to perform scaffolding must use the "-fastaPaired", "-fastqPaired", "-fastaPairedFile", or "-fastqPairedFile" options to specify the input reads. For the "-fastaPaired" and "-fastqPaired" options, the paired-end reads are stored in two separate files; for the "-fastaPairedFile" and "-fastqPairedFile" options, the paired-end reads are interleaved in a single file. Please note that each option specifies one library, meaning that the paired-end reads from the same library must either be stored in two separate files or be interleaved in a single file. The following are the typical commands when using scaffolding:

  1. pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta
  2. pasha-graph DIRECTORY -fastaPaired in_1.fasta in_2.fasta -fastaPaired in2_1.fasta in2_2.fasta
  3. pasha-graph DIRECTORY -fastqPaired in_1.fastq in_2.fastq -numthreads 4
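
With several libraries, the option list can be assembled from a description of each library. The sketch below is illustrative only: pasha_graph_fastq_args and the "file1,file2" notation are hypothetical conventions introduced here, file names containing spaces or commas are not handled, and combining "-fastqPaired" with "-fastqPairedFile" in one command is an assumption that should be checked against your PASHA version.

```shell
# Sketch: build pasha-graph scaffolding options for FASTQ libraries.
# pasha_graph_fastq_args is a hypothetical helper, not part of PASHA.
# Each argument describes ONE library:
#   "file1,file2" -> two separate files   -> -fastqPaired file1 file2
#   "file"        -> one interleaved file -> -fastqPairedFile file
pasha_graph_fastq_args() {
    args=""
    for lib in "$@"; do
        case "$lib" in
            *,*) args="$args -fastqPaired ${lib%%,*} ${lib#*,}" ;;
            *)   args="$args -fastqPairedFile $lib" ;;
        esac
    done
    # Drop the leading space left by the first append.
    echo "${args# }"
}
```

The output is meant to be word-split by the shell, for example: pasha-graph DIRECTORY $(pasha_graph_fastq_args in_1.fastq,in_2.fastq) -numthreads 4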


Change Log


Contact

If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.