PEPPER v0.1 release
PEPPER v0.1 release notes (haploid assembly polisher)
PEPPER is a recurrent neural network-based haploid genome assembly polisher. This is the first release of the haploid assembly polishing component of PEPPER. We tested PEPPER's performance on several human genome samples, Zymo microbial community samples, and non-model organisms. The performance of PEPPER suggests that we can achieve highly accurate genome assemblies using ONT reads only.
Installation
PEPPER can be installed with pip.
python3 -m pip install pepper-polish
# If you get a permission error, try:
python3 -m pip install --user pepper-polish
python3 -m pepper.pepper --help
python3 -m pepper.pepper polish --help
# Expected output: PEPPER VERSION: 0.1.1
Models
The model files are available here: https://github.com/kishwarshafin/pepper/tree/r0.1/models
MinION_r10_native_microbial.pkl : For R10.3 Guppy 3.4.8 (Microbial)
MinION_r10_pcr_microbial.pkl : For R10.3 Guppy 3.4.8 (Microbial)
PEPPER_polish_haploid_guppy360.pkl : Supports Guppy 3.0.5 to Guppy 4+ (Large genomes; trained to be sensitive to the heterozygosity of the genome, can be used in phase-aware polishing)
PromethION_r941_guppy305_HAC_human.pkl : Supports Guppy 3.0.5 to Guppy 4+ (Large genomes)
PromethION_r941_guppy305_HAC_microbial.pkl : Supports Guppy 3.0.5 to Guppy 4+ (Microbial)
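As a rough aid for choosing among the files above, the list can be written as a small lookup table. This is only an illustrative sketch: the `MODELS` table and the `candidate_models` helper are hypothetical names, not part of PEPPER, and the "microbial" and "large" labels are simply taken from the parenthetical notes in the list.

```python
# Hypothetical helper, not part of PEPPER: the table below is transcribed
# from the model list in these release notes.
MODELS = {
    "MinION_r10_native_microbial.pkl": ("R10.3, Guppy 3.4.8", "microbial"),
    "MinION_r10_pcr_microbial.pkl": ("R10.3, Guppy 3.4.8", "microbial"),
    "PEPPER_polish_haploid_guppy360.pkl": ("Guppy 3.0.5 to Guppy 4+", "large"),
    "PromethION_r941_guppy305_HAC_human.pkl": ("Guppy 3.0.5 to Guppy 4+", "large"),
    "PromethION_r941_guppy305_HAC_microbial.pkl": ("Guppy 3.0.5 to Guppy 4+", "microbial"),
}


def candidate_models(genome_type):
    """Return the model file names listed for 'microbial' or 'large' genomes."""
    if genome_type not in {"microbial", "large"}:
        raise ValueError("genome_type must be 'microbial' or 'large'")
    # Sort for a stable, reproducible order.
    return sorted(name for name, (_, kind) in MODELS.items() if kind == genome_type)


if __name__ == "__main__":
    # Print each microbial model with the basecaller note from the list.
    for name in candidate_models("microbial"):
        print(name, ":", MODELS[name][0])
```

The helper only narrows the list by genome type; the final choice still depends on the flowcell chemistry and basecaller version of your data.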
Motivation
Assemblies generated using ONT data usually have low base-level quality and require further polishing. Existing polishers like Racon-Medaka can improve the base-level quality of an assembly but perform poorly in transcriptome completeness. Previously, we introduced a new polisher suite, MarginPolish-HELEN, with superior performance in transcriptome completeness and base-level accuracy. However, MarginPolish-HELEN has runtime and cost overhead. To overcome this issue, we developed PEPPER, which uses local realignment of reads to the assembly to produce highly accurate polished genome assemblies while being sensitive to the structural integrity of the assembly. PEPPER can be paired with Shasta, Flye, Canu, or any other ONT-based assembler. The performance of PEPPER as a standalone assembly polisher is superior to that of any other existing ONT assembly polisher, including MarginPolish-HELEN.
We participated in the HPRC assembly bakeoff, where the Shasta-PEPPER HG002 assembly achieved Q35 in assembly quality while having transcriptome completeness similar to that reported in the Shasta-MarginPolish-HELEN paper.
Extension to variant calling
In collaboration with Google Health, we used a modified version of the haploid assembly polisher mode of PEPPER and paired it with DeepVariant to achieve state-of-the-art performance in reference-based small variant calling with ONT reads. Our effort was recognized by the PrecisionFDA Truth Challenge V2, where PEPPER-DeepVariant achieved top awards in the ONT category. This work is still in development, and future releases will include details about the modules we are developing to enable ONT-based variant calling.
Collaboration with the Darwin Tree of Life project and other projects
The Darwin Tree of Life project plans to sequence and assemble all known species of animals, plants, fungi and protists in Britain and Ireland. The project picked Shasta to generate de novo ONT assemblies efficiently and, after evaluating multiple existing assembly polishers, picked PEPPER to polish the assemblies. We are collaborating with Ksenia Krasheninnikova from the Wellcome Sanger Institute, who is actively evaluating PEPPER on non-model vertebrate genomes and helping us improve our methods.
We are also collaborating with several other groups to use PEPPER to polish ONT-based genome assemblies. We have applied PEPPER to polish tomato genomes, non-human vertebrate genomes, highly heterozygous plant genomes and microbial genomes. In all cases, we saw better performance than existing polishing tools in both the structural integrity of the genome assembly and base-level quality.
Future direction
PEPPER builds a foundation upon which we plan to develop a set of next-generation genome inference tools for ONT reads. In collaboration with Google Health, we were able to use PEPPER as a primary candidate finder that enabled DeepVariant to identify variants from ONT reads accurately. We plan to keep improving the variant-calling pipeline. Moreover, Shasta is now producing haplotype-resolved genome assemblies, and we plan to deploy a diploid assembly polishing pipeline soon.