De Novo Genome Assembly

Introduction

Welcome! Here, we are going to run through some of the typical and most common steps for taking a newly sequenced isolate genome from raw Fastq files through to an assembled, curated genome you can begin to explore. It’s assumed you’re already somewhat familiar with working at the command line.

The steps shown below are definitely not the only way to process raw sequencing data, nor am I claiming that they are the best way! But these are the steps that I have found to work well!

*Disclaimer* I’ve learned SO MUCH from @AstrobioMike, and this walkthrough heavily borrows from his own site! You should check it out!

Anytime I am talking about things that are specific to my own computer or making comments that are not essential to the tutorial, they will appear like this!

Tools used here

I’ve listed the tools that we are going to use below. The links below go to their respective developer pages where you can usually find great help for installation, but you can also use the package manager conda to install things.

I really recommend using miniconda. Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. The details are not super important, but in short, conda (and miniconda) is really nice and makes things a lot easier! All of the instructions below will be run assuming that you have installed miniconda (or some other form of python). I’ve attached some instructions for getting miniconda to work below.

Downloading and getting miniconda to work on a Mac
  1. Go to the miniconda main page and download the correct version of miniconda for your operating system.
  2. Open your terminal and navigate to the folder where you downloaded the installer. Type the following command in the terminal and press “return” on your keyboard. Here, mine is located in my downloads folder.
cd /Users/Patrick/Downloads
  3. Now, run the installation program that is in your downloads folder. My file is named ‘Miniconda3-latest-MacOSX-x86_64.1.sh’.
bash /Users/Patrick/Downloads/Miniconda3-latest-MacOSX-x86_64.1.sh
  4. Now to test whether the installation was successful, type and run the following command in your terminal:
conda list
Downloading and getting miniconda to work on Windows
  1. Go to the miniconda main page and download the correct version of miniconda for your operating system.
  2. Double-click the .exe file.
  3. Follow the instructions on the screen. If you are unsure about any setting, accept the defaults. You can change them later. When installation is finished, from the Start menu, open the Anaconda Prompt.
  4. Now to test whether the installation was successful, type and run the following command in your terminal:
conda list
Packages

Now that we have conda, here are all of the packages that we will be using:

Conda Installations

This will create an environment for this tutorial and install all these programs I used when putting this together (this will take a long time, I recommend allocating a couple of hours or even running this the day before you will need it). At the end there is an example of how to remove this environment 😊

conda create -n denovo_tutorial -c bioconda -c conda-forge fastqc \
             trimmomatic spades megahit quast multiqc abyss \
             bowtie2 anvio centrifuge java-jdk --yes

And to activate the environment:

conda activate denovo_tutorial

Or, if that gives an error (depending on how our conda is set up), we may need to run:

source activate denovo_tutorial

Afterwards, we can exit the environment like this:

conda deactivate
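
If you ever lose track of which environments you have (or want to double-check that the create command above actually finished), conda can list them for you:

# list all conda environments; the active one is marked with an asterisk
conda env list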

The data

The practice data we’re going to use here was provided by a study on clinical multi-drug resistant isolates of Pseudomonas aeruginosa. This data was originally uploaded to the European Nucleotide Archive under accession #PRJNA554598. Sequencing was performed using paired-end 2×300 bp reads on an Illumina MiSeq.

You can download all the data files and execute the commands as below. I’ve provided the raw reads and the error-corrected reads so that step can be skipped if wanted. The download also includes most of the intermediate and all of the end-result files so you can explore any component along the way at will without doing the processing.

You can download a zip file with all the raw, intermediate, and final files using this command:

Make sure you download this to the folder/directory that you want to work in

cd ~/
wget -v -O Denovo_Tutorial.zip "https://www.dropbox.com/s/qrbmqk9sgq1jm83/Denovo_Tutorial.zip?dl=1"
unzip Denovo_Tutorial.zip && rm Denovo_Tutorial.zip

Quality Filtering

Assessing the quality of your sequence data and filtering appropriately should be the first thing you do with your dataset. We’ll do that with FastQC, starting with the raw reads.

FastQC scans the fastq files you give it to generate a broad overview of some summary statistics. It produces an html output for each fastq file of reads it is given (they can be gzipped). Let’s make sure we are in the proper working directory…

Don’t forget to change this script so that it matches the path to where you downloaded the files!

cd /Users/patrick/Desktop/Denovo_Tutorial

and then FastQC can be run like so:

mkdir ./qualityreports && fastqc SRR9681949_1.fastq SRR9681949_2.fastq -o ./qualityreports -t 4

This will make a new folder for the quality reports to be put into and then will run FastQC on our two samples.

The resulting html output files can be opened and explored showing all of the modules FastQC scans. These can be tough to understand when you’re looking at them with no prior experience. BUT, for example here is a good quality output and here is a relatively poor one.

Looking at our output from the forward reads (SRR9681949_1_fastqc.html), I typically want all of my reads to have a quality score above 28 (the green zone). These aren’t the best quality reads, but we have a lot of them! Hopefully that will help us later on:

Looking at the output from the reverse reads (SRR9681949_2_fastqc.html), not too much stands out other than the quality scores still being pretty mediocre, but this is expected of reverse reads:

The blue line is the mean quality score of all reads at the corresponding positions, the red line is the median, and the yellow boxplots represent the interquartile range, and the whiskers the 10th and 90th percentiles. Getting this type of information from FastQC helps us determine what parameters we want to set for our quality filtering/read trimming.

Trimmomatic is a pretty flexible tool that enables you to trim up your sequences based on several quality thresholds and some other metrics (like minimum length or removing adapters and such). Based on what we saw in the FastQC summary, for a first pass I just ran Trimmomatic with pretty stringent settings. I’ve also added in two lines of code that will help to clean up our directories and keep things tidy:

mkdir ./reads_strictfiltering && \
trimmomatic PE SRR9681949_1.fastq SRR9681949_2.fastq \
            SRR9681949_1_paired.fastq.gz SRR9681949_1_unpaired.fastq.gz \
            SRR9681949_2_paired.fastq.gz SRR9681949_2_unpaired.fastq.gz \
            LEADING:10 TRAILING:10 SLIDINGWINDOW:5:20 MINLEN:300 && \
mv ./*.fastq.gz ./reads_strictfiltering

If you want to skip this and just copy the results run:

mkdir ./reads_strictfiltering && \
cp ./Results/reads_strictfiltering/*.gz ./reads_strictfiltering

Our filtering thresholds here start with “LEADING:10”. This says to cut bases off the start of the read if their quality score is below 10, and we have the same set for the end with “TRAILING:10”. Then the sliding window parameters are 5 followed by 20, which means: starting at base 1, look at a window of 5 bps, and if the average quality score drops below 20, truncate the read at that position and only keep up to that point. The stringent part comes in with the MINLEN:300 at the end. Since the reads are already only 300 bps long, this means that if any part of a read is truncated due to the quality metrics set above, the entire read will be thrown away.

Yikes! The output from that shows us that only 0.35% of the read pairs (both forward and reverse from the same fragment) passed! This definitely will not be enough. Although, this is somewhat expected considering we used very stringent filtering parameters and did not have great quality, especially towards the end of the reads.
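
If you want to verify that number yourself, each FASTQ record is exactly 4 lines, so dividing a file’s line count by 4 gives its read count (on a Mac, gunzip -c is a safer bet than zcat for .gz files):

# count the surviving forward reads after the strict filtering run
echo "$(( $(gunzip -c ./reads_strictfiltering/SRR9681949_1_paired.fastq.gz | wc -l) / 4 )) read pairs survived"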

Let’s try some less stringent settings:

mkdir ./reads_looserfiltering && \
trimmomatic PE SRR9681949_1.fastq SRR9681949_2.fastq \
            SRR9681949_1_paired.fastq.gz SRR9681949_1_unpaired.fastq.gz \
            SRR9681949_2_paired.fastq.gz SRR9681949_2_unpaired.fastq.gz \
            CROP:225 LEADING:10 TRAILING:10 SLIDINGWINDOW:5:20 MINLEN:220 && \
mv ./*.fastq.gz ./reads_looserfiltering

If you want to skip this and just copy the results run:

mkdir ./reads_looserfiltering && \
cp ./Results/reads_looserfiltering/*.gz ./reads_looserfiltering

These small changes gave us a lot more data, but still only ~31% of our reads passed! We are left with ~1,200,000 read pairs. Since we know what we’re working with here (meaning it’s an isolate of a known genus and not a metagenome or something completely unknown), we can pretty quickly estimate if this could even possibly be enough depth for us to assemble the genome we’re expecting. Assuming those reads were perfect quality and perfectly evenly distributed (which they’re not), that would be (1,200,000 read pairs) * (440 bps per pair, since each trimmed read is at least 220 bps) = 528 Mbps covered. Most P. aeruginosa genomes are around 6 Mbps, meaning we’d have around 88x coverage right now. You typically want to aim for around ~50-100X coverage for the de novo assembly of a typical prokaryotic genome. Even though we had to trim more than normal, we are looking good now!
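
That back-of-the-envelope math is easy enough to sanity-check right in the shell (the numbers below are just the estimates from the paragraph above):

# (read pairs) * (bps per pair) / (genome size in bps) = rough coverage
echo "1200000 * 440 / 6000000" | bc   # ~88x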

And just for a peek at the FastQC output after our trimming:

mkdir ./looserfilteredqualityreports && \
cd ./reads_looserfiltering && \
fastqc SRR9681949_1_paired.fastq.gz SRR9681949_2_paired.fastq.gz -o ../looserfilteredqualityreports && \
cd ../
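
Since MultiQC was part of our conda install, we can also aggregate all of these FastQC reports (before and after trimming) into a single page, which makes comparing them side-by-side much easier. The output folder name here is just one I made up:

# combine all FastQC reports into one html summary page
multiqc ./qualityreports ./looserfilteredqualityreports -o multiqc_output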

Assembly

Now that we have our reads quality filtered, we’re ready to move on to assembling them. There are lots of assembly programs out there, and once again, there is no one-size-fits-all. SPAdes and MEGAHIT are both good and we’ll try those!

Error Correction

I recommend incorporating an error-correction step, as it tends to improve assembly results, and the one I happen to use is available through the SPAdes program. So even when I use the assembly from another program, I run error-correction on the reads with SPAdes first. A default SPAdes run will perform the error-correction step and save the corrected reads so you can then use them elsewhere. If you don’t want to do the assembly with SPAdes but still want to run the error-correction step, you can set the --only-error-correction flag like we do first here.

This took ~40 minutes.

spades.py -1 ./reads_looserfiltering/SRR9681949_1_paired.fastq.gz \
           -2 ./reads_looserfiltering/SRR9681949_2_paired.fastq.gz \
           --only-error-correction \
           -o spades_error_corrected_looserreads -t 50 -m 500

To skip this and copy these results over:

mkdir spades_error_corrected_looserreads &&\
mkdir spades_error_corrected_looserreads/corrected &&\
cp ./Results/spades_error_corrected_looserreads/corrected/*paired.fastq.00.0_0.cor.fastq.gz ./spades_error_corrected_looserreads/corrected

SPAdes

Now that we have run the error correction, we can run the actual assemblies!

First, let’s run the SPAdes assembly with its default settings using our error-corrected reads. You can run the default assembly program in isolate mode. The --isolate flag is designed to better handle the higher coverage that is typically found with isolate sequencing. We will also use the --only-assembler flag to tell it not to run the error-correction steps, since we just did that above:

spades.py -1 ./spades_error_corrected_looserreads/corrected/SRR9681949_1_paired.fastq.00.0_0.cor.fastq.gz \
          -2 ./spades_error_corrected_looserreads/corrected/SRR9681949_2_paired.fastq.00.0_0.cor.fastq.gz \
          -o spades_default_assembly_loose --only-assembler --isolate

To skip this and copy these results over:

mkdir spades_default_assembly_loose &&\
cp ./Results/spades_default_assembly_loose/contigs.fasta ./spades_default_assembly_loose

It has also been suggested that paired-end Illumina data can benefit from having the kmer lengths set to 21, 33, 55, 77, 99, and 121. This is the default if the reads are at least 150 bps. It is also recommended that you run the assembler in --careful mode, which attempts to decrease the rate of misassemblies. Unfortunately, you can’t run --careful and --isolate at the same time, but that’s okay. This will give us a good comparison between the two. Let’s try both of these things to see if it makes a difference:

spades.py -1 ./spades_error_corrected_looserreads/corrected/SRR9681949_1_paired.fastq.00.0_0.cor.fastq.gz \
          -2 ./spades_error_corrected_looserreads/corrected/SRR9681949_2_paired.fastq.00.0_0.cor.fastq.gz \
          -o spades_kmers_set_careful_assembly_looser -k 21,33,55,77,99,121 \
          --careful --only-assembler

To skip this and copy these results over:

mkdir spades_kmers_set_careful_assembly_looser &&\
cp ./Results/spades_kmers_set_careful_assembly_looser/contigs.fasta ./spades_kmers_set_careful_assembly_looser

MEGAHIT

Next, we will run a couple of assemblies with MEGAHIT. The first will use the default settings:

megahit -1 ./spades_error_corrected_looserreads/corrected/SRR9681949_1_paired.fastq.00.0_0.cor.fastq.gz \
        -2 ./spades_error_corrected_looserreads/corrected/SRR9681949_2_paired.fastq.00.0_0.cor.fastq.gz \
        -o megahit_default_assembly_looser -t 4

To skip this and copy these results over:

mkdir megahit_default_assembly_looser &&\
cp ./Results/megahit_default_assembly_looser/final.contigs.fa ./megahit_default_assembly_looser

And then we will run it again using a different --min-count setting. This is suggested in the developer tips for assembly:

megahit -1 ./spades_error_corrected_looserreads/corrected/SRR9681949_1_paired.fastq.00.0_0.cor.fastq.gz \
        -2 ./spades_error_corrected_looserreads/corrected/SRR9681949_2_paired.fastq.00.0_0.cor.fastq.gz \
        -o megahit_min_count_3_assembly_looser --min-count 3

To skip this and copy these results over:

mkdir megahit_min_count_3_assembly_looser &&\
cp ./Results/megahit_min_count_3_assembly_looser/final.contigs.fa ./megahit_min_count_3_assembly_looser

Comparing Assemblies

Now that we have a couple of different assemblies, we can compare them. A useful tool for this is QUAST (and if you’re ever working on metagenomes, MetaQUAST). We can give QUAST our assemblies, a fasta file of our reference genome, and a .gff (general feature format) file of our reference genome that contains information about its genes. I downloaded the two reference files for P. aeruginosa from the European Nucleotide Archive (ENA) here. And here’s how we can run it:

quast -o quast_comparison -r References/paeruginosa.fna \
      -g References/paeruginosa.gff \
      -l "spades_default, spades_kmers_careful, megahit_default, megahit_min_count_3" \
      -t 4 -m 1000 spades_default_assembly_loose/contigs.fasta \
      spades_kmers_set_careful_assembly_looser/contigs.fasta \
      megahit_default_assembly_looser/final.contigs.fa \
      megahit_min_count_3_assembly_looser/final.contigs.fa

They all look close, but it appears that SPAdes with the --careful settings worked the best this time. The differences are small and likely wouldn’t matter in the long run, but for the sake of the tutorial, I am going to proceed with only one assembly. Though it’s never a bad idea to continue with multiple assemblies to figure out the differences!
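
If you’d rather skim the comparison without leaving the terminal, QUAST writes its table in several formats inside the output directory (report.txt and report.html are its standard output names):

# plain-text version of the comparison table
less -S quast_comparison/report.txt

# or open the interactive html version (macOS)
open quast_comparison/report.html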

Explore our assembly using anvi’o

Now that we have decided which assembly we are going to move forward with, we can start to dive into that assembly a little more. A great tool for this is anvi’o. anvi’o is a powerful data visualization and exploration platform.

For us to get our assembly into anvi’o, first we need to generate a contigs database. This contains the contigs from our assembly and information about them. We need to organize our contigs in an anvi’o-friendly way, generate some basic stats about them, and use the program Prodigal to identify open-reading frames.

mkdir anvio && \
anvi-gen-contigs-database -f spades_kmers_set_careful_assembly_looser/contigs.fasta \
                          -o anvio/contigs.db -n assembly

Now we have our contigs database (contigs.db). This holds our sequences and some basic information about them. And now we can:

• use the program HMMER with 3 profile hidden Markov models (HMMs) to scan for bacterial single-copy genes (from Campbell et al. 2013) and ribosomal RNAs (from Torsten Seemann’s Barrnap tool)

• use NCBI COGs to functionally annotate the open-reading frames Prodigal predicted with either BLAST or DIAMOND

• and use a tool called centrifuge for taxonomic classification of the identified open-reading frames

# HMM searching for single-copy genes and rRNAs
anvi-run-hmms -c anvio/contigs.db -T 4

# functional annotation with DIAMOND against NCBI's COGs
anvi-setup-ncbi-cogs -T 4 # only needed the first time
anvi-run-ncbi-cogs -c anvio/contigs.db --num-threads 4

# exporting Prodigal-identified open-reading frames from anvi'o
anvi-get-sequences-for-gene-calls -c anvio/contigs.db -o anvio/gene_calls.fa

# setting up and running centrifuge for taxonomy
mkdir ./centrifuge &&\
cd ./centrifuge &&\
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz &&\
tar -xzvf p_compressed+h+v.tar.gz && rm -rf p_compressed+h+v.tar.gz &&\
cd ../ &&\
centrifuge -f -x centrifuge/p_compressed+h+v anvio/gene_calls.fa -S centrifuge/centrifuge_hits.tsv -p 4 && mv centrifuge_report.tsv ./centrifuge

# importing the taxonomy results into our anvi'o contigs database
anvi-import-taxonomy-for-genes -c anvio/contigs.db -i centrifuge/centrifuge_report.tsv centrifuge/centrifuge_hits.tsv -p centrifuge

Now we want to add mapping information from the recruiting of the reads to our assembly. We can do this with bowtie2:

# building bowtie index from our selected assembly fasta file
bowtie2-build spades_kmers_set_careful_assembly_looser/contigs.fasta \
              spades_kmers_set_careful_assembly.btindex

# mapping our reads
bowtie2 -q -x spades_kmers_set_careful_assembly.btindex \
        -1 ./spades_error_corrected_looserreads/corrected/SRR9681949_1_paired.fastq.00.0_0.cor.fastq.gz \
        -2 ./spades_error_corrected_looserreads/corrected/SRR9681949_2_paired.fastq.00.0_0.cor.fastq.gz \
        -p 4 -S spades_kmers_set_careful_assembly.sam

# converting to a bam file
samtools view -bS spades_kmers_set_careful_assembly.sam > B_cep_assembly.bam

# sorting and indexing our bam file (can be done with samtools also)
anvi-init-bam B_cep_assembly.bam -o B_cep.bam
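
samtools came along with our environment (we already used it above to convert the sam file), so if you want a quick sanity check of how well the reads recruited to the assembly before moving on:

# summary of mapped and properly-paired read counts for our sorted bam file
samtools flagstat B_cep.bam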

We can then integrate this mapping information into anvi’o with the anvi-profile program:

# integrating the read-mapping info into anvi'o
anvi-profile -i B_cep.bam -c anvio/contigs.db -M 1000 -T 4 --cluster-contigs -o B_cep_profiled/

You can skip everything from the # HMM searching for single-copy genes and rRNAs step to here and get the final files by running:

mkdir anvio &&\
mkdir B_cep_profiled &&\
cp ./Results/anvio/contigs.db ./anvio &&\
cp -r ./Results/B_cep_profiled/ ./B_cep_profiled

Now we have generated and put a lot of information about our assembly into the contigs and profile databases. And because this is anvi’o, it is really easy to access specific parts of that information! For example, anvi-get-sequences-for-hmm-hits will give us the sequences from our HMM hits above. So, we can look at what was identified as ribosomal RNA:

anvi-get-sequences-for-hmm-hits -c anvio/contigs.db --hmm-sources Ribosomal_RNAs -o rRNAs.fa

This wrote all of the ribosomal RNA hits to a new file called rRNAs.fa, and if we look in that file we see 2 were found, a 16S and a 23S. If we quickly BLAST them, they are both 100% identical to P. aeruginosa!
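
If you’d rather BLAST from the command line, BLAST+ can run the search remotely against NCBI’s nt database (note that BLAST+ is not part of the environment we created, so it would need to be installed separately):

# remote blast of our rRNA hits against NCBI nt, keeping the top 5 hits each
blastn -query rRNAs.fa -db nt -remote -max_target_seqs 5 \
       -outfmt "6 qseqid sseqid pident length evalue"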

Similarly, we can pull out the amino acid sequences of the bacterial single-copy genes that were found:

anvi-get-sequences-for-hmm-hits -c anvio/contigs.db --hmm-sources Campbell_et_al --get-aa-sequences -o bacterial_SCGs.faa --no-wrap

We can also generate some summary statistics on our assembly, including estimated percent completion and redundancy based on the presence/absence of the single-copy marker genes we scanned for above:

# this is adding all contigs to a group called "DEFAULT"
anvi-script-add-default-collection -p B_cep_profiled/PROFILE.db
  
# and here is our summary command
anvi-summarize -c anvio/contigs.db -p B_cep_profiled/PROFILE.db -C DEFAULT -o B_cepacia_assembly_summary/

This creates a lot of output, including an interactive html document that you can open and explore!
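
On a Mac, for example, you can open it straight from the terminal (index.html inside the summary directory is the entry point anvi’o generates):

# open the interactive summary in the default browser (macOS)
open B_cepacia_assembly_summary/index.html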

We can also visually inspect our assembly and how the reads that went into it recruit to it. In theory, if all the DNA in the assembly came from the same organism (i.e., it’s a clean assembly), there should be pretty even coverage across the whole thing. So let’s finally take a look with anvi-interactive.

anvi-interactive -c anvio/contigs.db -p B_cep_profiled/PROFILE.db --title "P. aeruginosa assembly"

Before I say anything else, I understand that there is a lot going on here. You can read about it here and here to start digging into it some more when you can.

At the center of the figure is a hierarchical clustering of the contigs from our assembly (here clustered based on tetranucleotide frequency and coverage). So each tip (leaf) represents a contig (or a fragment of a contig, as each is actually broken down into a max of ~20,000 bps, but for right now I’ll just be referring to them as contigs). Then moving out from the center are layers of information (“Parent”, “Length”, “GC content”, etc.), with each layer displaying information for each contig.

The solid bar all the way around tells us that the genes in almost the entire assembly were identified as Pseudomonas. There are a few small stretches that were not identified as Pseudomonas, but the vast majority was. The next thing that stands out is how stable the mean coverage is across all contigs. Overall, this is a good assembly 🙂

Just for a quick comparison, here is the same type of figure (taxonomy at the inner part of the circle here), but from an enrichment (rather than axenic) culture:

credit: https://astrobiomike.github.io/genomics/de_novo_assembly

This was all basically to get a high-quality draft of our isolate genome that we can feel confident about investigating further. Once you feel comfortable with your assembled genome, you can go a lot of different directions, though I won’t discuss those here.

And then if you want to remove the environment completely, you can run this:

conda env remove -n denovo_tutorial
