Large Dataset Phylogenetic Trees

One of the problems that I have consistently come across when trying to analyze large amplicon-based sequencing data sets (i.e. more than 150 individual samples) is creating a phylogenetic tree. One of the ways that I had previously attempted to create a phylogenetic tree from a sequence table can be found here using phanghorn. I have found that this works well for small datasets (30-40 samples), but runs into problems when running on larger datasets because it takes a restrictive amount of time to run, even when running on a cluster. Because of this, I went searching for alternate ways to create phylogenetic trees (1) because I think they look cool and (2) I wanted to be able to analyze datasets using weighted and unweighted Unifrac distances. I’ve condensed a method that I found to work well with large datasets below and hopefully it can be helpful to other who also have this problem!

I am starting with a phyloseq object here, that has already had the contaminants removed. I’ve found that this makes it easier, because the tree often won’t change as contaminants are removed. Probably my lack of knowledge, though it can for sure be done both ways.

pseq = merge_phyloseq(pseq1_decontam, pseq2_decontam, pseq3_decontam) # This is my phyloseq

## Now prep to export to Qiime to make the phylogenetic trees
# Extract abundance matrix from the phyloseq object
clean_seqs = as(otu_table(pseq), "matrix")

write.table(t(clean_seqs), "./seqtab.txt", sep="\t", row.names=TRUE, col.names=NA, quote=FALSE)
uniquesToFasta(clean_seqs, fout='./rep-seqs.fna', ids=colnames(clean_seqs))

Now we will need to switch over to bash to run a few commands. I really recommend getting familiar with bash and the condo environment, they are both great tools and can really help make things easier when working with large datasets and anything generally bioinformatics. For information on bash, you can check out this great site that goes over some of the basics of running commands in the command line and a lot of other really helpful bioinformatics tutorials. To learn about conda, you can check out my de novo genome assembly walkthrough that has a whole section on downloading and installing conda through the command line.

We are also going to be using Qiime2 here. You can visit their site and find great instructions for downloading and any other information you could ever want on Qiime2.

## Now we need to run these in qiime ##
conda activate qiime2-2020.2 #Activate Qiime2 (you may need to change this based on your version of Qiime2 that you are using

qiime tools import \
--input-path rep-seqs.fna \
--type 'FeatureData[Sequence]' \
--output-path rep-seqs1.qza

echo -n "#OTU Table" | cat - seqtab.txt > biom-table1.txt

biom convert -i biom-table1.txt -o table1.biom --table-type="OTU table" --to-hdf5

qiime tools import \
--input-path table1.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV210Format \
--output-path table1.qza

qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep-seqs1.qza \
--o-alignment aligned-rep-seqs1.qza \
--o-masked-alignment masked-aligned-rep-seqs1.qza \
--o-tree unrooted-tree1.qza \
--o-rooted-tree rooted-tree1.qza

qiime tools export \
--input-path rooted-tree1.qza \
--output-path tree

Now that we have created out tree, we can read in the tree from Qiime2 back into R:

tree = read_tree('./tree.nwk')

pseq.final = merge_phyloseq(pseq, tree)

And then you can take a look at your tree using:

plot_tree(tree)
%d bloggers like this: