Advanced Clustering Techniques for Metagenomic Sequences
Chapter 1: Understanding Clustering in Metagenomics
Clustering is the process of organizing data points such that similar points are grouped closely, while dissimilar points are positioned further apart. This technique is particularly useful in sequence analysis, especially within the realm of metagenomics (you can find further details on metagenomics in my previous article). Metagenomic samples often comprise sequences from a myriad of species, necessitating the grouping of these sequences at various taxonomic levels to facilitate subsequent analyses. This grouping is referred to as metagenomic binning. In this article, I will delve into how we can categorize metagenomic sequences based on their oligonucleotide composition.
Oligonucleotide Composition
An oligonucleotide is defined as a continuous string composed of a small number of nucleotides. Computationally, we refer to these as k-mers (words of size k). The oligonucleotide composition tends to be conserved within microbial species while varying between different species [1][2]. This principle generally applies to oligonucleotides ranging from sizes of 2 (dinucleotides/2-mers) to 8 (octanucleotides/8-mers) [2]. The frequency of oligonucleotides of a specific size in a sequence provides a genomic signature for that sequence. These genomic signatures may differ between species due to various factors, including:
- DNA structure
- Replication and repair mechanisms
- Evolutionary pressures
In my comparison, I focused on 3-mers (also known as trimers or trinucleotides) and their composition. There are 4^3 = 64 possible 3-mers, which reduce to 32 distinct 3-mers when each one is combined with its reverse complement. We derive the normalized frequency of each unique trinucleotide by counting its occurrences and dividing by the total number of trinucleotides in the sequence. Using normalized frequencies helps mitigate discrepancies caused by varying sequence lengths.
\[
\text{Normalized frequency of } k_i = \frac{\text{Number of occurrences of } k_i}{\text{Total number of } k\text{-mers}}
\]
(where \( k_i \) is the i-th k-mer)
For more on obtaining k-mer frequency vectors from sequences, refer to my article on Vectorization of DNA sequences.
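As a rough illustration of the formula above, here is a minimal Python sketch that counts k-mers, merges each with its reverse complement, and divides by the total. The function names (`canonical`, `canonical_kmers`, `normalized_kmer_frequencies`) are my own for this example and are not taken from the notebook or the vectorization article; skipping windows that contain non-ACGT symbols is also an assumption.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k):
    """Sorted list of all distinct k-mers once reverse complements are merged."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def normalized_kmer_frequencies(sequence, k=3):
    """Return the frequency vector of a sequence, ordered as canonical_kmers(k)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in canonical_kmers(k)]


# Toy sequence, used only to show the shape of the output.
vector = normalized_kmer_frequencies("ATGCGCATTAGC", k=3)
print(len(vector))  # one entry per canonical trimer
print(sum(vector))  # the normalized frequencies add up to 1 (up to rounding)
```

Because every vector has the same length and sums to 1, reads of different lengths can be compared directly.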
Two Genomes Comparison
In the examples that follow, I utilized reference genomes sourced from the Nanopore GridION and PromethION Mock Microbial Community Data Community Release of the ZymoBIOMICS Microbial Community Standards.
Genomic Signatures of Two Genomes
Let’s examine a straightforward example featuring two genomes: Pseudomonas aeruginosa and Staphylococcus aureus. We can generate normalized trinucleotide frequency vectors for each genome using the method outlined in the article on Vectorization of DNA sequences. You can experiment with different k values as well; for this article, I employed \( k = 3 \).
Figure 1 shows the normalized trinucleotide frequencies for these two genomes.
It is evident that there is a distinct separation between the trinucleotide profiles of the two genomes, which can be leveraged to differentiate sequences.
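One simple way to put a number on the difference between two profiles is the Euclidean distance between their frequency vectors. The sketch below does not use the real genomes. It assumes toy random sequences with independent bases at roughly 66% and 33% GC (in the range typically reported for P. aeruginosa and S. aureus), so the values are only indicative. All function names are hypothetical.

```python
import math
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_profile(sequence):
    """Normalized frequencies of canonical 3-mers, returned as a dict."""
    counts = Counter()
    for i in range(len(sequence) - 2):
        kmer = sequence[i:i + 3]
        reverse_complement = kmer.translate(COMPLEMENT)[::-1]
        counts[min(kmer, reverse_complement)] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two profiles, treating missing k-mers as 0."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def random_sequence(length, gc_content, rng):
    """Toy sequence with independent bases at the given GC fraction (an assumption)."""
    at, gc = (1 - gc_content) / 2, gc_content / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


rng = random.Random(0)
high_gc_a = trimer_profile(random_sequence(10_000, 0.66, rng))
high_gc_b = trimer_profile(random_sequence(10_000, 0.66, rng))
low_gc = trimer_profile(random_sequence(10_000, 0.33, rng))

within = euclidean_distance(high_gc_a, high_gc_b)
between = euclidean_distance(high_gc_a, low_gc)
print(f"same composition: {within:.4f}, different composition: {between:.4f}")
```

Two sequences drawn from the same composition should sit much closer together than two drawn from different compositions, which is the property the clustering relies on.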
Clustering a Mixture of Sequences from Two Genomes
For this illustration, I simulated a dataset comprising 100 reads of length 10,000 bp from each of Pseudomonas aeruginosa and Staphylococcus aureus, using a tool named SimLoRD. Below is the sample command I executed for each reference genome:
simlord --read-reference <reference_genome> --fixed-readlength 10000 --num-reads 100 <output_prefix>
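To turn the simulated reads into vectors, they first have to be loaded. Here is a minimal sketch, assuming the reads are stored in a standard four-line-per-record FASTQ file. The function name and the file name `example_reads.fastq` are hypothetical, and the demo writes a tiny stand-in file rather than reading real SimLoRD output.

```python
import os
import tempfile


def read_fastq_sequences(path):
    """Yield (read_id, sequence) pairs from a four-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:  # end of file
                break
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality string, not needed for composition
            yield header[1:].split()[0], sequence.upper()


# Demo on a tiny temporary file standing in for simulator output.
with tempfile.TemporaryDirectory() as tmp_dir:
    fastq_path = os.path.join(tmp_dir, "example_reads.fastq")  # hypothetical name
    with open(fastq_path, "w") as handle:
        handle.write("@read_1 toy\nACGTACGT\n+\nIIIIIIII\n")
        handle.write("@read_2 toy\nggccttaa\n+\nIIIIIIII\n")
    records = list(read_fastq_sequences(fastq_path))

print(records)
```

Each sequence can then be passed to whichever k-mer counting routine you use to build its frequency vector.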
Upon generating the normalized trinucleotide frequency vectors for all reads, we can create a PCA plot (Figure 2) and a t-SNE plot (Figure 3).
The plots clearly indicate a separation between the sequences from the two species.
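For readers who want to reproduce a projection like the PCA plot, here is a minimal sketch using NumPy's SVD. It is not the notebook's code. The input matrix is a synthetic stand-in (two groups of 100 vectors with 32 features each, generated around random centres), and `pca_project` is a name chosen for this example.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic stand-ins for read vectors: two groups of 100 reads, 32 features each.
centre_a = rng.dirichlet(np.ones(32))
centre_b = rng.dirichlet(np.ones(32))
X = np.vstack([
    centre_a + rng.normal(0, 0.002, size=(100, 32)),
    centre_b + rng.normal(0, 0.002, size=(100, 32)),
])

coords, explained = pca_project(X)
print(coords.shape)  # one 2-D point per read
print(explained)     # fraction of variance captured by each component
```

The two columns of `coords` can be passed to any scatter-plot routine. For a t-SNE view, scikit-learn's `sklearn.manifold.TSNE` can be fitted on the same matrix, although its result depends on parameters such as perplexity.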
Clustering a Mixture of Sequences from Three Genomes
Another intriguing example involves three genomes: Pseudomonas aeruginosa, Staphylococcus aureus, and Escherichia coli. Each of these genomes possesses unique genomic signatures. The plots of the normalized trinucleotide frequency vectors for reads simulated from these three genomes appear as follows.
Similar to the previous analysis, there is a clear distinction among the sequences of the three species, allowing us to apply various clustering and machine learning techniques for separation.
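As one example of such a technique, the sketch below runs a small k-means implementation on synthetic vectors that stand in for reads from three species. The data, the farthest-point initialisation, and the function name `kmeans` are all assumptions made for illustration. Real read vectors are noisier and would normally be handled with a library implementation.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(1, k):
        # Pick the point farthest from all centres chosen so far.
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(dists)])
    centres = np.array(centres)
    for _ in range(n_iter):
        # Assign each point to its nearest centre, then recompute the centres.
        distances = np.linalg.norm(X[:, None] - centres[None], axis=2)
        labels = np.argmin(distances, axis=1)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


rng = np.random.default_rng(0)
# Synthetic stand-ins: 100 read vectors around each of three random centres.
true_centres = rng.dirichlet(np.ones(32), size=3)
X = np.vstack([c + rng.normal(0, 0.002, size=(100, 32)) for c in true_centres])

labels, centres = kmeans(X, k=3)
print(np.bincount(labels))  # number of reads assigned to each cluster
```

Note that k-means needs the number of clusters up front, which is rarely known for a real metagenomic sample.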
Example Tools
Several tools are available for this purpose:
- MaxBin: Utilizes tetranucleotide frequencies combined with an expectation-maximization algorithm and a probabilistic approach to bin contigs.
- MrGBP: Employs oligonucleotide composition (possibly with a slightly different representation) and DBSCAN for binning contigs.
- LikelyBin: Makes use of pentanucleotide frequencies with a Markov Chain Monte Carlo approach.
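To give a feel for density-based binning, here is a minimal sketch that applies scikit-learn's `DBSCAN` to synthetic composition vectors. It is not how MrGBP or any of the tools above is implemented. The data are random stand-ins with 136 features (the number of tetranucleotides once reverse complements are merged), and the `eps` and `min_samples` values are illustrative guesses.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Synthetic stand-ins for contig composition vectors from two genomes.
centres = rng.dirichlet(np.ones(136), size=2)
X = np.vstack([c + rng.normal(0, 0.0005, size=(50, 136)) for c in centres])

# eps and min_samples are illustrative and would need tuning on real data.
labels = DBSCAN(eps=0.02, min_samples=5).fit_predict(X)

# DBSCAN marks points it cannot assign to any cluster with the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Unlike k-means, DBSCAN does not need the number of bins in advance, but its result is sensitive to the choice of `eps`.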
What Happens if the Genomes Have Similar Composition?
There may be scenarios where two different species exhibit very similar oligonucleotide composition patterns. For instance, consider the genomes of Enterococcus faecalis and Listeria monocytogenes. Figure 6 shows the normalized trinucleotide frequencies for these two genomes.
When we plot the normalized trinucleotide frequency vectors from reads simulated from these two genomes, the figures appear as follows:
These plots illustrate the difficulty in achieving a clear separation between the two species. In such cases, additional information like species abundance may be necessary for effective clustering.
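The sketch below illustrates the idea on synthetic data: two groups share the same composition centre, so composition alone cannot separate them, but a hypothetical per-sequence coverage value (a proxy for abundance) can. The coverage values, the log transform, the 0.05 weighting, and the `separation_score` measure are all assumptions chosen for illustration, not an established method.

```python
import numpy as np


def separation_score(X, labels):
    """Gap between the two group centroids divided by the mean within-group spread."""
    a, b = X[labels == 0], X[labels == 1]
    centroid_gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = np.mean([
        np.linalg.norm(a - a.mean(axis=0), axis=1).mean(),
        np.linalg.norm(b - b.mean(axis=0), axis=1).mean(),
    ])
    return centroid_gap / spread


rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 100)

# Synthetic composition vectors drawn around one shared centre: the two
# "species" are indistinguishable by composition alone.
centre = rng.dirichlet(np.ones(32))
composition = centre + rng.normal(0, 0.002, size=(200, 32))

# Hypothetical coverage: species 0 at about 10x, species 1 at about 50x.
coverage = np.concatenate([rng.normal(10, 1, 100), rng.normal(50, 3, 100)])
# Log-scale and down-weight coverage so it is comparable to the frequencies.
coverage_feature = 0.05 * np.log(np.clip(coverage, 1e-3, None))

with_coverage = np.column_stack([composition, coverage_feature])
score_without = separation_score(composition, labels)
score_with = separation_score(with_coverage, labels)
print(f"composition only: {score_without:.2f}, with coverage: {score_with:.2f}")
```

How coverage should be scaled relative to the composition features is a design choice, and it has a large effect on the outcome.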
Conclusion
I hope this overview has provided you with insights into how we can cluster metagenomic sequences using composition-based binning methods. May this knowledge assist you in your studies, and feel free to incorporate these techniques into your research projects. I've included a notebook with the code for you to experiment with various genomes.
You can also explore more about metagenomics and the analyses I've conducted in my previous articles listed below.
Thank you for your time!
Cheers!
References
[1] Karlin, S. et al. Compositional biases of bacterial genomes and evolutionary implications. Journal of Bacteriology, 179(12), 3899–3913 (1997).
[2] Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10(8), R85 (2009).
Chapter 2: Video Insights on Sequencing Analysis
The first video titled 16s rRNA Sequencing Analysis and Visualization provides a comprehensive overview of the methodologies used in sequencing analysis, including visualization techniques crucial for understanding metagenomic data.
The second video, Course: Analyzing amplicon sequencing data with Qiime 2 - Pt. 1, offers detailed instruction on analyzing amplicon sequencing data, a targeted approach that is often used alongside shotgun metagenomic studies.