Metagenomics is a primary tool for the description of microbial and viral communities. The sheer magnitude of the data generated in each metagenome makes identifying key differences in the function and taxonomy between communities difficult to elucidate. Here we discuss the application of seven different data mining and statistical analyses by comparing and contrasting the metabolic functions of 212 microbial metagenomes within and between 10 environments. Not all approaches are appropriate for all questions, and researchers should decide which approach addresses their questions.This work demonstrated the use of each approach: for example, random forests provided a robust and enlightening description of both the clustering of metagenomes and the metabolic processes that were important in separating microbial communities from different environments. All analyses identified that the presence of phage genes within the microbial community was a predictor of whether the microbial community was host-associated or free-living. Several analyses identified the subtle differences that occur with environments, such as those seen in different regions of the marine environment.
Bibliographical noteThis is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.
- Canonical discriminant analysis
- Principal component analysis
- Random forest