TY - JOUR
T1 - CH-Bin
T2 - A convex hull based approach for binning metagenomic contigs
AU - Chandrasiri, Sunera
AU - Perera, Thumula
AU - Dilhara, Anjala
AU - Perera, Indika
AU - Mallawaarachchi, Vijini
PY - 2022/10
Y1 - 2022/10
N2 - Metagenomics has enabled culture-independent analysis of micro-organisms present in environmental samples. Metagenomic binning, which groups contigs into bins that represent different taxonomic groups, is an important step performed after assembly in a typical metagenomic workflow. Most metagenomic binning tools represent the composition and coverage information of contigs as feature vectors with a large number of dimensions. However, these tools use traditional Euclidean or Manhattan distance metrics, which become unreliable in high-dimensional space. We propose CH-Bin, a binning approach that leverages the benefits of convex hull distance for binning contigs represented by high-dimensional feature vectors. Using experimental evidence on simulated and real datasets, we demonstrate that representing contigs with high-dimensional feature vectors can preserve additional information and yield improved binning results. We further demonstrate that the convex hull distance based binning approach can be used effectively to bin such high-dimensional data. To the best of our knowledge, this is the first time that composition information from oligonucleotides of multiple sizes has been used to represent the composition of contigs, and the first time that a convex hull distance based binning algorithm has been used to bin metagenomic contigs. The source code of CH-Bin is available at https://github.com/kdsuneraavinash/CH-Bin.
KW - Clustering algorithm
KW - Convex hull
KW - Convex hull distance
KW - High dimensional data clustering
KW - Metagenomic binning
KW - Multiple k values
UR - http://www.scopus.com/inward/record.url?scp=85135886788&partnerID=8YFLogxK
U2 - 10.1016/j.compbiolchem.2022.107734
DO - 10.1016/j.compbiolchem.2022.107734
M3 - Article
AN - SCOPUS:85135886788
SN - 1476-9271
VL - 100
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
M1 - 107734
ER -