TY - GEN
T1 - RepBin
T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022
AU - Xue, Hansheng
AU - Mallawaarachchi, Vijini
AU - Zhang, Yujia
AU - Rajan, Vaibhav
AU - Lin, Yu
PY - 2022/6/28
Y1 - 2022/6/28
N2 - Mixed communities of organisms are found in many environments, from the human gut to marine ecosystems, and can have a profound impact on human health and the environment. Metagenomics studies the genomic material of such communities through high-throughput sequencing that yields DNA subsequences for subsequent analysis. A fundamental problem in the standard workflow, called binning, is to discover clusters of genomic subsequences associated with the unknown constituent organisms. Inherent noise in the subsequences, various biological constraints that need to be imposed on them, and the skewed cluster size distribution exacerbate the difficulty of this unsupervised learning problem. In this paper, we present a new formulation using a graph where the nodes are subsequences and edges represent homophily information. In addition, we model biological constraints providing heterophilous signal about nodes that cannot be clustered together. We solve the binning problem by developing new algorithms for (i) graph representation learning that preserves both homophily relations and heterophily constraints and (ii) constraint-based graph clustering that addresses the problem of skewed cluster size distribution. Extensive experiments on real and synthetic datasets demonstrate that our approach, called RepBin, outperforms a wide variety of competing methods. Our constraint-based graph representation learning and clustering methods, which may also be useful in other domains, advance the state of the art in both metagenomics binning and graph representation learning.
AB - Mixed communities of organisms are found in many environments, from the human gut to marine ecosystems, and can have a profound impact on human health and the environment. Metagenomics studies the genomic material of such communities through high-throughput sequencing that yields DNA subsequences for subsequent analysis. A fundamental problem in the standard workflow, called binning, is to discover clusters of genomic subsequences associated with the unknown constituent organisms. Inherent noise in the subsequences, various biological constraints that need to be imposed on them, and the skewed cluster size distribution exacerbate the difficulty of this unsupervised learning problem. In this paper, we present a new formulation using a graph where the nodes are subsequences and edges represent homophily information. In addition, we model biological constraints providing heterophilous signal about nodes that cannot be clustered together. We solve the binning problem by developing new algorithms for (i) graph representation learning that preserves both homophily relations and heterophily constraints and (ii) constraint-based graph clustering that addresses the problem of skewed cluster size distribution. Extensive experiments on real and synthetic datasets demonstrate that our approach, called RepBin, outperforms a wide variety of competing methods. Our constraint-based graph representation learning and clustering methods, which may also be useful in other domains, advance the state of the art in both metagenomics binning and graph representation learning.
KW - RepBin
KW - mixed communities
KW - metagenomic binning
KW - metagenomics
KW - DNA subsequences
KW - organisms
UR - http://www.scopus.com/inward/record.url?scp=85143572231&partnerID=8YFLogxK
U2 - 10.1609/aaai.v36i4.20388
DO - 10.1609/aaai.v36i4.20388
M3 - Conference contribution
AN - SCOPUS:85143572231
T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
SP - 4637
EP - 4645
BT - AAAI-22 Technical Tracks 4
PB - Association for the Advancement of Artificial Intelligence
CY - Palo Alto, California
Y2 - 22 February 2022 through 1 March 2022
ER -