TY - UNPB
T1 - What the protein!? Computational methods for predicting microbial protein functions
AU - Grigson, Susanna
AU - Edwards, Robert
PY - 2023/4/27
Y1 - 2023/4/27
N2 - The identification of protein functions is crucial for understanding microbial life at a molecular scale. While computational methods for annotating protein sequences have greatly advanced in recent years, 30% of all bacterial and 65% of all viral protein sequences cannot be assigned a known biological function. As a result, protein function inference remains a fundamental challenge in computational biology. This paper reviews various bioinformatics methods for annotating microbial and viral proteins, categorised into homology-based and homology-free approaches. Widely used homology-based methods encompass sequence similarity searches such as BLAST and profile hidden Markov models, both of which compare novel protein sequences to databases of protein sequences with known functions. These homology-based methods have limitations, particularly for viral sequences, which are severely underrepresented in protein sequence databases. As a result, homology-free methods, including numerical feature extraction, language-based models, guilt-by-association, and protein structure prediction software, offer potential alternatives. It is also important to critically consider the functional labels used to describe protein functions, and the hierarchical organisation of those labels, regardless of the annotation method implemented. This review highlights that a combination of multiple functional prediction strategies, including machine learning, may provide the best improvements for microbial protein annotation and alleviate the ever-expanding sequence-function gap affecting microbial proteins. Overall, we provide experimental biologists with a comprehensive overview of annotation methods and inform computational scientists of open challenges and future research avenues.
AB - The identification of protein functions is crucial for understanding microbial life at a molecular scale. While computational methods for annotating protein sequences have greatly advanced in recent years, 30% of all bacterial and 65% of all viral protein sequences cannot be assigned a known biological function. As a result, protein function inference remains a fundamental challenge in computational biology. This paper reviews various bioinformatics methods for annotating microbial and viral proteins, categorised into homology-based and homology-free approaches. Widely used homology-based methods encompass sequence similarity searches such as BLAST and profile hidden Markov models, both of which compare novel protein sequences to databases of protein sequences with known functions. These homology-based methods have limitations, particularly for viral sequences, which are severely underrepresented in protein sequence databases. As a result, homology-free methods, including numerical feature extraction, language-based models, guilt-by-association, and protein structure prediction software, offer potential alternatives. It is also important to critically consider the functional labels used to describe protein functions, and the hierarchical organisation of those labels, regardless of the annotation method implemented. This review highlights that a combination of multiple functional prediction strategies, including machine learning, may provide the best improvements for microbial protein annotation and alleviate the ever-expanding sequence-function gap affecting microbial proteins. Overall, we provide experimental biologists with a comprehensive overview of annotation methods and inform computational scientists of open challenges and future research avenues.
KW - microbiology
KW - protein function prediction
KW - sequence annotation
KW - machine learning
KW - proteomics
UR - http://purl.org/au-research/grants/ARC/DP220102915
U2 - 10.31219/osf.io/jhmta
DO - 10.31219/osf.io/jhmta
M3 - Preprint
BT - What the protein!? Computational methods for predicting microbial protein functions
PB - OSF Preprints
ER -