Data mining techniques for enhancing protein function prediction

Thumbnail Image

Persistent link to this item

View Statistics

Journal Title

Journal ISSN

Volume Title


Data mining techniques for enhancing protein function prediction

Published Date




Thesis or Dissertation


Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is crucial for obtaining a basic understanding of the cellular processes operating in an organism as well as for important applications in biotechnology, such as the development of new drugs, better crops, and synthetic biochemicals such as biofuels. Recent revolutions in biotechnology has given us numerous high-throughput experimental technologies that generate very useful data, such as gene expression and protein interaction data, that provide high-resolution snapshots of complex cellular processes and a novel avenue to understand their underlying mechanisms. In particular, several computational approaches based on the principle of Guilt by Association (GBA) have been proposed to predict the function(s) of the protein are inferred from those of other proteins that are "associated" to it in these data sets. In this thesis, we have developed several novel methods for improving the performance of these approaches by making use of the unutilized and under-utilized information in genomic data sets, as well as their associated knowledge bases. In particular, we have developed pre-processing methods for handling data quality issues with gene expression (microarray) data sets and protein interaction networks that aim to enhance the utility of these data sets for protein function prediction. We have also developed a method for incorporating the inter-relationships between functional classes, as captured by the ontologies in Gene Ontology, into classification-based protein function prediction algoriths, which enabled us to improve the quality of predictions made for several functional classes, particularly those with very few member proteins (rare classes). Finally, we have developed a novel association analysis-based biclustering algorithm to address two major challenges with traditional biclustering algorithms, namely an exhaustive search of all valid biclusters satisfying the definition specified by the algorithm, and the ability to search for small biclusters. This algorithm makes it possible to discover smaller sized biclusters that are more significantly enriched with specific GO terms than those produced by the traditional biclustering algorithms. Overall, the methods proposed in this thesis are expected to help uncover the functions of several unannotated proteins (or genes), as shown by specific examples cited in some of the chapters. To conclude, we also suggest several opportunities for further progress on the very important problem of protein function prediction


University of Minnesota Ph.D. dissertation. May 2010. Major: Computer Science. Advisor: Vipin Kumar.1 computer file (PDF); xv, 176 pages. Ill. (some col.)

Related to




Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Suggested citation

Pandey, Gaurav. (2010). Data mining techniques for enhancing protein function prediction. Retrieved from the University Digital Conservancy,

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.