Proteins are the most essential and versatile macromolecules of life, and knowledge of their functions is crucial both for a basic understanding of the cellular processes operating in an organism and for important applications in biotechnology, such as the development of new drugs, better crops, and synthetic biochemicals such as biofuels. Recent revolutions in biotechnology have given us numerous high-throughput experimental technologies that generate very useful data, such as gene expression and protein interaction data, which provide high-resolution snapshots of complex cellular processes and a novel avenue for understanding their underlying mechanisms. In particular, several computational approaches based on the principle of Guilt by Association (GBA) have been proposed, in which the function(s) of a protein are inferred from those of other proteins that are "associated" with it in these data sets. In this thesis, we have developed several novel methods for improving the performance of these approaches by making use of unutilized and under-utilized information in genomic data sets, as well as in their associated knowledge bases. Specifically, we have developed pre-processing methods for handling data quality issues in gene expression (microarray) data sets and protein interaction networks, with the aim of enhancing the utility of these data sets for protein function prediction. We have also developed a method for incorporating the inter-relationships between functional classes, as captured by the ontologies in Gene Ontology, into classification-based protein function prediction algorithms, which enabled us to improve the quality of predictions made for several functional classes, particularly those with very few member proteins (rare classes). 
Finally, we have developed a novel association analysis-based biclustering algorithm to address two major challenges with traditional biclustering algorithms, namely performing an exhaustive search for all valid biclusters satisfying the definition specified by the algorithm, and finding small biclusters. This algorithm makes it possible to discover smaller biclusters that are more significantly enriched with specific GO terms than those produced by traditional biclustering algorithms. Overall, the methods proposed in this thesis are expected to help uncover the functions of several unannotated proteins (or genes), as shown by specific examples cited in some of the chapters.
To conclude, we also suggest several opportunities for further progress on the very important problem of protein function prediction.