Experimental methods for determining the structure and function of proteins have not kept pace with high-throughput sequencing technologies. As a result, there is an overabundance of protein sequence information, but only a fraction of these proteins have an experimentally determined structure, and an even smaller fraction have an experimentally determined function. Consequently, researchers rely on computational methods to bridge the gap between sequence and structure, and between sequence and function.
In this dissertation we study and develop several algorithms that have significantly advanced the state of the art in computational methods for the structural and functional characterization of proteins using sequence information only. Specifically, our contributions have led to methods for remote homology detection, fold recognition, sequence alignment, and prediction of the local structure and function of proteins, as well as a novel pairwise local structure similarity score estimated from sequence.
We approach the problem of classifying proteins into functional or structural classes by formulating remote homology detection and fold recognition as multiclass classification problems. We aim to identify, using sequence information alone, the class whose members share similar evolutionary characteristics (i.e., remote homologs) or similar overall structural features and shapes (i.e., folds). Our technique uses support vector machines to train one-versus-rest binary classifiers (one for each class); the key contribution is a set of novel profile-derived kernel functions. These kernel functions use an explicit similarity measure that scores a pair of sequences using either ungapped alignments of high-scoring subsequences or a standard local alignment. A set of independent evaluators has shown our kernel functions to be state-of-the-art prediction methods on a common benchmark.
We also present and study algorithms for solving the k-way multiclass classification problem in the context of remote homology detection and fold recognition. We show that a low error rate can be achieved by integrating the prediction outputs of the highly accurate one-versus-rest classifiers, with the weight parameters learned using large-margin principles. We are also able to effectively integrate the hierarchical information prevalent in protein structure databases.
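As a rough illustration of how one-versus-rest outputs can be integrated into a single k-way prediction, the sketch below scales each classifier's output by a per-class weight and picks the largest result. The function name `predict`, the example outputs, and the weight values are hypothetical; in the work described here the weights are learned with large-margin principles, which this sketch does not attempt to reproduce.

```python
def predict(outputs, weights=None):
    """Return the index of the class with the largest (optionally weighted) output.

    outputs: one real-valued decision score per one-versus-rest classifier.
    weights: one scale factor per class (hypothetical values in this sketch;
             they stand in for parameters that would be learned from data).
    """
    if weights is None:
        # Plain one-versus-rest rule: every classifier is trusted equally.
        weights = [1.0] * len(outputs)
    scored = [w * f for w, f in zip(weights, outputs)]
    return max(range(len(scored)), key=scored.__getitem__)


# Hypothetical decision scores from three one-versus-rest classifiers.
outputs = [0.2, 0.5, -0.4]
# Hypothetical weights that compensate for an under-scaled classifier 0.
hypothetical_weights = [3.0, 1.0, 1.0]

print(predict(outputs))                        # unweighted choice
print(predict(outputs, hypothetical_weights))  # weighted choice
```

The point of the sketch is only that binary classifiers trained independently need not produce scores on a common scale, so a learned rescaling can change, and possibly correct, the winning class.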
Motivated by the success of our string kernels, we also develop a new approach to sequence alignment that incrementally aligns the short subsequences with the best profile-profile scores. This algorithm achieves accuracy comparable to that of standard dynamic-programming-based algorithms, and it also aligns several more residue pairs classified as reliable, which aids the transfer of functional and structural characteristics from known proteins. This in turn helps produce high-quality homology-based protein models.
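The idea of incrementally aligning the best-scoring short subsequences can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the algorithm developed in the dissertation: the fixed segment length `w`, the identity-based `identity_score` (used in place of a profile-profile score), and the order-preserving `compatible` rule are all hypothetical choices made for the example.

```python
def identity_score(x, y):
    """Toy residue score (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def compatible(seg, other, w):
    """True if two ungapped segment pairs neither overlap nor cross."""
    i, j = seg
    ci, cj = other
    return (i + w <= ci and j + w <= cj) or (ci + w <= i and cj + w <= j)


def greedy_segment_alignment(seq_a, seq_b, w, score=identity_score):
    """Greedily select ungapped length-w segment pairs, best score first."""
    candidates = []
    for i in range(len(seq_a) - w + 1):
        for j in range(len(seq_b) - w + 1):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(w))
            if total > 0:  # keep only positively scoring segment pairs
                candidates.append((total, i, j))
    # Highest score first; ties broken by position for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    chosen = []
    for _, i, j in candidates:
        # Accept a segment pair only if it is consistent with all earlier picks.
        if all(compatible((i, j), seg, w) for seg in chosen):
            chosen.append((i, j))
    return sorted(chosen)


# Each (i, j) means seq_a[i:i+w] is aligned, without gaps, to seq_b[j:j+w].
pairs = greedy_segment_alignment("ACDEFG", "XXACDYYEFG", w=3)
print(pairs)
```

Unlike dynamic programming, a greedy scheme of this kind commits to the strongest local matches first and leaves weakly supported regions unaligned.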
In this thesis we also introduce a novel local structure similarity score estimated from sequence using a support vector machine framework. This score, called fRMSD, is the root mean square deviation between structure fragment pairs, and it forms the basis of several structure alignment algorithms. Sequence-based fRMSD estimation has several potential applications, one of which is improving the accuracy of sequence alignment algorithms, which in turn leads to improved homology-based protein models. A case study presented in this thesis shows that the predicted local structure similarity score is effective in improving the accuracy of sequence alignments, especially when the identity between sequence pairs is less than 12%. One of the major contributions to the prediction of fRMSD scores is a new kernel function that better captures pairwise interaction information within a sequence and outperforms the standard radial basis kernel function.
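For concreteness, the quantity being estimated, a root mean square deviation between two structure fragments, can be computed from coordinates as sketched below. The sketch assumes the two fragments have equal length and are already superimposed; it performs no rotation or translation fitting. The function name `fragment_rmsd` and the toy coordinates are hypothetical.

```python
import math


def fragment_rmsd(frag_a, frag_b):
    """RMSD between two equal-length fragments given as lists of (x, y, z).

    Assumption: the fragments are already superimposed, so no optimal
    rotation or translation is computed here.
    """
    if len(frag_a) != len(frag_b) or not frag_a:
        raise ValueError("fragments must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(frag_a, frag_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(frag_a))


# Two toy three-residue fragments; the second is the first shifted by 1 along y.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(fragment_rmsd(a, b))
```

The prediction task described above is to estimate this kind of value for a pair of fragments from sequence information alone, without access to the coordinates.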
Over the last decade, several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. We present a generalized protein sequence annotation toolkit (PROSAT) for solving such classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of the previously developed normalized second-order exponential kernel function and experiment with local window-based information at different levels of granularity. PROSAT has shown performance comparable to, and in some cases better than, that of competing custom-tailored methods across a wide range of annotation problems. PROSAT provides practitioners with an efficient and easy-to-use tool whose results can assist in solving the overarching 3D structure prediction problem.
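Window-based encoding of a residue's local environment can be illustrated as follows. This is a minimal sketch, not PROSAT's actual interface: the function name `window_features`, the zero-padding at the sequence ends, and the two-column toy profile are assumptions made for the example.

```python
def window_features(profile, center, w):
    """Concatenate the rows of `profile` for positions center-w .. center+w.

    profile: one feature row per residue (for example, a sequence profile).
    Positions that fall off either end of the sequence are zero-padded
    (an assumption of this sketch).
    """
    width = len(profile[0])
    features = []
    for pos in range(center - w, center + w + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * width)
    return features


# Toy profile: three residues, two features per residue.
profile = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(window_features(profile, center=0, w=1))
print(window_features(profile, center=1, w=1))
```

Each residue then yields a fixed-length vector of (2w + 1) rows, which is the kind of input a kernel function can compare across residues; varying `w` is one way to explore different levels of granularity.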
The algorithms and methods presented here can be used to improve the various steps of a comparative modeling server, ranging from template identification and alignment to quality assessment. (Abstract shortened by UMI.)