Browsing by Subject "Protein structure prediction"
Now showing 1 - 2 of 2
Item Computational methods for protein structure prediction and energy minimization (2013-07) Kauffman, Christopher Daniel

The importance of proteins in biological systems cannot be overstated: genetic defects manifest themselves in misfolded proteins with tremendous human cost, drugs in turn target proteins to cure diseases, and our ability to accurately predict the behavior of designed proteins has allowed us to manufacture biological materials from engineered micro-organisms. All of these areas stand to benefit from fundamental improvements in computer modeling of protein structures. Due to the richness and complexity of protein structure data, it is a fruitful area to demonstrate the power of machine learning. In this dissertation we address three areas of structural bioinformatics with machine learning tools. Where current approaches are limited, we derive new solution methods via optimization theory.

Identifying residues that interact with ligands is useful as a first step to understanding protein function and as an aid to designing small molecules that target the protein for interaction. Several studies have shown sequence features are very informative for this type of prediction while structure features have also been useful when structure is available. In the first major topic of this dissertation, we develop a sequence-based method, called LIBRUS, that combines homology-based transfer and direct prediction using machine learning. We compare it to previous sequence-based work and current structure-based methods. Our analysis shows that homology-based transfer is slightly more discriminating than a support vector machine learner using profiles and predicted secondary structure. We combine these two approaches in a method called LIBRUS. On a benchmark of 885 sequence-independent proteins, it achieves an area under the ROC curve (ROC) of 0.83 with 45% precision at 50% recall, a significant improvement over previous sequence-based efforts.
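For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an area under the ROC curve and a precision at a fixed recall level can be computed from per-residue scores. The function names, the toy labels, and the toy scores are illustrative assumptions; this is not the evaluation code used in the dissertation, and tied scores are handled only in the simplest way.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(labels, scores, target_recall=0.5):
    """Precision at the first rank cutoff whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Walk down the list from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical toy data: 1 marks a ligand-binding residue, 0 a non-binding one.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(roc_auc(toy_labels, toy_scores))
print(precision_at_recall(toy_labels, toy_scores, 0.5))
```

The pairwise formulation of the AUC is quadratic in the number of residues, which is fine for illustration; a sort-based rank computation would be preferable on a benchmark-sized data set.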
On an independent benchmark set, a current method, FINDSITE, based on structure features achieves an ROC of 0.81 with 54% precision at 50% recall, while LIBRUS achieves an ROC of 0.82 with 39% precision at 50% recall at a smaller computational cost. When LIBRUS and FINDSITE predictions are combined, performance increases beyond that of either method alone, reaching an ROC of 0.86 and 59% precision at 50% recall.

Coarse-grained models for protein structure are increasingly utilized in simulations and structural bioinformatics to avoid the cost associated with including all atoms. Currently there is little consensus as to what accuracy is lost transitioning from all-atom to coarse-grained models or how best to select the level of coarseness. The second major thrust of this dissertation is employing machine learning tools to address these two issues. We first illustrate how binary classifiers and ranking methods can be used to evaluate coarse-, medium-, and fine-grained protein models for their ability to discriminate between correctly and incorrectly folded structures. Through regularization and feature selection, we are able to determine the trade-offs associated with coarse models and their associated energy functions. We also propose an optimization method capable of creating a mixed representation of the protein from multiple granularities. The method utilizes a hinge loss similar to support vector machines and a max/L1 group regularization term to perform feature selection. Solutions are found for the whole regularization path using subgradient optimization. We illustrate its behavior on decoy discrimination and discuss implications for data-driven protein model selection.

Finally, identifying the folded structure of a protein with a given sequence is often cast as a global optimization problem. One seeks the structural conformation that minimizes an energy function as it is believed the native states of naturally occurring proteins are at the global minimum of nature's energy function.
In mathematical programming, convex optimization is the tool of choice for the speedy solution of global optimization problems. In the final section of this dissertation we introduce a framework, dubbed Marie, which formulates protein folding as a convex optimization problem. Protein structures are represented using convex constraints with a few well-defined nonconvexities that can be handled. Marie trades away the ability to observe the dynamics of the system but gains tremendous speed in searching for a single low-energy structure. Several convex energy functions that mirror standard energy functions are established so that Marie performs energy minimization by solving a series of semidefinite programs. Marie's speed allows us to study a wide range of parameters defining a Go-like potential where energy is based solely on native contacts. We also implement an energy function affecting hydrophobic collapse, thought to be a primary driving force in protein folding. We study several variants and find that they are insufficient to reproduce native structures due in part to native structures adopting non-spherical conformations.

Item Protein structure and function prediction using kernel methods. (2008-08) Rangwala, Huzefa

Experimental methods to determine the structure and function of proteins have not been able to keep up with the high-throughput sequencing technologies. As a result, there is an overabundance of protein sequence information, but only a fraction of these proteins have experimentally determined structure, and an even smaller fraction have experimentally determined function. Consequently, researchers are relying on computational methods to bridge this gap between sequence and structure, and between sequence and function. In this dissertation we study and develop several algorithms that have significantly advanced the state-of-the-art computational methods for structural and functional characterization of proteins using sequence information only.
Specifically, our contributions have led to the development of methods for remote homology detection, fold recognition, sequence alignment, prediction of local structure and function of proteins, and a novel pairwise local structure similarity score estimated from sequence. We approach the problem of classifying proteins into functional or structural classes by solving remote homology detection and fold recognition as a multiclass classification problem. We aim to identify a particular class sharing similar evolutionary characteristics (i.e., remote homologs) and similar overall structural features and shapes (i.e., folds) using sequence information. Our technique is to use support vector machines to train one-versus-rest binary classifiers (one for each class), with the key contributions leading to the development of novel profile-derived kernel functions. These kernel functions use an explicit similarity measure that scores a pair of sequences using ungapped alignment of high-scoring subsequences or a standard local alignment. Our kernel functions have been shown by a set of independent evaluators to be state-of-the-art prediction methods on a common benchmark. We also present and study algorithms to solve the k-way multiclass classification problem within the context of remote homology detection and fold recognition. We show that a low error rate can be achieved by integrating the prediction outputs of the highly accurate one-versus-rest classifiers by learning weight parameters using large margin principles. We are also able to integrate hierarchical information prevalent in these structure databases effectively. Motivated by the success of our string kernels, we also develop a new approach for sequence alignment that incrementally aligns the best profile-profile scored short subsequences.
This algorithm shows comparable accuracy to the standard dynamic-programming based algorithms but also aligns several more residue pairs classified as reliable, aiding the transfer of functional and structural characteristics from known proteins. This also helps in producing high-quality homology-based protein models. In this thesis we also introduce a novel local structure similarity score estimated from sequence using a support vector machine framework. This score, called fRMSD, is the root mean square deviation between structure fragment pairs and forms the basis of several structure alignment algorithms. Sequence-based fRMSD estimation has several potential applications, one of which is improving the accuracy of sequence alignment algorithms, which in turn leads to improved homology-based protein models. A case study presented in this thesis shows that this predicted local structure similarity score is effective in improving the accuracy of sequence alignments, especially when the identity between sequence pairs is less than 12%. One of the major contributions in prediction of fRMSD scores has been the development of a new kernel function that better captures pairwise interaction information within sequence and has shown superiority in comparison to the standard radial basis kernel function. Over the last decade several prediction methods have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. We also present a generalized protein sequence annotation toolkit (PROSAT) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework.
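As an illustration of the window-based idea described above, the following minimal sketch builds one feature vector per residue by concatenating the per-residue features found in a window centered on that residue. The function name, the zero-padding at the sequence ends, and the toy two-dimensional features are assumptions made for this example; the abstract does not specify how PROSAT encodes its windows.

```python
def window_features(per_residue, half_width):
    """Concatenate the features of the 2 * half_width + 1 residues around each position.

    per_residue: list of equal-length feature lists, one per residue
                 (for example, one profile column per residue).
    Window slots that fall off either end of the sequence are zero-padded
    (an assumption of this sketch).
    """
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    n = len(per_residue)
    dim = len(per_residue[0]) if n else 0
    pad = [0.0] * dim
    windows = []
    for i in range(n):
        vec = []
        for j in range(i - half_width, i + half_width + 1):
            # Use the neighbor's features when it exists, padding otherwise.
            vec.extend(per_residue[j] if 0 <= j < n else pad)
        windows.append(vec)
    return windows


# Hypothetical toy input: three residues with two features each.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in window_features(toy, half_width=1):
    print(row)
```

Each output vector has length (2 * half_width + 1) times the per-residue feature dimension, so a larger window gives the learner more context at the cost of a longer input.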
We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. PROSAT has shown performance comparable to, and in some cases better than, competing custom-tailored methods for a wide range of annotation problems. PROSAT provides practitioners an efficient and easy-to-use tool, the results of which can be used to assist in solving the overarching 3D structure prediction problem. The algorithms and methods presented here can be used to improve the various steps of a comparative modeling server, including template identification, alignment, and quality assessment. (Abstract shortened by UMI.)