Computational methods for protein structure prediction and energy minimization

The importance of proteins in biological systems cannot be overstated: genetic defects manifest themselves in misfolded proteins with tremendous human cost, drugs in turn target proteins to cure diseases, and our ability to accurately predict the behavior of designed proteins has allowed us to manufacture biological materials from engineered micro-organisms. All of these areas stand to benefit from fundamental improvements in computer modeling of protein structures. Because protein structure data are rich and complex, the field is a fruitful one in which to demonstrate the power of machine learning. In this dissertation we address three areas of structural bioinformatics with machine learning tools. Where current approaches are limited, we derive new solution methods via optimization theory.

Identifying residues that interact with ligands is useful as a first step to understanding protein function and as an aid to designing small molecules that target the protein for interaction. Several studies have shown that sequence features are very informative for this type of prediction, while structure features have also been useful when structure is available. In the first major topic of this dissertation, we develop a sequence-based method, called LIBRUS, that combines homology-based transfer and direct prediction using machine learning. We compare it to previous sequence-based work and current structure-based methods. Our analysis shows that homology-based transfer is slightly more discriminating than a support vector machine learner using profiles and predicted secondary structure. LIBRUS combines these two approaches. On a benchmark of 885 sequence-independent proteins, it achieves an area under the ROC curve (ROC) of 0.83 with 45% precision at 50% recall, a significant improvement over previous sequence-based efforts. On an independent benchmark set, a current method based on structure features, FINDSITE, achieves an ROC of 0.81 with 54% precision at 50% recall, while LIBRUS achieves an ROC of 0.82 with 39% precision at 50% recall at a smaller computational cost. When LIBRUS and FINDSITE predictions are combined, performance increases beyond that of either method alone, reaching an ROC of 0.86 and 59% precision at 50% recall.

Coarse-grained models for protein structure are increasingly utilized in simulations and structural bioinformatics to avoid the cost associated with including all atoms. Currently there is little consensus as to what accuracy is lost in transitioning from all-atom to coarse-grained models or how best to select the level of coarseness. The second major thrust of this dissertation employs machine learning tools to address these two issues. We first illustrate how binary classifiers and ranking methods can be used to evaluate coarse-, medium-, and fine-grained protein models for their ability to discriminate between correctly and incorrectly folded structures. Through regularization and feature selection, we are able to determine the trade-offs associated with coarse models and their associated energy functions. We also propose an optimization method capable of creating a mixed representation of the protein from multiple granularities. The method utilizes a hinge loss similar to that of support vector machines and a max/L1 group regularization term to perform feature selection. Solutions are found for the whole regularization path using subgradient optimization. We illustrate its behavior on decoy discrimination and discuss implications for data-driven protein model selection.

Finally, identifying the folded structure of a protein with a given sequence is often cast as a global optimization problem. One seeks the structural conformation that minimizes an energy function, as it is believed that the native states of naturally occurring proteins are at the global minimum of nature's energy function. In mathematical programming, convex optimization is the tool of choice for the speedy solution of global optimization problems. In the final section of this dissertation we introduce a framework, dubbed Marie, which formulates protein folding as a convex optimization problem. Protein structures are represented using convex constraints with a few well-defined nonconvexities that can be handled. Marie trades away the ability to observe the dynamics of the system but gains tremendous speed in searching for a single low-energy structure. Several convex energy functions that mirror standard energy functions are established so that Marie performs energy minimization by solving a series of semidefinite programs. Marie's speed allows us to study a wide range of parameters defining a Go-like potential, where energy is based solely on native contacts. We also implement an energy function affecting hydrophobic collapse, thought to be a primary driving force in protein folding. We study several variants and find that they are insufficient to reproduce native structures, due in part to native structures adopting non-spherical conformations.
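To make the idea of a Go-like potential concrete, the following is a minimal sketch of an energy that depends solely on native contacts. It is not the convex formulation used in Marie. The function names, the 8.0 distance cutoff, the minimum sequence separation of 3, and the toy coordinates are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def native_contacts(native_coords, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j) that are within `cutoff` in the native structure.

    Pairs closer than `min_separation` along the chain are skipped, since
    chain neighbors are always close. Both parameters are assumed values.
    """
    contacts = set()
    for i, j in combinations(range(len(native_coords)), 2):
        close = math.dist(native_coords[i], native_coords[j]) <= cutoff
        if j - i >= min_separation and close:
            contacts.add((i, j))
    return contacts


def go_energy(coords, contacts, cutoff=8.0, epsilon=1.0):
    """Go-like energy: -epsilon for every native contact formed in `coords`.

    Non-native pairs contribute nothing, so the native structure attains
    the lowest possible value by construction.
    """
    formed = sum(
        1 for i, j in contacts if math.dist(coords[i], coords[j]) <= cutoff
    )
    return -epsilon * formed


# Toy one-point-per-residue structures (hypothetical coordinates).
# "native" is a hairpin-like U shape; "extended" is a straight chain.
native = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0),
          (7.6, 3.8, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
extended = [(3.8 * k, 0, 0) for k in range(6)]

contacts = native_contacts(native)
native_energy = go_energy(native, contacts)
extended_energy = go_energy(extended, contacts)
```

In this toy case the folded shape forms all four of its native contacts while the extended chain forms none, so the folded shape receives the lower energy. A real Go-like potential would typically use smoother distance-dependent terms and tunable parameters, such as those the abstract says were studied.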

Keywords

Machine learning

Protein decoys

Protein structure prediction

Semidefinite programming

Structural biology

Description

University of Minnesota Ph.D. dissertation. July 2013. Major: Computer science. Advisor: George Karypis. 1 computer file (PDF); xi, 141 pages.

Collections

Dissertations

Suggested citation

Kauffman, Christopher Daniel. (2013). Computational methods for protein structure prediction and energy minimization. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/158523.

