Browsing by Author "Rangwala, Huzefa"
Item A Generalized Framework for Protein Sequence Annotation (2007-10-15)
Rangwala, Huzefa; Kauffman, Christopher; Karypis, George
Over the last decade several data mining techniques have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. These protein residue annotation problems are often formulated as either classification or regression problems and solved using a common set of techniques. We develop a generalized protein sequence annotation toolkit (prosat) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. We report empirical results on a diverse set of classification and regression problems: prediction of solvent accessibility, secondary structure, local structure alphabet, transmembrane helices, DNA-protein interaction sites, contact order, and regions of disorder are all explored. Our methods show either comparable or superior results to several state-of-the-art application tuned prediction methods for these problems. prosat provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems.
The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem.

Item Affinity-based Structure-Activity-Relationship Models: Improving Structure-Activity-Relationship Models by Incorporating Activity Information from Related Targets (2009-05-29)
Ning, Xia; Rangwala, Huzefa; Karypis, George
Structure-activity-relationship (SAR) models are used to inform and guide the iterative optimization of chemical leads, and play a fundamental role in modern drug discovery. In this paper we present a new class of methods for building SAR models, referred to as affinity-based, that utilize activity information from different targets. These methods first identify a set of targets that are related to the target under consideration and then employ various machine-learning techniques that utilize activity information from these targets in order to build the desired SAR model. We developed different methods for identifying the set of related targets, which take into account the primary sequence of the targets or the structure of their ligands, and we also developed different machine learning techniques that were derived by using principles of semi-supervised learning, multi-task learning, and classifier ensembles. The comprehensive evaluation of these methods shows that they lead to considerable improvements over the standard SAR models that are based only on the ligands of the target under consideration. On a set of 117 protein targets obtained from PubChem, these affinity-based methods achieve an ROC score that is on average 7.0% - 7.2% higher than that achieved by the standard SAR models.
Moreover, on a set of targets belonging to six protein families, the affinity-based methods outperform chemogenomics-based approaches by 4.33%.

Item Building Multiclass Classifiers for Remote Homology Detection and Fold Recognition (2006-04-05)
Rangwala, Huzefa; Karypis, George
Motivation: Protein remote homology prediction and recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the general multiclass remote homology prediction and fold recognition problems. Methods: We developed a number of methods for building SVM-based multiclass classification schemes in the context of protein classification. These methods include schemes that build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. Results: We performed a comprehensive study analyzing different approaches using four different datasets. Our results show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to qualitatively improve the prediction results.
Website: http://bioinfo.cs.umn.edu/supplements/mc-fold/ Keywords: fold recognition, remote homology, multiclass, hierarchical, structured learning, support vector machines.

Item fRMSDAlign: Protein Sequence Alignment Using Predicted Local Structure Information (2007-05-31)
Rangwala, Huzefa; Karypis, George
As the sequence identity between a pair of proteins decreases, alignment strategies that are based on sequence and/or sequence profiles become progressively less effective in identifying the correct structural correspondence between residue pairs. This significantly reduces the ability of comparative modeling-based approaches to build accurate structural models. Incorporating predicted information about the local structure of the protein into the alignment process holds the promise of significantly improving the alignment quality of distant proteins. This paper studies the impact on the alignment quality of a new class of predicted local structural features that measure how well fixed-length backbone fragments centered around each residue-pair align with each other. It presents a comprehensive experimental evaluation comparing these new features against existing state-of-the-art approaches utilizing profile-based and predicted secondary-structure information. It shows that for protein pairs with low sequence similarity (less than 12% sequence identity) the new structural features alone or in conjunction with profile-based information lead to alignments that are considerably better than those obtained by previous schemes.

Item fRMSDPred: Predicting local rmsd between structural fragments using sequence information (2007-04-04)
Rangwala, Huzefa; Karypis, George
The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment.
Motivated by the approaches used to align protein structures, this paper focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared to the profile-to-profile scoring schemes. Keywords: structure prediction, comparative modeling, machine learning, classification, regression

Item Improved SAR Models - Exploiting the Target-Ligand Relationships (2008-04-04)
Ning, Xia; Rangwala, Huzefa; Karypis, George
Small organic molecules, by binding to different proteins, can be used to modulate (inhibit/activate) their functions for therapeutic purposes and to elucidate the molecular mechanisms underlying biological processes. Over the decades structure-activity-relationship (SAR) models have been developed to quantify the bioactivity relationship of a chemical compound interacting with a target protein, with advances focussing on the chemical compound representation and the statistical learning methods. We have developed approaches to improve the performance of SAR models using compound activity information from different targets. The methods developed in the study aim to determine the candidacy of a target to help another target in improving the performance of its SAR model by providing supplemental activity information.
Having identified a helping target, we also develop methods to identify a subset of compounds that would improve the sensitivity of the SAR model. Identification of helping targets as well as helping compounds is performed using various nearest neighbor approaches using similarity measures derived from the targets as well as active compounds. We also developed methods that involve cross-training a series of SVM-based models for identifying the helping set of targets. Our experimental results show that our methods produce statistically significant results and incorporate the target-ligand activity relationship well.

Item Improving Homology Models for Protein-Ligand Binding Sites (2008-04-04)
Kauffman, Christopher; Rangwala, Huzefa; Karypis, George
In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available.
For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.

Item Incremental Window-based Protein Sequence Alignment Algorithms (2006-03-23)
Rangwala, Huzefa; Karypis, George
Motivation: Protein sequence alignment plays a critical role in computational biology as it is an integral part of many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. Methods: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. Results: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable to or better than those achieved by existing algorithms. Moreover, our results also show that these algorithms can be used to provide better information as to which of the aligned positions are more reliable, a critical piece of information for comparative modeling applications. Suppl. Data: http://bioinfo.cs.umn.edu/supplements/win-aln/

Item MONSTER: Minnesota prOteiN Sequence annotaTion servER (2008-04-01)
Rangwala, Huzefa; Karypis, George
Summary: MONSTER is a server for predicting the local structure and function properties of protein sequences.
MONSTER provides residue-wise annotation services that include secondary structure, transmembrane-helix region, disorder region, protein-DNA binding site, local structure alphabet, solvent accessibility surface area, and residue-wise contact order prediction. MONSTER uses sequence-derived information (in the form of PSI-BLAST profiles) and a window-based encoding scheme with an accurate kernel function to perform the classification or estimation. The user provides an amino acid sequence, selects the desired predictions, and submits a job to the MONSTER server. The results are emailed to the user as a link directing the user to a well-formatted HTML output page. Availability: http://bio.dtc.umn.edu/monster

Item Profile Based Direct Kernels for Remote Homology Detection and Fold Recognition (2005-03-31)
Rangwala, Huzefa; Karypis, George
Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Supervised learning algorithms based on support vector machines are currently the most effective method for remote homology detection. The performance of these methods depends on how the protein sequences are modeled and on the method used to compute the kernel function between them. Results: We introduce new classes of kernel functions that are constructed by directly combining automatically generated sequence profiles with new and existing approaches for determining the similarity between pairs of protein sequences, which employ effective schemes for scoring the aligned profile positions. Experiments with remote homology detection and fold recognition problems show that these kernels are capable of producing results that are substantially better than those produced by all of the existing state-of-the-art SVM-based methods.
In addition, the experiments show that these kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes.

Item PROSAT: Protein Sequence Annotation Toolkit - Software Manual (2007-12-27)
Rangwala, Huzefa; Karypis, George
We provide a generalized protein sequence annotation toolkit (prosat) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. This document is the manual for the software.

Item Protein structure and function prediction using kernel methods. (2008-08)
Rangwala, Huzefa
Experimental methods to determine the structure and function of proteins have not been able to keep up with the high-throughput sequencing technologies. As a result, there is an overabundance of protein sequence information but only a fraction of these proteins have experimentally determined structure, and an even smaller fraction have experimentally determined function. Consequently, researchers are relying on computational methods to bridge this gap between sequence and structure, and between sequence and function. In this dissertation we study and develop several algorithms that have significantly advanced the state-of-the-art computational methods for structural and functional characterization of proteins using sequence information only.
Specifically, our contributions have led to the development of methods for remote homology detection, fold recognition, sequence alignment, prediction of local structure and function of proteins, and a novel pairwise local structure similarity score estimated from sequence. We approach the problem of classifying proteins into functional or structural classes by solving the remote homology detection and fold recognition as a multiclass classification problem. We aim to identify a particular class sharing similar evolutionary characteristics (i.e., remote homologs) and similar overall structural features and shapes (i.e., folds) using sequence information. Our technique is to use support vector machines to train one-versus-rest binary classifiers (one for each class), with the key contributions leading to the development of novel profile-derived kernel functions. These kernel functions use an explicit similarity measure that scores a pair of sequences using ungapped alignment of high scoring subsequences or a standard local alignment. Our kernel functions have been shown by a set of independent evaluators to be state-of-the-art prediction methods on a common benchmark. We also present and study algorithms to solve the k-way multiclass classification problem within the context of remote homology detection and fold recognition. We show that a low error rate can be achieved by integrating the prediction outputs of the highly accurate one-versus-rest classifiers by learning weight parameters using large margin principles. We are also able to integrate hierarchical information prevalent in these structure databases effectively. Motivated by the success of our string kernels, we also develop a new approach for sequence alignment that incrementally aligns the best profile-profile scored short subsequences.
This algorithm shows accuracy comparable to the standard dynamic-programming based algorithms but also aligns several more residue-pairs classified as reliable, aiding the transfer of functional and structural characteristics from known proteins. This also helps in producing high quality homology-based modeled proteins. In this thesis we also introduce a novel local structure similarity score estimated from sequence using a support vector machine framework. This score, called fRMSD, is the root mean square deviation between structure fragment pairs and forms the basis of several structure alignment algorithms. Sequence-based fRMSD estimation has several potential applications, one of which is improving the accuracy of sequence alignment algorithms, which leads to improved homology-based protein models. A case study presented in this thesis shows this predicted local structure similarity score to be effective in improving the accuracy of sequence alignments, especially when the identity between sequence pairs is less than 12%. One of the major contributions in prediction of fRMSD scores has been the development of a new kernel function that better captures pairwise interaction information within sequence and has shown superiority in comparison to the standard radial basis kernel function. Over the last decade several prediction methods have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. We also present a generalized protein sequence annotation toolkit (PROSAT) for solving classification or regression problems using support vector machines. The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework.
We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. PROSAT has shown performance comparable to, and in some cases better than, competing custom-tailored methods for a wide range of annotation problems. PROSAT provides practitioners an efficient and easy-to-use tool, the results of which can be used to assist in solving the overarching 3D structure prediction problem. The algorithms and methods presented here can be used to improve the various steps of a comparative modeling server, ranging from template identification to alignment and quality assessment. (Abstract shortened by UMI.)

Item Protein Structure Prediction using String Kernels (2006-03-03)
Rangwala, Huzefa; DeRonne, Kevin; Karypis, George
With recent advances in large-scale sequencing technologies, we have seen an exponential growth in protein sequence information. Currently, our ability to produce sequence information far out-paces the rate at which we can produce structural and functional information. Consequently, researchers increasingly rely on computational techniques to extract useful information from known structures contained in large databases, though such approaches remain incomplete. As such, unraveling the relationship between pure sequence information and three-dimensional structure remains one of the great fundamental problems in molecular biology. In this report we aim to show several ways in which researchers try to characterize the structural, functional and evolutionary nature of proteins. Specifically, we focus on three common prediction problems: secondary structure prediction, remote homology detection, and fold prediction.
We describe a class of methods employing large margin classifiers with novel kernel functions for solving these problems, supplemented with a thorough evaluation study.

Item TOPTMH: Topology Predictor for Transmembrane Alpha-Helices (2008-02-13)
Ahmed, Rezwan; Rangwala, Huzefa; Karypis, George
Alpha-helical transmembrane proteins mediate many key biological processes and represent 20-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods that predict their topology (transmembrane helical segments and their orientation) are essential in advancing the understanding of membrane protein structures and functions. We developed a new topology prediction method for transmembrane helices called TOPTMH that combines a helix residue predictor with a helix segment identification method and determines the overall orientation using the positive-inside rule. The residue predictor is built using Support Vector Machines (SVM) that utilize evolutionary information in the form of PSI-BLAST generated sequence profiles to annotate each residue by its likelihood of being part of a helix segment. The helix segment identification method is built by combining the segments predicted by two Hidden Markov Models (HMM), one based on the SVM predictions and the other based on the hydrophobicity values of the sequence's amino acids. This approach combines the power of SVM-based models to discriminate between the helical and non-helical residues with the power of HMMs to identify contiguous segments of helical residues that take into account the SVM predictions and the hydrophobicity values of neighboring residues. We present empirical results on two standard datasets and show that both the per-residue (Q2) and per-segment (Qok) scores obtained by TOPTMH are higher than those achieved by well-known methods such as Phobius and MEMSAT3.
In addition, on an independent static benchmark, TOPTMH achieved the highest scores on high-resolution sequences (Q2 score of 84% and Qok score of 86%) against existing state-of-the-art systems while achieving low signal peptide error.

Item Whole genome alignments using MPI-LAGAN (2008-06-19)
Zhang, Ruinan; Rangwala, Huzefa; Karypis, George
Advances in sequencing technologies have substantially increased the number of fully sequenced genomes. Alignment algorithms play a crucial role in analyzing whole genomes, identifying similar and conserved regions between pairs of genomes, leading to annotation of genomes with site-specific properties and functions. In this work we introduce a parallel algorithm for a widely used whole genome alignment method called LAGAN. We use an MPI-based protocol to develop parallel solutions for two phases of the algorithm which take up a significant portion of the total runtime and also have a high memory requirement. The serial LAGAN program uses CHAOS to quickly determine initial anchors or seeds, which are extended using a sparse dynamic programming based longest-increasing subsequence (LIS) method. Our work involves parallelizing the CHAOS and LIS phases of the algorithm using a one-dimensional block cyclic partitioning of the computation. This leads to an efficient algorithm that utilizes the processors in a balanced way. We also minimize the time spent in communication or transfer of information across processors. We also report an experimental evaluation of our parallel implementation using pairs of human contigs of varying lengths. We discuss and illustrate the challenges faced in parallelizing a sparse dynamic programming formulation as in this work, and show speedups equivalent to the theoretical ones for our parallelized phases of the LAGAN algorithm.
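
As a rough illustration of the one-dimensional block cyclic partitioning mentioned in the MPI-LAGAN abstract, the sketch below shows how work items could be dealt out to processors in fixed-size blocks, round-robin. It is not taken from MPI-LAGAN and makes no MPI calls. The function names, the block size, and the processor count are hypothetical and chosen only for the example.

```python
def block_cyclic_owner(index: int, block_size: int, num_procs: int) -> int:
    """Return the rank that owns work item `index` (hypothetical helper)."""
    # Items are grouped into consecutive blocks; blocks go to ranks round-robin.
    return (index // block_size) % num_procs


def block_cyclic_indices(rank: int, n_items: int, block_size: int, num_procs: int) -> list[int]:
    """Return every work-item index assigned to `rank` (hypothetical helper)."""
    owned = []
    # Rank r receives blocks r, r + P, r + 2P, ... where P is the number of ranks.
    stride = block_size * num_procs
    for block_start in range(rank * block_size, n_items, stride):
        block_end = min(block_start + block_size, n_items)  # last block may be short
        owned.extend(range(block_start, block_end))
    return owned


if __name__ == "__main__":
    # Assumed toy setting: 10 work items, blocks of 2, 3 processors.
    for rank in range(3):
        print(rank, block_cyclic_indices(rank, n_items=10, block_size=2, num_procs=3))
```

Because the blocks are interleaved, each processor receives work from across the whole index range, which is the usual reason this layout is chosen when the cost per item varies along the sequence.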