Browsing by Subject "Embeddings"
Item: Data Driven Approach to Engineering Protein Evolvability and Developability (2021-08), Golinski, Alexander

Proteins can be engineered to perform a variety of functions ranging from diagnostics and therapeutics to industrial and commercial enzymes. The ability to computationally evaluate the performance of a protein from its amino acid sequence would increase the efficiency of discovery, expanding the impact of engineered proteins. However, the problem is plagued by the immensity, complexity, and barrenness of the amino acid sequence-function landscape. The following research focuses on predicting two nontraditional protein functions: 1) evolvability, the ability to generate novel functionality through mutation of a subset of amino acid positions, and 2) developability, the ability to be efficiently manufactured while maintaining primary functionality. Limited prior understanding of these functions was available across broad swaths of sequence space. This work advanced a hybrid experimental/computational platform to provide broad and deep experimental data on sequence-function relationships. Empowered by data analytics, the dataset enabled accurate predictions and provided mechanistic insight regarding protein evolvability and developability.

The first story aimed to determine which computable biophysical properties drive evolvability. Using high-throughput screens for evolving specific molecular targeting, the performance of seventeen protein scaffolds was measured against seven molecular targets. A model predicting evolvability from biophysical properties was trained, with a focus on generalizability and interpretability. Achieving a 4/6 true positive rate, a 9/11 negative predictive value, and a 4/6 positive predictive value, analysis of the predictive model suggests that a large, disconnected paratope (the location of sequence variation) permits evolved binding function.

The second story aimed to generate a model that predicts protein developability, as determined by bacterial production, from amino acid sequence. Because traditional metrics of developability are often capacity limited (10^2 - 10^3), a set of three high-throughput (10^5) assays was created to generate a sufficient dataset. The relevance of the assays to traditional metrics was established by a model that predicts expression from assay performance 35% closer to the experimental variance, and trains 80% more efficiently, than a model predicting from sequence information alone. The validated assays offer the ability to identify developable proteins at unprecedented scale, reducing a bottleneck of protein commercialization. Neural networks were trained to generate a numeric developability representation (embedding) for each sequence from the high-throughput dataset and to transfer the embedding to predict recombinant expression. Mimicking protein theory, our deep-learning model convolves machine-learned amino acid properties to predict expression 42% closer to the experimental variance than a traditional approach (a minimal sketch of this style of architecture follows the abstract). Analysis of the trained numeric encodings of the amino acids highlights the unique capability of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity when aiming to improve developability of the protein scaffold Gp2.

The completion of these studies supports the hypothesis that data-driven protein engineering can accurately predict protein evolvability and developability while also providing meaningful insight into the properties driving functionality. The success of this approach is predicted to increase significantly as the capacity to parametrize protein function continues to grow. The research demonstrates an increased ability to engineer proteins across their diverse sequence landscape using modern experimental techniques and data analytics.
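The general idea described in this abstract, learning amino acid property vectors and convolving them along the sequence to predict expression, can be sketched as follows. This is a minimal illustrative sketch, not the dissertation's actual model: the layer sizes, vocabulary encoding, and class name are assumptions.

```python
# Minimal sketch (PyTorch) of a convolutional model over machine-learned amino acid
# properties that regresses an expression/developability score. All dimensions and
# names are illustrative assumptions, not the dissertation's architecture.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 canonical residues
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class ConvDevelopabilityModel(nn.Module):
    def __init__(self, n_props=8, n_filters=32, kernel_size=5):
        super().__init__()
        # "machine-learned amino acid properties": one learned vector per residue
        self.aa_props = nn.Embedding(len(AMINO_ACIDS), n_props)
        self.conv = nn.Conv1d(n_props, n_filters, kernel_size, padding=kernel_size // 2)
        self.head = nn.Linear(n_filters, 1)      # scalar expression prediction

    def forward(self, seq_idx):                  # seq_idx: (batch, seq_len) residue indices
        x = self.aa_props(seq_idx)               # (batch, seq_len, n_props)
        x = self.conv(x.transpose(1, 2))         # (batch, n_filters, seq_len)
        x = torch.relu(x).mean(dim=2)            # average-pool over sequence positions
        return self.head(x).squeeze(-1)          # (batch,) predicted expression

# Usage on a toy batch of equal-length sequences
seqs = ["MKTAYIAKQR", "MKTAYLAKQW"]
batch = torch.tensor([[AA_TO_IDX[aa] for aa in s] for s in seqs])
model = ConvDevelopabilityModel()
print(model(batch))                              # two predicted expression scores
```

In practice the convolutional encoder would be trained on the high-throughput assay data and the resulting embedding transferred to the expression prediction task, as the abstract describes.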
Item: Multiple Choice Question Answering using a Large Corpus of Information (2020-07), Kinney, Mitchell

The amount of natural language data is massive, and the potential to harness the information contained within has led to many recent discoveries. In this dissertation I explore one aspect of learning, with the goal of answering multiple choice questions using information from a large corpus. I chose this topic because of an internship at NASA’s Jet Propulsion Laboratory, where there is growing interest in making rovers more autonomous in their field research. Being able to process information and act correctly is a key stepping stone toward this goal, and it is an aspect this dissertation covers. The chapters comprise a review of early embedding methods and two novel approaches to building multiple choice question answering mechanisms.

In Chapter 2 I review popular algorithms for creating word and sentence embeddings from surrounding context. These embeddings are a numerical representation of the language data that can be used in downstream models such as logistic regression.

In Chapter 3 I present a novel method to create a domain-specific knowledge base that can be queried to answer multiple choice questions from a database of elementary school science questions. The knowledge base is built on a graph structure and trained using deep learning techniques. The classifier creates an embedding to represent the question and answers; this embedding is then passed through a feed-forward network to determine the probability of a correct answer (a minimal sketch of this scoring step follows the abstract). We train on questions and general information from a large corpus in a semi-supervised setting.

In Chapter 4 I propose a strategy to train a network that simultaneously classifies multiple choice questions and learns to generate words relevant to the surrounding context of the question. Using the Transformer architecture in a Generative Adversarial Network, together with an additional classifier, is a novel approach to training a network that is robust against data not seen in the training set. This semi-supervised training regimen also uses sentences from a large corpus and reinforcement learning to better inform the generator of relevant words.
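The Chapter 3 scoring mechanism described above, a question/answer embedding passed through a feed-forward network to produce answer probabilities, can be sketched as follows. This is a minimal illustrative sketch under assumed dimensions and names; it is not the dissertation's actual model, and the embeddings here are random placeholders for vectors that would come from the knowledge base.

```python
# Minimal sketch (PyTorch) of multiple choice scoring: embed the question and each
# candidate answer, concatenate the pair, score it with a feed-forward network, and
# softmax over candidates. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class MCQScorer(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),             # one score per (question, answer) pair
        )

    def forward(self, question_emb, answer_embs):
        # question_emb: (embed_dim,)   answer_embs: (n_choices, embed_dim)
        q = question_emb.expand_as(answer_embs)   # repeat the question for each choice
        scores = self.ff(torch.cat([q, answer_embs], dim=-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)      # probability assigned to each choice

# Usage with random embeddings standing in for pre-computed question/answer vectors
question = torch.randn(64)
answers = torch.randn(4, 64)                      # four multiple choice options
print(MCQScorer()(question, answers))             # probabilities summing to 1
```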