Data driven protein scaffold developability engineering
Abstract
For a protein to be a compelling industrial or therapeutic candidate, it must be both functional and able to withstand time and handling while maintaining that function. The latter attribute is the protein’s developability profile: the protease stability, thermal stability, flexibility, recombinant yield, and related properties that enable it to carry out its function of interest, be it enzymatic degradation of deleterious chemical compounds in an industrial context or flagging an overexpressed cancer-linked immune checkpoint inhibitor for adaptive immune system-mediated elimination in a therapeutic one. Despite the importance of developability in classical protein discovery pipelines, the ruggedness and sparsity of the underlying sequence-function and sequence-developability landscapes significantly impede identification of lead variants. High-throughput developability proxy metrics on deep mutational scanning protein datasets partially alleviate this issue, but rigorous data-driven techniques are needed to use such exponentially growing datasets to guide successive developability hit discovery efforts. The following research focuses on both retrospective and prospective data-driven prediction of small protein scaffold developability across three protein scaffolds and several novel high-throughput proxy and low-throughput gold-standard developability assays. The first story developed a model that maps 104 Gp2 scaffold paratope sequences to a set of three high-throughput assays, guiding construction of a sequence-developability latent space representation via an integrated embedding and top model approach. We then trained a separate top model in a transfer-learning paradigm to map these variants’ learned developability representations to a low-throughput gold-standard developability metric, bacterial recombinant yield.
This deep learning approach outperformed all assessed benchmarks and controls on both the retrospective assay and recombinant yield tasks on a held-out independent test set. Interpretation of this model revealed the importance of amino acid charge, size, and cysteine identity to Gp2 developability. Prospective analysis, which combined Monte Carlo sampling with discriminative variant scoring by our model, enhanced recombinant yield relative to both the controls and the original naïve library. The second story adapts the approach of the first to map sequence to high-throughput developability proxy assays, and from those to low-throughput gold-standard developability metrics, for two additional small scaffold proteins: affibody and fibronectin. Because the assessed libraries were large, the overlap of observed affibody and fibronectin variants across the evaluated assays was sparse, which required that we first assess relative correlation and information content across the assays. We then demonstrated the utility of this information for maximizing top model performance by comparing linear and nonlinear models on two low-throughput gold-standard metrics for each protein, measuring independent test set performance and the degree of overfitting as the ratio of cross-validation to test set performance. As expected, models trained on the assays with higher mutual information for a given gold-standard metric performed best on the held-out task and overfit least relative to controls. Together, these studies support the conclusion that high-throughput developability proxy assays can inform deep-learning models on both retrospective and prospective protein variant prediction tasks, outperforming contemporary naïve autoencoder-based representations alone.
The research detailed herein thus underscores the utility of both creating large-scale developability datasets and designing modern data-driven deep-learning algorithms that leverage them to enable optimal traversal of the daunting sequence-fitness (here, developability) design space.
Description
University of Minnesota Ph.D. dissertation. May 2025. Major: Chemical Engineering. Advisor: Benjamin Hackel. 1 computer file (PDF); x, 147 pages.
Suggested citation
Schmitz, Zachary. (2025). Data driven protein scaffold developability engineering. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/275922.