Clipsham, Maia
2021-10-13
2021-10-13
2021-08
https://hdl.handle.net/11299/224921
University of Minnesota M.S. thesis. 2021. Major: Microbial Engineering. Advisors: Lawrence Wackett, Alptekin Aksan. 1 computer file (PDF); viii, 69 pages.
For efficient long-term storage and use of bacteria in environmental applications, understanding and identifying desiccation resistance in bacteria is key. In the past, desiccation tolerance was a common way of characterizing bacteria, so ample data exist on the desiccation tolerance of a wide range of bacterial species. Since the advent of transcriptomics, multiple papers have been published on gene expression levels during desiccation stress. Additionally, many reviews have described mechanisms and genes relevant to desiccation tolerance in bacteria, but an overarching framework for predicting desiccation survival in bacteria is lacking. Model building based on data collected from the literature has been used to successfully predict aerobic vs. anaerobic phenotype, enzyme function, and substrate specificity (Robinson et al., 2020; Jabłońska et al., 2019). Building on this wealth of previous research, machine learning was used to create a robust model that predicts desiccation tolerance from bacterial genomes. The accuracy of the machine learning model was validated using a desiccation assay carried out over three months. To build the model, a literature review was conducted to find genes that were upregulated more than two-fold during desiccation stress in bacteria. From the review, 2609 genes from 11 papers were found and condensed to 1082 non-homologous genes without near-zero variance. A second literature search was conducted to identify bacterial species with a known desiccation response, either tolerant or sensitive, and a publicly available genome.
Thirty-five desiccation-tolerant and 33 desiccation-sensitive genomes were chosen and then queried for the previously curated list of desiccation-upregulated genes. Approximately 176,800 genes were analyzed, and genes with near-zero variance were removed. The remaining 75,982 genes were included in the model (Rogozin et al., 2002). A random forest supervised machine learning approach was used to create a preliminary model for desiccation resistance. The genomes were split into 80% training data and 20% test data, and the model was run 100 times with different seeds, using 10-fold cross-validation with three repeats. The average accuracy over the 100 iterations was 0.898 ± 0.0266, indicating that the model correctly predicted the desiccation phenotype of the test data 89.8% of the time on average. The experimental validation of the desiccation model examined the viability of 28 bacteria: seven with documented desiccation phenotypes and 21 with no known desiccation phenotype. For all organisms tested, the model had an accuracy of 0.75, demonstrating good model performance.
en
Bacteria
Bioinformatics
Desiccation
Machine Learning
Use of Machine Learning to Predict the Desiccation Tolerance of Bacteria
Thesis or Dissertation
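
The abstract mentions filtering out genes with near-zero variance before model building but does not state the criterion used. The following is a minimal sketch of one common rule, assuming a genomes-by-genes presence/absence matrix. The thresholds (`freq_cut`, `unique_cut`), the function names, and the toy data are all illustrative assumptions, not values or code taken from the thesis.

```python
from collections import Counter


def is_near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a feature column carries almost no information.

    Assumed rule (not taken from the thesis): a column is flagged when it
    has a single distinct value, or when the most common value outnumbers
    the second most common by more than freq_cut AND the share of distinct
    values is at most unique_cut percent of the samples.
    """
    counts = Counter(column).most_common()
    if len(counts) < 2:
        return True  # zero variance: every genome has the same value
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def drop_near_zero_variance(matrix, feature_names):
    """Keep only the informative columns of a genomes x genes matrix."""
    columns = list(zip(*matrix))
    keep = [i for i, col in enumerate(columns) if not is_near_zero_variance(col)]
    filtered = [[row[i] for i in keep] for row in matrix]
    return filtered, [feature_names[i] for i in keep]


# Toy data: 40 hypothetical genomes, three hypothetical gene features
# (1 = homolog found, 0 = not found).
genomes = [[1, i % 2, 1 if i == 0 else 0] for i in range(40)]
names = ["gene_a", "gene_b", "gene_c"]
filtered, kept = drop_near_zero_variance(genomes, names)
print(kept)  # prints ['gene_b']
```

In this toy example, `gene_a` is present in every genome and `gene_c` in only one of 40, so neither helps separate tolerant from sensitive genomes; only `gene_b` is retained.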