Browsing by Subject "Bioinformatics"
Item: Analysis Of Human Leukocyte Antigen (HLA) Immunogenetic Data For Hematopoietic Stem Cell Transplantation And Disease Association (2014-12) Gragert, Loren
The Major Histocompatibility Complex (MHC) on chromosome 6 is the most polymorphic region of the human genome, and is also under very strong selection pressure, resulting in genetic divergence of immune gene variants between human populations. The human leukocyte antigen (HLA) genes located in the MHC region play a central role in the immune system, as HLA proteins distinguish self from non-self through antigenic peptide presentation to T-cells. Hematopoietic stem cell transplantation (HSCT) is a curative therapy for many patients with hematologic diseases, but successful transplant requires a high degree of HLA matching between donor and recipient. Unfortunately, HLA-matched donors are not available for all patients. HLA diversity is vast: millions of unique HLA genotypes have been observed worldwide, many of which are largely private to specific human populations. In response to this HLA-matching challenge, large registries of unrelated donors have been constructed worldwide to provide HLA-matched HSCT to patients. Even with large registries, minority and admixed race/ethnic groups in the United States have a lower likelihood than European-Americans of finding an HLA match. Legacy high-throughput HLA typing methods give high levels of typing ambiguity at recruitment, resulting in a lack of initial confirmation that a suitable match exists. Current population genetics techniques fall short in addressing the unique challenges of stem cell registry analytics, resulting in a difficult search process for some patients. This thesis describes new techniques developed to analyze immunogenetics data with direct operational application in the registry setting.
Advances in computational techniques in population genetics to better handle HLA typing ambiguity have improved calculation of HLA haplotype frequencies, prediction of allele-level HLA typing for subjects with typing ambiguity in registry matching algorithms, and projection of HLA match likelihoods as registries expand. These advances have had direct operational impact for the National Marrow Donor Program (NMDP) through more rapid identification of suitably matched donors and optimized allocation of resources in order to serve more patients, especially in underserved minority groups. These computational techniques have also enabled more detailed evaluation of immunogenetic associations with disease, which may lead to new avenues of treatment for cancer and autoimmune diseases.

Item: Comparative genomics of Ornithobacterium rhinotracheale and determination of strain-specific pathogenicity and virulence (2021-09) Smith, Emily
The subsequent chapters of this dissertation will address many of the current knowledge gaps surrounding ORT. First, comparative genomics of clinical ORT isolates from several US commercial turkey producers will highlight the genetic similarities and differences between currently circulating ORT strains. Second, a study comparing these clinical isolates to commensal isolates of ORT will reveal whether there are genetic differences between clinical and commensal isolates. Finally, a series of challenge studies will determine whether clinical ORT strains that differ genomically also differ clinically, and whether controlled exposure is effective in preventing negative outcomes associated with ORT.

Item: Computational Techniques for Analyzing Tumor DNA Data (2016-06) Landman, Sean
Cancer has often been described as a disease of the genome, and understanding the underlying genetics of this complex disease opens the door to developing improved treatments and more accurate diagnoses.
The abundant availability of next-generation DNA sequencing data in recent years has provided a tremendous opportunity to enhance our understanding of cancer genetics. Despite this wealth of data, analyzing tumor DNA data is complicated by issues such as the genetic heterogeneity often found in tumor tissue samples, and the diverse and complex genetic landscape that is characteristic of tumors. Advanced computational analysis techniques are required in order to address these challenges and to deal with the enormous size and inherent complexity of tumor DNA data. The focus of this thesis is to develop novel computational techniques to analyze tumor DNA data and address several ongoing challenges in the area of cancer genomics research. These techniques are organized into three main aims or focuses. The first focus is on developing algorithms to detect patterns of co-occurring mutations associated with tumor formation in insertional mutagenesis data. Such patterns can be used to enhance our understanding of cancer genetics, as well as to identify potential targets for therapy. The second focus is on assembling personal genomic sequences from tumor DNA. Personal genomic sequences can enhance the efficacy of downstream analyses that measure gene expression or regulation, especially for tumor cells. The final focus is on estimating variant frequencies from heterogeneous tumor tissue samples. Accounting for heterogeneous variants is essential when analyzing tumor samples, as they are often the cause of therapy resistance and tumor recurrence in cancer.

Item: Developing accessible informatics tools for integrated genomic-proteomic data analysis (2019-11) Kumar, Praveen
Mass spectrometry (MS)-based proteomics is widely used to identify and quantify proteins present in biological samples.
Emerging multi-omics approaches involve integrating next-generation DNA and RNA sequencing data with MS-based proteomic data to identify novel and known protein products (proteoforms) present in a sample that could be from a single organism (proteogenomics) or a community of organisms (metaproteomics). These methods can offer a more complete molecular picture of complex biological samples used in human health and environmental studies. In these MS-based proteomics approaches, tandem mass spectrometry (MS/MS) data derived from peptides is matched against a database containing amino-acid sequences translated from DNA or RNA sequencing to confirm the presence of proteoforms. However, proteogenomic and metaproteomic databases are significantly larger than those used in traditional MS-based proteomics, leading to decreased sensitivity for identifying true peptide spectrum matches (PSMs) for MS/MS matched to sequences in these databases. Once peptides are identified and used to infer protein presence and quantities, there is also a need for advanced tools to compare the response of proteins to that of their corresponding RNA transcripts, in order to analyze the underlying molecular mechanisms of biology and disease. Ideally, all of these informatic tools would be accessible to lab scientists within a user-friendly platform, to promote wide adoption and impact in diverse research studies. To address these challenges, we have developed software tools and workflows in the freely available and user-friendly Galaxy bioinformatics platform, with the objective of providing solutions to MS-based proteomics multi-omics challenges and making them accessible to others. First, we implemented a novel database sectioning method, integrated it into the suite of tools developed for the Galaxy for proteomics (Galaxy-P) project, and evaluated its utility in metaproteomics and proteogenomics applications.
Second, we created a comprehensive workflow for proteogenomics that can efficiently utilize RNA and protein data to identify novel protein variants and proteoforms. Third, we developed a Galaxy-P based tool for comparing the abundance levels of RNA and proteins for integrated analysis of quantitative transcriptomic and proteomic datasets. Collectively, this work has delivered on our goals to develop accessible and reproducible software tools and workflows for more efficient matching of MS/MS data with large databases and also improve integrated analysis of multi-omics applications that can help enable new discoveries in biological and biomedical research.

Item: Evaluating the information content of human microbiomes (2022-03) Hillmann, Benjamin
Microbes vastly outnumber all other organisms on earth and are integral to many aspects of the ecological fitness of the earth’s soils, oceans, animals, and plants. Unfortunately, most of the microbes in these communities cannot be cultured, so to observe these communities’ biological functions, we must study their DNA. After a researcher sequences a microbial community, they utilize informatics methods to correlate the taxonomic and functional profiles to their traits of interest. However, these methods assume that the underlying taxonomic and functional profiling are accurate. If procedures are developed to identify the profiles of a community more accurately, the increased precision will enable higher power testing of hypotheses and detection of these communities’ causal roles. We propose novel, accurate, and data-efficient methods for taxonomic and functional profiles in shotgun metagenomic datasets.

Item: Integrated analysis of genomic data for inferring gene regulatory networks. (2009-04) Zare Sangederazi, Hossein
As genomic technology and sequencing projects continue to advance, more emphasis needs to be put on data analysis, while addressing the issue of how best to extract information from diverse data sets.
For example, functional annotation of new genes can no longer depend only on sequence analysis, but requires integration of additional sources of information including phylogeny, gene expression, protein interaction, and metabolic and regulatory networks. Therefore, new biological discoveries will depend strongly on our ability to combine these diverse data sets. We demonstrate how information from gene expression, regulatory sequence patterns and location data can be combined to discover regulatory modules and to construct gene transcriptional regulatory networks. In the context of modeling regulatory sequences, we propose a higher order probabilistic model to efficiently discriminate between the binding sites of a transcription factor and non-specific DNA sequences. Moreover, a model-based algorithm is developed, which integrates gene expression data, modeled by mixtures of Gaussians, with the regulatory sequence patterns for clustering of functionally related genes. For the construction of the gene regulatory network, we introduce the concept of Gene-Regulon association in contrast to Gene-Gene interaction. Unlike Gene-Gene interaction methods, where the mRNA levels of the regulators play the important role, Gene-Regulon methods rely on the activity profiles of the transcription factors. These activity profiles, in the absence of direct measurements, are estimated concurrently via a computational model. We develop a model selection algorithm, which is capable of capturing the activity profile of a transcription factor from the transcriptional activity of its target genes. In addition, we present a data-driven approach based on nonlinear kernel embedding for capturing the nonlinear correlation and geometric connectivity pattern in gene expression data. We apply these methods for integrating gene expression and interaction data to construct a network of transcriptional regulation in Escherichia coli (E. 
coli).

Item: Interview with Milton Corn (2014-11-21) Corn, Milton; Tobbell, Dominique
Milton Corn begins the interview discussing the definition of health informatics and the early National Library of Medicine Research Training in Medical Informatics programs, including the University of Minnesota’s training program. Dr. Corn describes his first introduction to medical informatics while serving as dean of Georgetown University School of Medicine and his decision to join the NLM in 1990. He describes at length the evolution of the NLM Research Training Program and the related history of the University of Minnesota’s training program based on the evaluations the NLM performed of the training program every five years. He discusses the University of Minnesota and Mayo Clinic’s efforts to establish a collaborative training program with Arizona State University. He also discusses the implications of Minnesota’s decision not to fully pursue bioinformatics when the NLM shifted the focus of its training program in the 1990s. Dr. Corn goes on to discuss the development of the Clinical and Translational Science Awards and the influence of the awards on health informatics research.

Item: Pearl in the mud: Genome assembly and binning of a cold seep Thiomargarita nelsonii cell and associated epibionts from an environmental metagenome (2014-01) Fliss, Palmer Scott
As the study of microbes and their impact on the environment grows, so too does the desire to understand the genetic basis of the physiologies that make possible interactions between microbial cells and their environment. Since it is now much more cost-effective to sequence bacterial genomes, environmental metagenomic assembly is a very attractive option for obtaining the genetic blueprints of bacterial physiologies. Bacteria of the genus Thiomargarita (Greek; theio-: sulfur; margarites: pearl) pose a particularly interesting quandary.
The genus includes the world's largest bacteria, but as uncultured organisms, their physiologies and the basis for their gigantism are not well understood. In order to investigate the genetic basis for these modes, a single cell MDA amplification approach was used on T. nelsonii cells collected at the Hydrate Ridge methane seep off the coast of Oregon. These particular cells were derived from a gastropod-attached epibiont community. Next-generation sequencing produced a metagenomic product representing both T. nelsonii and attached bacteria (epibionts). These reads were assembled into contigs, binned using the tetranucleotide frequency of the resultant contigs, and finalized using a more stringent secondary assembly. The resulting draft genome shows evidence in Thiomargarita nelsonii for a complete denitrification pathway not previously known in large, vacuolated, sulfur-oxidizing bacteria. Additionally, the genes necessary for polyphosphate metabolism were observed. Polyphosphate metabolism is thought to play a role in the formation of phosphatic minerals that serve as important reservoirs in the marine phosphorus cycle.

Item: Protein expression profile of rat type two alveolar epithelial cells during hyperoxic stress and recovery (2013-05) Bhargava, Maneesh
Rationale: In rodent model systems, the sequential changes in lung morphology resulting from hyperoxic injury are well characterized, and are similar to changes in human acute respiratory distress syndrome (ARDS). In the injured lung, alveolar type two (AT2) epithelial cells play a critical role in restoring the normal alveolar structure. Thus, characterizing the changes in AT2 cells will provide insights into the mechanisms underpinning the recovery from lung injury. Methods: We applied an unbiased systems level proteomics approach to elucidate molecular mechanisms contributing to lung repair in a rat hyperoxic lung injury model.
AT2 cells were isolated from rat lungs at predetermined intervals during hyperoxic injury and recovery. Protein expression profiles were determined by using iTRAQ® with tandem mass spectrometry. Results: Of 959 distinct proteins identified, 183 significantly changed in abundance during the injury-recovery cycle. Gene Ontology enrichment analysis identified cell cycle, cell differentiation, cell metabolism, ion homeostasis, programmed cell death, ubiquitination, and cell migration to be significantly enriched by these proteins. Gene Set Enrichment Analysis of data acquired during lung repair revealed differential expression of gene sets that control multicellular organismal development, systems development, organ development, and chemical homeostasis. More detailed analysis identified activity in two regulatory pathways, JNK and miR 374. A Short Time-series Expression Miner (STEM) algorithm identified protein clusters with coherent changes during injury and repair. Conclusion: Coherent changes occur in the AT2 cell proteome in response to hyperoxic stress. These findings offer guidance regarding the specific molecular mechanisms governing repair of the injured lung.

Item: Proteomic Studies in Acute Respiratory Failure (2015-08) Bhargava, Maneesh
Respiratory failure is a syndrome of impaired gas exchange resulting in abnormal oxygenation and carbon dioxide elimination. The lung damage seen in Acute Respiratory Distress Syndrome (ARDS) and Idiopathic Pneumonia Syndrome (IPS) causes acute respiratory failure and results in high mortality and morbidity. Our objective is to gain novel insights into the pathways and biological processes that occur in response to diffuse lung injury by using comprehensive protein expression profiling in combination with bioinformatics tools. We characterized the protein expression in the bronchoalveolar lavage fluid (BALF) from subjects with ARDS and also in hematopoietic stem cell transplantation (HSCT) recipients.
For our studies, ARDS cases were grouped into survivors and non-survivors. The HSCT recipients were assigned to either infectious lung injury or IPS, i.e. non-infectious lung injury. The BALF samples were processed by desalting, concentration and removal of high abundance proteins. Enriched medium- and low-abundance protein fractions were trypsin digested and labeled with the iTRAQ reagent for mass spectrometry (MS). The complex mixture of iTRAQ labeled peptides was analyzed by 2D capillary LC-MS/MS on an Orbitrap Velos system in HCD mode for data-dependent peptide tandem MS. Protein identification employed a target decoy strategy using ProteinPilot. To determine the biologic relevance of the differentially expressed proteins we used the Database for Annotation, Visualization and Integrated Discovery (DAVID) and Ingenuity Pathway Analysis (IPA). In the studies done on pooled BALF described in Chapter 3, we identified 792 proteins at a global FDR of ≤ 1%. The proteins that were more abundant in early phase survivors represented the GO groups involved in coagulation, fibrinolysis and wound healing, cation homeostasis and activation of the immune response. In contrast, non-survivors had evidence of carbohydrate catabolism, collagen deposition and actin cytoskeleton reorganization. These proof-of-concept studies identified early differences in the BALF from ARDS survivors compared to non-survivors. As a follow-up, we characterized BALF from individual subjects with ARDS, 20 survivors and 16 non-survivors (Chapter 4). To accomplish this we performed six eight-plex iTRAQ LC-MS/MS experiments, and we identified 1122 unique proteins in the BALF.
The proteins that were differentially expressed between survivors and non-survivors represented three canonical pathways (acute phase response signaling, complement system activation, and LXR/RXR activation) and four IPA Diseases and Functions categories (cellular movement, immune cell trafficking, hematological system development and inflammatory response). Similar to our prior studies, GO biological processes annotated to these proteins included programmed cell death, collagen metabolic processes, and acute inflammatory response. The sparse logistic regression model identified twenty proteins that predicted survival in ARDS. For the studies conducted in HSCT recipients (Chapter 5), we performed five eight-plex iTRAQ LC-MS/MS experiments and identified 1125 unique proteins. The proteins that were differentially expressed between IPS and infectious lung injury were enriched for the GO biological process terms immune response, leucocyte adhesion, coagulation, wound healing, cell migration, glycolysis, and apoptosis. In summary, the BALF protein expression profile identifies key differences in the biological processes in different subgroups of patients with diffuse lung injury. These differences position us to develop diagnostic and prognostic biomarkers and identify new targets for pharmacological therapy.

Item: Quantification and Mechanistic Analysis of Plant Genome Editing Outcomes using Nanopore Sequencing (2020-08) Atkins, Paul
Precise genome modification via homologous recombination, or gene targeting (GT), allows crop genomes to be tailored to any application or environment. While GT’s potential is immense, it tends to be inefficient and technically challenging in plants. These problems are compounded by the slow and low-throughput nature of plant transformation, drastically hindering optimization.
More insidiously, these issues result in dependence upon proxies and reporter readouts for estimating GT frequencies. These readouts vary between groups and delivery platforms, making it difficult to compare experimental outcomes. To enable widespread optimization of plant GT, a universal platform for directly measuring genome editing outcomes at the molecular level that accommodates plant-specific technical constraints is urgently needed. Here I develop such a platform, an amplicon-based analysis pipeline using Oxford Nanopore Sequencing (ONS). ONS has several valuable qualities for a plant GT optimization pipeline, namely its accessibility, speed, and read length, making it feasible for even the smallest labs to perform on-demand sequencing with their own equipment. These strengths are accompanied by a major shortcoming: sequencing error. I mitigate this problem using several approaches in a novel bioinformatics pipeline to minimize the effect of ONS error on estimates of targeted mutagenesis and virtually eliminate its effect on estimates of GT frequencies. Using this pipeline, I observed a significant impact of both geminiviral replicons (GVRs) and donor sequence divergence on gene targeting frequencies. Additionally, I was able to observe the conversion tracts of hundreds of gene targeting events, revealing their deposition by multiple DNA repair pathways and the prevalence of extremely short tracts, which will inform future optimization efforts.
This work establishes a universal pipeline for quantifying plant gene targeting events, facilitating future optimization and communication of results between disparate experimental systems within the plant community.

Item: The relationship of gut microbiota in standard and overweight children, before and after probiotic administration (2019) Linhardt, Carter A; Clayton, Jonathan B; Hoops, Suzie; Amin-Nordin, Syafinaz; Knights, Dan
The use of probiotic foods and supplements is widely considered part of a healthy diet because it supplements the gut microbiome with beneficial bacteria. Although probiotics are a common dietary accessory, there is limited reproducible evidence showing bacterial colonization, thus limiting long-term effectiveness. We administered Yakult, a commercial probiotic composed of Lactobacillus paracasei strain Shirota, to overweight and standard weight school children in Malaysia. Using a crossover intervention study design, two groups of school children were administered the probiotic supplement or continued their typical diet in sequential 5-week intervention periods, separated by a 5-week washout period. Fecal samples were collected every five weeks over the course of the 15-week study period. The gut microbiome of each subject was analyzed using 16S rRNA gene sequencing. We observed significant differences in Lachnospiraceae, Coprococcus, Roseburia, Pyramidobacter, and Bacteroides ovatus between weight classes. However, differences in overall microbiome diversity between weight classes were not found to be significant. Subjects clustered according to their relative abundance of the well-known genera Bacteroides and Prevotella, regardless of age, gender, or weight class.
Overall, individual-to-individual variation overshadowed trends in gut microbiome composition associated with probiotic administration.

Item: Use of Machine Learning to Predict the Desiccation Tolerance of Bacteria (2021-08) Clipsham, Maia
For efficient long-term storage and use of bacteria for environmental applications, understanding and identifying desiccation resistance in bacteria is key. In the past, desiccation tolerance was a common way of characterizing bacteria, so there is much data on the desiccation tolerance of a wide range of bacterial species. Since the advent of transcriptomics, multiple papers have been published on the expression level of genes during desiccation stress. Additionally, many reviews have described mechanisms and genes relevant to desiccation tolerance in bacteria, but an overarching framework for the prediction of desiccation survival in bacteria is lacking. Model building based on data collected from the literature has been used to successfully predict aerobic vs anaerobic phenotype, enzyme function and substrate specificity (Robinson et al., 2020; Jabłońska et al., 2019). Building on this wealth of previous research, machine learning was used to create a robust model that predicts desiccation tolerance given bacterial genomes. The validity and accuracy of the machine learning model were tested using a desiccation assay carried out over three months. To build the model, a literature review was conducted to find genes that were upregulated greater than two-fold during desiccation stress in bacteria. From the review, 2609 genes from 11 papers were found and condensed to 1082 non-homologous genes without near-zero variance. A second literature search was conducted to identify bacterial species with a known desiccation response, either tolerant or sensitive, and a publicly available genome.
Thirty-five desiccation-tolerant and 33 desiccation-sensitive genomes were chosen and then queried for the previously curated list of desiccation-upregulated genes. Approximately 176,800 genes were analyzed, and genes with near-zero variance were removed. The remaining 75,982 genes are included in the model (Rogozin et al., 2002). A random forest supervised machine learning approach was used to create a preliminary model for desiccation resistance. The genomes were split into 80% training data and 20% test data, and the model was run 100 times with different seeds, 10-fold cross validation, and three repeats. The average accuracy for the 100 iterations of the model was 0.898 ± 0.0266, indicating the model could accurately predict the desiccation phenotype of the testing data 89.8% of the time. The experimental validation of the desiccation model looked at the viability of 28 bacteria: seven with documented desiccation phenotypes and 21 with no known desiccation phenotype. For all organisms tested, the model had an accuracy of 0.75, demonstrating good model performance.
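
The evaluation protocol in the abstract above (an 80%/20% split repeated over 100 seeds, summarized as mean accuracy ± standard deviation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene presence/absence data are synthetic, the function names are hypothetical, and a simple nearest-centroid classifier stands in for the random forest so that the example needs only the standard library. The 10-fold cross validation with three repeats is not reproduced here.

```python
import random
import statistics


def make_synthetic_genomes(n_tolerant=35, n_sensitive=33, n_genes=30, seed=0):
    """Hypothetical gene presence/absence data; NOT the thesis data."""
    rng = random.Random(seed)
    data = []
    # Label 1 = tolerant, 0 = sensitive; p is an assumed gene-presence rate.
    for label, n, p in ((1, n_tolerant, 0.6), (0, n_sensitive, 0.4)):
        for _ in range(n):
            features = [1 if rng.random() < p else 0 for _ in range(n_genes)]
            data.append((features, label))
    return data


def stratified_split(data, test_fraction, rng):
    """Shuffle within each class, then hold out test_fraction of each class."""
    train, test = [], []
    for label in (0, 1):
        rows = [row for row in data if row[1] == label]
        rng.shuffle(rows)
        n_test = max(1, round(len(rows) * test_fraction))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


def fit_centroids(train):
    """Stand-in classifier: per-class mean feature vector (not a random forest)."""
    centroids = {}
    for label in (0, 1):
        rows = [features for features, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def repeated_holdout_accuracy(data, n_runs=100, test_fraction=0.2):
    """Repeat the seeded split/fit/score cycle; return (mean, sd) of accuracy."""
    accuracies = []
    for seed in range(n_runs):
        rng = random.Random(seed)
        train, test = stratified_split(data, test_fraction, rng)
        centroids = fit_centroids(train)
        correct = sum(predict(centroids, f) == y for f, y in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


if __name__ == "__main__":
    mean_acc, sd_acc = repeated_holdout_accuracy(make_synthetic_genomes())
    print(f"accuracy: {mean_acc:.3f} ± {sd_acc:.4f}")
```

With scikit-learn available, the stand-in classifier could presumably be replaced by `RandomForestClassifier` and the split by `train_test_split`; the numbers printed by this sketch come from synthetic data and say nothing about the thesis results.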