Developing accessible informatics tools for integrated genomic-proteomic data analysis

Kumar, Praveen
2022-01-04
2022-01-04
2019-11
https://hdl.handle.net/11299/225893
University of Minnesota Ph.D. dissertation. November 2019. Major: Biomedical Informatics and Computational Biology. Advisor: Timothy Griffin. 1 computer file (PDF); x, 151 pages.

Mass-spectrometry (MS) based proteomics is widely used to identify and quantify proteins present in biological samples. Emerging multi-omics approaches integrate next-generation DNA and RNA sequencing data with MS-based proteomic data to identify novel and known protein products (proteoforms) present in a sample, which may come from a single organism (proteogenomics) or a community of organisms (metaproteomics). These methods can offer a more complete molecular picture of complex biological samples used in human health and environmental studies. In these MS-based proteomics approaches, tandem mass spectrometry (MS/MS) data derived from peptides are matched against a database containing amino-acid sequences translated from DNA or RNA sequencing data to confirm the presence of proteoforms. However, proteogenomic and metaproteomic databases are significantly larger than those used in traditional MS-based proteomics, which decreases sensitivity for identifying true peptide-spectrum matches (PSMs) when MS/MS spectra are matched to sequences in these databases. Once peptides are identified and used to infer protein presence and quantities, there is also a need for advanced tools that compare the response of proteins to that of their corresponding RNA transcripts, in order to analyze the underlying molecular mechanisms of biology and disease. Ideally, all of these informatics tools would be accessible to lab scientists within a user-friendly platform, to promote wide adoption and impact in diverse research studies.
To address these challenges, we have developed software tools and workflows in the freely available and user-friendly Galaxy bioinformatics platform, with the objective of providing solutions to the multi-omics challenges of MS-based proteomics and making them accessible to others. First, we implemented a novel database sectioning method, integrated it into the suite of tools developed for the Galaxy for proteomics (Galaxy-P) project, and evaluated its utility in metaproteomics and proteogenomics applications. Second, we created a comprehensive workflow for proteogenomics that can efficiently utilize RNA and protein data to identify novel protein variants and proteoforms. Third, we developed a Galaxy-P based tool for comparing the abundance levels of RNA and proteins for integrated analysis of quantitative transcriptomic and proteomic datasets. Collectively, this work has delivered on our goals of developing accessible and reproducible software tools and workflows that match MS/MS data against large databases more efficiently and that improve integrated analysis in multi-omics applications, which can help enable new discoveries in biological and biomedical research.

en
Bioinformatics; metaproteomics; proteogenomics; proteomics; quantitative proteo-transcriptomics; tandem mass spectrometry
Thesis or Dissertation