Statistical and machine learning methods for multi-view multi-cohort biomedical data integration
2024-09
Type
Thesis or Dissertation
Abstract
The dramatic proliferation of omics data in biomedical research has allowed increasingly comprehensive investigations spanning multiple distinct sample sets (called multi-cohort data, e.g., patients with different cancer types) and multiple molecular facets (called multi-view data, e.g., gene expression, proteomics, and clinical records for the same patients). Statistical approaches that combine multiple datasets can be more powerful, efficient, and informative than separate analyses. We develop several novel methods to analyze multi-view multi-cohort data by modeling the relationships among views and cohorts and uncovering their underlying structure. These methods facilitate more accurate prediction, classification, and missing data imputation, and address new scientific questions.
In the first project, we propose Deep IDA (Integrative Discriminant Analysis), a deep learning method that learns nonlinear projections of two or more views that maximally associate the views and separate the classes in each view. We use a homogeneous ensemble approach to feature ranking to identify the variables from each view that contribute most to the association of the views and the separation of the classes within each view, resulting in interpretable findings. Through our framework, we identify signatures that better discriminate COVID-19 patient groups and that relate to neurological conditions, cancer, and metabolic diseases, corroborating current research findings and underscoring the need to study the sequelae of COVID-19 in order to devise effective treatments and improve patient care.
In the second project, to model the structure of variation correctly and comprehensively for high-dimensional data across multiple sample sets (i.e., cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method that concurrently learns both covariate-driven and auxiliary structured variation.
In the third project, we propose bi-dimensional augmented reduced rank regression (baRRR), an extension of maRRR. This approach is tailored to model multiple covariate-related and covariate-unrelated effects concurrently, capturing effects at different levels (global, partially shared, and individual) across both cohorts (i.e., groups) and views (i.e., sources or modalities).
We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA, with somatic mutations as covariates. Similarly, we apply baRRR to pan-cancer and, furthermore, pan-omics (mRNA, miRNA, DNA methylation, and protein) data from TCGA. Both methods perform well with respect to prediction and imputation of held-out data, and provide new insights into mutation-driven and auxiliary variation that is shared across, or specific to, certain cancer types and/or molecular modalities.
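For readers unfamiliar with the building block that maRRR and baRRR extend, the following is a minimal sketch of classical reduced rank regression only. It is not the dissertation's method: it handles a single cohort and a single view, with no auxiliary variation and no shared or individual structure. The function name `reduced_rank_regression` and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced rank regression (illustrative, not maRRR/baRRR).

    Minimises ||Y - X B||_F subject to rank(B) <= rank, assuming X has
    full column rank.
    """
    # Ordinary least squares coefficient matrix, shape (p, q).
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # The leading right singular vectors of the fitted values span the
    # best rank-constrained subspace of the response space.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                      # shape (q, rank)
    # Project the OLS coefficients onto that subspace.
    return B_ols @ V_r @ V_r.T             # shape (p, q), rank <= rank

# Simulated example (hypothetical data): 200 samples, 10 covariates,
# 8 outcomes, and a true coefficient matrix of rank 2.
rng = np.random.default_rng(0)
n, p, q, r = 200, 10, 8, 2
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
Y = X @ B_true + 0.1 * rng.standard_normal((n, q))

B_hat = reduced_rank_regression(X, Y, rank=r)
rel_err = np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true)
print(f"estimated rank: {np.linalg.matrix_rank(B_hat)}, relative error: {rel_err:.4f}")
```

In the pan-cancer setting described in the abstract, `X` would play the role of the somatic mutation covariates and `Y` the gene expression outcomes; the methods in the dissertation additionally model structured variation that the covariates do not explain, which this sketch omits.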
Description
University of Minnesota Ph.D. dissertation. September 2024. Major: Biostatistics. Advisor: Eric Lock. 1 computer file (PDF); xx, 155 pages.
Suggested citation
Wang, Jiuzhou. (2024). Statistical and machine learning methods for multi-view multi-cohort biomedical data integration. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/270062.