Statistical and machine learning methods for multi-view multi-cohort biomedical data integration

The dramatic proliferation of omics data in biomedical research has allowed increasingly comprehensive investigations spanning multiple distinct sample sets (called multi-cohort data, e.g., patients of different cancer types) and multiple molecular facets (called multi-view data, e.g., gene expression, proteomics, and clinical records for the same patients). Statistical approaches that combine multiple datasets are more powerful, efficient, and informative than separate analyses. We develop several novel methods to analyze multi-view multi-cohort data by modeling their complex relationships and uncovering their complex structures. These methods will facilitate more accurate prediction, classification, and missing data imputation, and address new scientific questions. In the first project, we propose Deep IDA (Integrative Discriminant Analysis), a deep learning method to learn nonlinear projections of two or more views that maximally associate the views and separate the classes in each view. We use a homogeneous ensemble approach for feature ranking to identify the variables from each view that contribute most to the association of the views and the separation of the classes within each view, resulting in interpretable findings. Through our framework, we identify signatures that better discriminate COVID-19 patient groups and that relate to neurological conditions, cancer, and metabolic diseases, corroborating current research findings and underscoring the need to study the post-acute sequelae of COVID-19 in order to devise effective treatments and improve patient care. In the second project, to address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (i.e., cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variation.
In the third project, we propose the Bi-dimensional Augmented Reduced Rank Regression (baRRR) method as an extension of maRRR from the second project. This approach is tailored to model multiple covariate-related and covariate-unrelated effects concurrently, capturing effects at several levels (global, partially shared, and individual) across both cohorts (i.e., groups) and views (i.e., sources or modalities). We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA, with somatic mutations as covariates. Similarly, we apply baRRR to pan-cancer and, furthermore, pan-omics (mRNA, miRNA, DNA methylation, and protein) data from TCGA. Both methods perform well with respect to prediction and imputation of held-out data, and provide new insights into mutation-driven and auxiliary variation that is shared across or specific to certain cancer types and/or molecular modalities.
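For readers unfamiliar with the building block that maRRR and baRRR extend, the following is a minimal sketch of classical reduced rank regression only, not the dissertation's methods. It assumes a single cohort, a single view, a known target rank, and unweighted least squares. The function name `reduced_rank_regression` and the simulated data are illustrative assumptions, not part of the original work.

```python
import numpy as np


def reduced_rank_regression(X, Y, rank):
    """Classical reduced rank regression (illustrative, hypothetical helper).

    Minimises ||Y - X B||_F subject to rank(B) <= rank by projecting the
    ordinary least squares solution onto the leading right singular
    vectors of the fitted values.
    """
    # Unconstrained least squares coefficients, shape (p, q).
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Fitted values under the unconstrained fit, shape (n, q).
    fitted = X @ B_ols
    # Leading right singular vectors span the best rank-r outcome subspace.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:rank].T
    # Project the OLS coefficients onto that subspace.
    return B_ols @ V_r @ V_r.T


# Simulated example (assumed data): 200 samples, 10 covariates, 30 outcomes,
# with a true coefficient matrix of rank 2.
rng = np.random.default_rng(0)
n, p, q, r = 200, 10, 30, 2
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

B_hat = reduced_rank_regression(X, Y, rank=r)
rel_err = np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true)
```

Here `B_hat` plays the role of covariate-driven variation (for example, mutation effects on expression). The methods summarized in the abstract additionally model auxiliary structured variation and sharing across cohorts and views, which this sketch does not attempt.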

Keywords

cancer

data integration

deep learning

low rank matrix decomposition

missing data imputation

multi-omics

Description

University of Minnesota Ph.D. dissertation. September 2024. Major: Biostatistics. Advisor: Eric Lock. 1 computer file (PDF); xx, 155 pages.

Collections

Dissertations

Suggested citation

Wang, Jiuzhou. (2024). Statistical and machine learning methods for multi-view multi-cohort biomedical data integration. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/270062.

