Finding Integrative Biomarkers from Biomedical Datasets: An application to Clinical and Genomic Data

Thesis or Dissertation


Human diseases such as cancer, diabetes, and schizophrenia are inherently complex and governed by the interplay of various underlying factors, ranging from genetic and genomic influences to environmental effects. Recent advances in high-throughput data collection technologies in bioinformatics have resulted in a dramatic increase in diverse datasets that can provide information about such disease-related factors. These types of data include DNA microarrays providing cellular information, single nucleotide polymorphisms (SNPs) providing genetic information, metabolomics data in terms of proteins and other metabolites, structural and functional brain data from magnetic resonance imaging (MRI), and electronic health records (EHRs) containing copious information about histopathological factors, demographics, and environmental effects. Despite their richness, each of these datasets provides information about only a part of the complex biological mechanism behind human diseases. Thus, effective integration of the partial information in these genomic and clinical data can help reveal disease complexities in greater detail by generating new data-driven hypotheses beyond the traditional hypotheses about biomarkers. In particular, integrative biomarkers, i.e., patterns of features that are predictive of disease and that go beyond the simple biomarkers derived from a single dataset, can lead to a customized and more effective approach to improving healthcare. This thesis focuses on addressing the key issues related to integrative biomarkers by developing new data mining approaches. One very important issue in biomarker discovery is that the models have to be easily interpretable, i.e., integrative models have to be not only predictive of the disease but also interpretable enough that domain experts can infer useful knowledge from the obtained patterns.
In one such effort to make models interpretable, domain information about disease relationships was used as prior knowledge during model development. In addition, a novel metric called I-score was proposed, using medical literature to quantify the interpretability of the obtained patterns. Another key issue in integrative biomarker discovery is that many potential relationships may be present among diverse datasets. For example, a very important type of relationship in biomarker discovery is interaction: biomarkers spanning multiple datasets whose combined features are more indicative of disease than the individual constituent factors. In particular, the individual effect of each type of factor on disease predisposition can be small and thus remain undetected by most disease association techniques performed on individual datasets. Different types of relationships are explored, and an association-analysis-based framework is proposed to discover them. The proposed framework is especially effective for discovering higher-order relationships, which cannot be found by the existing prominent integrative approaches for biomarker discovery. When applied to real datasets comprising three different types of data collected from schizophrenic and normal subjects, this approach yielded significant integrated biomarkers that are biologically relevant. Disease heterogeneity creates further issues for integrative biomarker discovery: biomarkers obtained from clinicogenomic studies may not be applicable to all patients to the same degree, i.e., a disease may consist of multiple subtypes, each occurring in a different subpopulation. Some potential reasons for disease heterogeneity are different pathways playing different roles in the same disease and confounding factors such as age, ethnicity and race, or genetic predisposition, which can be available in rich EHR data.
Most biomarker discovery techniques use full-space model development techniques, i.e., they assess the performance of biomarkers on all patients without finding the distinct subpopulations. In this thesis, more customized models were built depending on each patient's characteristics to handle disease heterogeneity. In summary, the data mining techniques developed in this thesis advance the state of the art in the integration of diverse biomedical datasets. Moreover, their application to large-scale EHR data yields significant discoveries, which can ultimately lead to new data-driven hypotheses for inferring meaningful information about complex disease mechanisms.


University of Minnesota Ph.D. dissertation. August 2015. Major: Computer Science. Advisors: Vipin Kumar, Michael Steinbach. 1 computer file (PDF); xiv, 175 pages.

Suggested citation

Dey, Sanjoy. (2015). Finding Integrative Biomarkers from Biomedical Datasets: An application to Clinical and Genomic Data. Retrieved from the University Digital Conservancy,
