Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data

In the past few decades, predictive modeling has emerged as an important tool for exploratory data analysis and decision making in health care. Predictive modeling is a commonly used statistical and data mining technique that analyzes historical and current data and generates a model to help predict future outcomes. It makes it possible to discover hidden relationships in large volumes of data and to use those insights to predict the outcome of future events and interactions. In health care, complex models can be created that combine patient information, such as demographic and clinical information from care providers, in order to predict outcomes and improve model accuracy. Predictive modeling in health care seeks out subtle data patterns to enhance decision making; for example, care providers can recommend prescription drugs and services based on a patient's profile. Although all predictive techniques have different strengths and weaknesses, model accuracy depends mostly on the raw input data and the features used to train the predictive model. Model building often requires data pre-processing to reduce the impact of skewed data or outliers, which can significantly improve performance. From hundreds of available raw data fields, a subset is selected, and those fields are pre-processed before being presented to a predictive modeling technique. For example, there can be thousands of variables consisting of genetic, clinical and demographic information for different groups of patients. Detecting the significant variables for a particular group of patients can therefore enhance model accuracy. Hence, a good predictive model often depends more on good pre-processing than on the technique used to train the model.
While the role of effective and efficient data pre-processing in predictive modeling of health care data is well understood, this thesis identifies three key challenges facing the pre-processing task. These are: 1) High dimensionality: The challenge of high dimensionality arises in diverse fields, ranging from health care and computational biology to financial engineering and risk management. This work identifies that no single feature selection strategy is robust across different families of classification or prediction algorithms. Existing feature selection techniques produce different results with different predictive models. This can be a problem when deciding on the best predictive model to use with real high-dimensional health care data, especially without domain experts. 2) Heterogeneity in the data and data redundancy: Most real-world data is heterogeneous in nature, i.e. the population consists of overlapping homogeneous groups. In health care, Electronic Health Records (EHR) data consists of diverse groups of patients with a wide range of health conditions. This thesis identifies that predictive modeling with a single learning model over heterogeneous data can lead to inconclusive results and an ineffective explanation of an outcome. It is therefore proposed in this thesis that a data segmentation/co-clustering technique is needed that extracts groups from the data while removing insignificant features and extraneous rows, resulting in improved predictive modeling with a learning model. 3) Data sparseness: When a row is created, storage is allocated for every column, irrespective of whether a value exists for a given field. This gives rise to sparse data, in which a relatively high percentage of a variable's cells are missing actual data.
In health care, not all patients undergo every possible medical diagnostic test, and lab results are equally sparse. Such sparse information or missing values cause predictive models to produce inconclusive results. One primitive technique is manual imputation of missing values by domain experts. Today, this is almost impossible because the data is huge and high dimensional. A variety of statistical and machine learning based missing value estimation techniques exist, which estimate missing values by statistical analysis of the available data set. However, most of these techniques do not consider the importance of a domain expert's opinion in estimating missing data. It is proposed in this thesis that techniques that use statistical information from the data as well as the opinion of experts can estimate missing values more effectively. This imputation procedure can result in non-sparse data that is closer to the ground truth and that improves predictive modeling. In this thesis, the following computational approaches have been proposed for handling the challenges described above for effective and improved predictive modeling: 1) For handling high-dimensional data, a novel robust rank aggregation-based feature selection technique has been developed using exclusive rank aggregation strategies by Borda (1781) and Kemeny (1959). The concept of robustness of a feature selection algorithm has been introduced, which can be defined as the property that characterizes the stability of a ranked feature set toward achieving similar classification accuracy across a wide range of classifiers. This concept has been quantified with an evaluation measure, namely the robustness index (RI). The concept of inter-rater agreement for improving the quality of the rank aggregation approach for feature selection has also been proposed in this thesis. 2) The concept of co-clustering dedicated to improving predictive modeling has been proposed.
The novel idea of Learning based Co-Clustering (LCC) has been formulated as an optimization problem for more effective and improved predictive analysis. An important property of this algorithm is that there is no need to specify the number of co-clusters. A separate model testing framework has also been proposed in this work to reduce model over-fitting and obtain more accurate results. The methodology has been evaluated on health care data as a case study, as well as on several other publicly available data sets. 3) A missing value imputation technique based on domain experts' knowledge and statistical analysis of the available data has been proposed in this thesis. The medical domain of hematopoietic stem cell transplantation (HSCT) has been chosen for the case study, and the domain experts' knowledge is the opinion of a group of stem cell transplant physicians. The machine learning approach developed can be described as rule mining with expert knowledge and similarity scoring based missing value imputation. This technique has been developed and validated using a real-world medical data set. The results demonstrate the effectiveness and utility of this technique in practice.
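
As an illustration of the rank aggregation idea mentioned in approach 1), the following is a minimal sketch of a plain Borda count over several feature rankings. It is not the thesis's implementation: the function name `borda_aggregate`, the feature names, and the three input rankings are hypothetical, and it assumes every ranking is ordered best-first over the same set of features. The robustness index and the Kemeny strategy are not shown.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine several best-first feature rankings into one via Borda count.

    Hypothetical helper for illustration only; assumes each ranking lists
    the same features, ordered from most to least important.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            # A feature at position p (0 = best) earns n - p points.
            scores[feature] += n - position
    # Highest total score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings, e.g. produced by three different feature selectors.
rankings = [
    ["age", "bmi", "glucose", "smoker"],
    ["glucose", "age", "smoker", "bmi"],
    ["age", "glucose", "bmi", "smoker"],
]

aggregated = borda_aggregate(rankings)
print(aggregated)
```

In this toy input, a feature that ranks consistently high across selectors ends up at the top of the aggregated list even when no single selector is treated as authoritative, which is the intuition behind aggregating ranks before choosing a classifier.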

Keywords

Co-clustering

Feature Selection

Health care

High Dimensional Data

Missing Value Imputation

Predictive Modeling

Description

University of Minnesota Ph.D. dissertation. August 2015. Major: Computer Science. Advisors: Jaideep Srivastava, Sarah Cooley. 1 computer file (PDF); xiii, 124 pages.

Collections

Dissertations

Suggested citation

Sarkar, Chandrima. (2015). Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/175324.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.
