Author: Brown, Benjamin
Dates: 2017-10-09; 2017-10-09; 2017-07
URI: https://hdl.handle.net/11299/190464
Description: University of Minnesota Ph.D. dissertation. July 2017. Major: Biostatistics. Advisors: Julian Wolfson, Wei Pan. 1 computer file (PDF); x, 85 pages.

Abstract: When dealing with high-dimensional data, performing variable selection in a regression model reduces statistical noise and simplifies interpretation. There are many ways to perform variable selection when standard regression assumptions are met, but few that work well when one or more of those assumptions are violated. In this thesis, we propose three variable selection methods that outperform existing methods in such "messy data" situations where standard regression assumptions are violated.

First, we introduce Thresholded EEBoost (ThrEEBoost), an iterative algorithm that applies gradient-boosting-style updates to estimating equations. Extending its progenitor, EEBoost (Wolfson, 2011), ThrEEBoost allows multiple coefficients to be updated at each iteration. The number of coefficients updated is controlled by a threshold parameter on the magnitude of the estimating equation. By allowing more coefficients to be updated at each iteration, ThrEEBoost can explore a greater diversity of variable selection "paths" (i.e., sequences of coefficient vectors) through the model space, possibly finding models with smaller prediction error than any of those on the path defined by EEBoost. In a simulation of data with correlated outcomes, ThrEEBoost reduced prediction error compared to more naive methods and the less flexible EEBoost. We also applied the method to data from the Box Lunch Study, where it reduced the error in predicting BMI from longitudinal data.

Next, we propose a novel method, MEBoost, for variable selection and prediction when covariates are measured with error. To do this, we incorporate a measurement-error-corrected score function due to Nakamura (1990) into the ThrEEBoost framework. In both simulated and real data, MEBoost outperformed the CoCoLasso (Datta and Zou, 2017), a recently proposed penalization-based approach to variable selection in the presence of measurement error, as well as the (non-measurement-error-corrected) Lasso.

Lastly, we consider the case where multiple regression assumptions may be violated simultaneously. Motivated by the idea of stacking, specifically the SuperLearner technique (van der Laan et al., 2007), we propose a novel method, Super Learner Estimating Equation Boosting (SuperBoost). SuperBoost performs variable selection in the presence of multiple data challenges by combining the results from variable selection procedures that are each tailored to address a different regression assumption violation. The ThrEEBoost framework is a natural fit for this approach, since the component "learners" (i.e., violation-specific variable selection techniques) are fairly straightforward to construct and implement using various estimating equations. We illustrate the application of SuperBoost on simulated data with both correlated outcomes and covariate measurement error, and show that it performs as well as or better than methods that address only one (or neither) of these factors.

Language: en
Keywords: Boosting; GEE; Measurement Error; Prediction; Stacking; Variable Selection
Title: Variable Selection and Prediction in "Messy" High-Dimensional Data
Type: Thesis or Dissertation
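
To make the thresholding idea in the abstract concrete, here is a minimal Python sketch of a ThrEEBoost-style update loop, assuming the simplest possible setting: an ordinary least-squares estimating equation U(beta) = X'(y - X beta), a fixed step size eps, and a threshold tau on the relative magnitude of the estimating-equation components. The function name, parameter choices, and toy data are illustrative assumptions for this record, not code or results from the dissertation.

```python
import numpy as np

def threeboost_sketch(X, y, n_iter=500, eps=0.01, tau=0.8):
    """Illustrative thresholded estimating-equation boosting sketch.

    Assumes a least-squares estimating equation U(beta) = X'(y - X beta).
    At each iteration, every coefficient whose score component is within
    a fraction `tau` of the largest absolute component receives a small
    update of size `eps`. With tau = 1 only components tied for the
    largest magnitude move (EEBoost-like behavior); smaller tau updates
    more coefficients per iteration, tracing a different selection path.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        u = X.T @ (y - X @ beta)           # estimating equation at current beta
        max_u = np.max(np.abs(u))
        if max_u == 0:                      # estimating equation solved exactly
            break
        active = np.abs(u) >= tau * max_u   # coefficients passing the threshold
        beta[active] += eps * np.sign(u[active])
    return beta

# Toy usage on hypothetical data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=100)
print(np.round(threeboost_sketch(X, y), 2))
```

In this sketch, swapping in a different estimating equation (e.g., a GEE score for correlated outcomes or a Nakamura-type corrected score for covariate measurement error) would change only the line that computes `u`, which is the flexibility the abstract attributes to the ThrEEBoost framework.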