Browsing by Subject "variable selection"
Now showing 1 - 4 of 4
Item
Applications of Model-Averaging for High Dimensional Inference (2021-05)
Wyneken, Henry
This dissertation builds up to and develops Model-Averaged Inferential Learning (MAIL), a generally useful method of inference for linear regression problems when $p \gg n$. The first chapter adds to the literature on using model averaging for variable selection diagnostics. The second chapter compares inferential results from post-selection methods to unadjusted methods from the best data-driven model. The third chapter proposes and demonstrates the theoretical and practical value of MAIL. MAIL is shown to give valid confidence intervals for the full linear targets of selected variables across a wide range of challenging simulation settings.

Item
Robust Variance Component Models and Powerful Variable Selection Methods for Addressing Missing Heritability (2018-08)
Arbet, Jaron
The development of a complex human disease is an intricate interplay of genetic and environmental factors. Broadly speaking, “heritability” is defined as the proportion of total trait variance due to genetic factors within a given population. Over the past 50 years, studies involving monozygotic and dizygotic twins have estimated the heritability of over 17,800 human traits [1]. Genetic association studies that measure thousands to millions of genetic “markers” have attempted to determine the exact markers that explain a given trait’s heritability. However, the identified set of “statistically significant” markers often fails to explain more than 10% of the estimated heritability of a trait [2], which has been termed the “missing heritability” problem [3][4]. “Missing heritability” implies that many genetic markers contributing to disease risk are still waiting to be discovered. Identifying the exact genetic markers associated with a disease is important for the development of pharmaceutical drugs that may target these markers (see [5] for recent examples).
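The twin-design heritability idea described above can be illustrated with the classical Falconer estimator, which compares monozygotic (MZ) and dizygotic (DZ) twin correlations. This is a minimal sketch of the general concept, not the dissertation's GEE2 method; the correlation values are illustrative, not taken from any study.

```python
# Falconer's formula: narrow-sense heritability is estimated as
# h^2 = 2 * (r_MZ - r_DZ), where r_MZ and r_DZ are the trait
# correlations within MZ and DZ twin pairs respectively.
# (Illustrative values only; not from the dissertation.)

def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Estimate narrow-sense heritability from twin correlations."""
    return 2.0 * (r_mz - r_dz)

# Example: MZ twins correlate 0.70 on a trait, DZ twins 0.45.
h2 = falconer_heritability(0.70, 0.45)
print(h2)  # approximately 0.5: about half the trait variance is genetic
```

The gap between such estimates and the variance explained by significant markers is exactly the "missing heritability" the abstract describes.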
Additionally, “missing heritability” may imply that we are inaccurately estimating heritability in the first place [3, 4, 6], thus motivating the development of more robust models for estimating heritability. This dissertation focuses on two objectives that attempt to address the missing heritability problem: (1) develop a more robust framework for estimating heritability; and (2) develop powerful association tests in an attempt to find more genetic markers associated with a given trait. Specifically: in Chapter 2, robust variance component models are developed for estimating heritability in twin studies using second-order generalized estimating equations (GEE2). We demonstrate that GEE2 can improve coverage rates of the true heritability parameter for non-normally distributed outcomes and can easily incorporate both mean- and variance-level covariate effects (e.g., letting heritability vary by sex or age). In Chapter 3, penalized regression is used to jointly model all genetic markers. It is demonstrated that jointly modeling all markers can improve power to detect individual associated markers compared to conventional methods that model each marker “one-at-a-time.” Chapter 4 expands on this work by developing a more flexible nonparametric Bayesian variable selection model that can account for non-linear or non-additive effects and can also test biologically meaningful groups of markers for an association with the outcome. We demonstrate how the nonparametric Bayesian method can detect markers with complex association structures that more conventional models might miss.

Item
Statistical Methods for Organ Transplant (2021-07)
McKearnan, Shannon
In this dissertation, we propose novel statistical methods to improve clinical decision support for organ transplant donors and recipients, using data from the United Network for Organ Sharing national registry.
In our first project, we develop a feature selection method for support vector regression in order to benefit from the method’s flexibility while combating overfitting. Support vector regression is advantageous due to its use of a kernel for flexibility and computational efficiency; penalized methods for feature selection limit the choice of kernel to finite-dimensional transformations and are thus insufficient. We propose a novel feature selection method for support vector regression based on a genetic algorithm that iteratively searches across potential subsets of covariates to find those that yield the best performance according to a user-defined fitness function. We apply our method to predict donor kidney function one year after transplant. In our second project, we develop an estimator for marginal survival under a dynamic treatment regime for organ transplant, where treatment is defined as the patient’s decision to accept or decline an organ when it is offered. We apply our method to kidney transplant patients to recommend organ-quality thresholds for acceptance. In our third project, we again utilize the genetic algorithm’s flexible optimization, this time to identify optimal treatment regimes. We define the treatment regime as a decision list in order to develop our method. We apply our method to identify treatment regimes for liver transplant patients who may wish to undergo a simultaneous kidney transplant.
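The genetic-algorithm subset search described in the first project can be sketched in a few lines. This is a toy illustration of the general technique (bit-string encoding, selection, crossover, mutation), not the dissertation's implementation; the fitness function below is a made-up stand-in for the user-defined fitness the abstract mentions.

```python
# Toy genetic algorithm for feature-subset selection.
# Each candidate subset is a bit mask; the top half of each generation
# survives unchanged (elitism) and breeds children by one-point
# crossover plus bit-flip mutation.
import random

def evolve(n_features, fitness, pop_size=20, generations=40,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_features):           # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: reward including features 0 and 2, with a
# small penalty per selected feature to discourage large subsets.
target = {0, 2}
def fitness(mask):
    chosen = {i for i, b in enumerate(mask) if b}
    return len(chosen & target) - 0.1 * len(chosen)

best = evolve(8, fitness)
print(best)
```

In the dissertation's setting, the fitness function would instead score a support vector regression fit on the chosen covariates, so the search is free to use any kernel.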
Overall, we develop novel methods in diverse fields of statistics tailored to the organ transplantation context, and we demonstrate their performance and meaningful clinical implications via simulations and real data examples.

Item
Understanding Gaussian Process Fits and Some Model Building Tools Using an Approximate Form of the Restricted Likelihood (2016-07)
Bose, Maitreyee
Gaussian processes (GPs) are widely used in statistical modeling, often as random effects in a linear mixed model, with their unknowns estimated by maximizing the restricted likelihood or by a closely related Bayesian analysis. However, it is unclear how a GP's variance and range and the error variance are fit to features in the data. To get a better understanding of that, we applied the spectral approximation to the intercept-only GP. The restricted likelihood from this approximate model has a simple, interpretable form, identical to the likelihood arising from a gamma-errors generalized linear model with the identity link. If there are covariates in the model, we regress them out and approximate the residuals using an intercept-only GP. Incorporating ideas from linear models, we propose a few tools for systematic model building in linear mixed models where the random effect is a Gaussian process. We present analyses of simulated data and forest inventory data using the spectral basis representation, together with added-variable plots as diagnostic tools for identifying missing covariates and assessing general goodness of fit.
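The three quantities the Bose abstract highlights (the GP's variance, its range, and the error variance) enter the model through the covariance matrix. As a minimal sketch of that role, the exact log marginal likelihood of a GP with a squared-exponential covariance can be written out for just two observations, where the 2x2 algebra is explicit. This is standard GP machinery, not the dissertation's spectral approximation; the parameter values are illustrative.

```python
# Exact GP log marginal likelihood for two observations.
# Covariance: K = gp_var * R + err_var * I, where R has ones on the
# diagonal and exp(-(dist/gp_range)^2) off the diagonal.
import math

def gp_loglik_2pts(y1, y2, dist, gp_var, gp_range, err_var):
    off = gp_var * math.exp(-(dist / gp_range) ** 2)
    a = gp_var + err_var                     # diagonal entry of K
    det = a * a - off * off                  # 2x2 determinant
    # quadratic form y' K^{-1} y via the closed-form 2x2 inverse
    quad = (a * (y1 * y1 + y2 * y2) - 2 * off * y1 * y2) / det
    return -0.5 * (quad + math.log(det) + 2 * math.log(2 * math.pi))

# A longer range makes nearby observations more correlated, so two
# similar nearby values should be more likely under a long range.
short = gp_loglik_2pts(1.0, 1.1, dist=0.5, gp_var=1.0, gp_range=0.2, err_var=0.1)
longr = gp_loglik_2pts(1.0, 1.1, dist=0.5, gp_var=1.0, gp_range=2.0, err_var=0.1)
print(longr > short)
```

How the maximizer trades these three parameters off against features of the data is exactly the question the dissertation's approximate restricted likelihood is built to make interpretable.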