On variable selection diagnostics for high-dimensional regression models

Abstract

Because model selection is ubiquitous in data analysis, the reproducibility of statistical results requires that we be able to evaluate the reliability of the employed model selection method, regardless of the model’s apparent good properties. Instability measures have been proposed for evaluating model selection uncertainty. However, low instability does not necessarily indicate that the selected model is trustworthy, because low instability can also arise when a method tends to select an overly parsimonious model. F- and G-measures have become increasingly popular for assessing variable selection performance in theoretical studies and simulation results. However, they are not computable in practice because they depend on the unknown true model. In this dissertation work, we propose an estimation method for F- and G-measures and prove that the resulting estimators are uniformly consistent. This gives the data analyst a valuable tool to compare different variable selection methods based on the data at hand. Extensive simulations are conducted to show the very good finite-sample performance of our approach. We apply our methods to several microarray gene expression data sets, with intriguing results.

We also extend the work of Nan and Yang (2014) on variable selection deviation (VSD) measures and Yu et al. (2022) on F- and G-measures to a broader class of models in the exponential dispersion family, including, for example, the Poisson and compound Poisson-gamma models. In particular, we consider the Tweedie family of models, which possesses a power mean-variance relationship, because of its wide spectrum of applications in fields such as insurance, ecology, political science, and health and biomedical studies. For the Poisson and Tweedie regression models, we propose methods based on information criteria and adaptive regression by mixing (ARM) to compute weights for the candidate models that adapt to their predictive performance.
Our extensive empirical studies show that the proposed diagnostic measures (including VSD, F- and G-measures) are reasonable metrics of variable selection performance and that the weighting methods work very well in recovering the true variable selection deviations. An R package named PAVI is developed to calculate the various variable selection diagnostic measures for all members of the generalized linear model family. The three most widely used weighting methods, based on AIC, BIC and ARM, are supported. A parallel computation mechanism and procedures for dealing with convergence issues of the numerical optimization algorithm in R’s glm function are implemented so that the weighting procedures run smoothly. Extensive numerical experiments conducted using this package show that it is stable and delivers the expected results.
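
To make the measures discussed in the abstract concrete, the following is a minimal Python sketch, not the dissertation's implementation and not the PAVI package's API. It assumes the common definitions of the F-measure (harmonic mean of precision and recall) and the G-measure (geometric mean of precision and recall) for a selected variable set compared against a reference set. It also assumes that an estimate can be formed as a weighted average over candidate models, with weights proportional to exp(-IC/2) for an information criterion such as AIC or BIC. All function names are hypothetical, and the ARM weighting is not shown.

```python
import math


def precision_recall(selected, reference):
    """Precision and recall of a selected variable set against a reference set."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def f_measure(selected, reference):
    """Harmonic mean of precision and recall (assumed definition)."""
    p, r = precision_recall(selected, reference)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def g_measure(selected, reference):
    """Geometric mean of precision and recall (assumed definition)."""
    p, r = precision_recall(selected, reference)
    return math.sqrt(p * r)


def ic_weights(ic_values):
    """Normalized weights proportional to exp(-IC/2); an assumed weighting form.

    Subtracting the minimum first keeps the exponentials numerically stable.
    """
    best = min(ic_values)
    raw = [math.exp(-(v - best) / 2) for v in ic_values]
    total = sum(raw)
    return [x / total for x in raw]


def estimated_measure(measure, selected, candidates, weights):
    """Weighted average of a measure, treating each candidate model as the reference."""
    return sum(w * measure(selected, c) for w, c in zip(weights, candidates))


if __name__ == "__main__":
    selected = {1, 2}                      # variables chosen by some selection method
    candidates = [{1, 2}, {1}]             # hypothetical candidate models
    weights = ic_weights([100.0, 100.0])   # equal criteria give equal weights
    print(estimated_measure(f_measure, selected, candidates, weights))
```

Because the true model is unknown in practice, the sketch replaces it with weighted candidate models; how closely this mirrors the estimators studied in the dissertation is an assumption.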

Description

University of Minnesota Ph.D. dissertation. January 2023. Major: Statistics. Advisor: Yuhong Yang. 1 computer file (PDF); x, 117 pages.

Suggested Citation

Yu, Yanjia. (2023). On variable selection diagnostics for high-dimensional regression models. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/271373.
