On variable selection diagnostics for high-dimensional regression models

Because model selection is ubiquitous in data analysis, reproducible statistical results require that we be able to evaluate the reliability of the model selection method employed, regardless of the selected model's apparently good properties. Instability measures have been proposed for evaluating model selection uncertainty. However, low instability does not necessarily indicate that the selected model is trustworthy, because it can also arise when a method tends to select an overly parsimonious model. F- and G-measures have become increasingly popular for assessing variable selection performance in theoretical studies and simulations. However, they are not computable in practice, since they depend on the unknown true model. In this dissertation work, we propose an estimation method for F- and G-measures and prove that the resulting estimators are uniformly consistent. This gives the data analyst a valuable tool for comparing different variable selection methods based on the data at hand. Extensive simulations demonstrate the good finite-sample performance of our approach. We apply our methods to several microarray gene expression data sets, with intriguing results.
We also extend the work of Nan and Yang (2014) on variable selection deviation (VSD) measures and of Yu et al. (2022) on F- and G-measures to a broader class of models in the exponential dispersion family, including, for example, the Poisson and compound Poisson-gamma models. In particular, we consider the Tweedie family of models, which possesses a power mean-variance relationship, because of its wide spectrum of applications in fields such as insurance, ecology, political science, and health and biomedical studies. For the Poisson and Tweedie regression models, we propose methods based on information criteria and adaptive regression by mixing (ARM) to compute candidate-model weights that adapt to the models' predictive performance.
Our extensive empirical studies show that the proposed diagnostic measures (VSD, F- and G-measures) are reasonable metrics of variable selection performance and that the weighting methods work very well in recovering the true variable selection deviations. An R package named PAVI is developed to calculate the various variable selection diagnostic measures for all members of the generalized linear model family. The three most widely used weighting methods, based on AIC, BIC and ARM, are supported. A parallel computation mechanism and procedures for handling convergence issues of the numerical optimization algorithm in R's glm function are implemented so that the weighting procedures run smoothly. Extensive numerical experiments conducted with this package show that it is stable and delivers the expected results.
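The abstract gives no formulas, so the following is only a minimal sketch of the idea, not the dissertation's method or the PAVI package. It assumes the usual set-based definitions: the F-measure as the harmonic mean, and the G-measure as the geometric mean, of precision and recall of the selected variables against a reference set. It also assumes that the estimate replaces the unknown true model by a weighted average over candidate models, with weights proportional to exp(-IC/2) for an information criterion such as AIC or BIC. All function names, the empty-set conventions, and the toy inputs are hypothetical.

```python
import math


def f_measure(selected, reference):
    """Harmonic mean of precision and recall of `selected` against `reference`."""
    selected, reference = set(selected), set(reference)
    if not selected and not reference:
        return 1.0  # assumed convention: two empty models agree perfectly
    return 2 * len(selected & reference) / (len(selected) + len(reference))


def g_measure(selected, reference):
    """Geometric mean of precision and recall of `selected` against `reference`."""
    selected, reference = set(selected), set(reference)
    if not selected and not reference:
        return 1.0  # assumed convention, as above
    if not selected or not reference:
        return 0.0  # assumed convention: no overlap is possible
    return len(selected & reference) / math.sqrt(len(selected) * len(reference))


def ic_weights(ic_values):
    """Weights proportional to exp(-IC/2), normalized to sum to one.

    Subtracting the minimum first avoids underflow for large IC values.
    """
    best = min(ic_values)
    raw = [math.exp(-(value - best) / 2) for value in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


def estimated_measure(measure, selected, candidates, weights):
    """Weighted average of `measure` over candidate models standing in for the truth."""
    return sum(w * measure(selected, c) for c, w in zip(candidates, weights))


if __name__ == "__main__":
    # Hypothetical candidate models and their (made-up) information criterion values.
    candidates = [{"x1", "x2"}, {"x1", "x2", "x3"}]
    weights = ic_weights([100.0, 100.0])
    print(estimated_measure(f_measure, {"x1", "x2"}, candidates, weights))
    print(estimated_measure(g_measure, {"x1", "x2"}, candidates, weights))
```

With a known reference set, as in a simulation study, `f_measure` and `g_measure` can be called directly. `estimated_measure` illustrates how a data-driven version might be formed when the true model is unavailable.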

Keywords

F and G Measures

Generalized Linear Models

High-dimensional Statistics

Model Averaging

Reproducibility

Variable Selection

Description

University of Minnesota Ph.D. dissertation. January 2023. Major: Statistics. Advisor: Yuhong Yang. 1 computer file (PDF); x, 117 pages.

Collections

Dissertations

Suggested Citation

Yu, Yanjia. (2023). On variable selection diagnostics for high-dimensional regression models. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/271373.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.

University of Minnesota Twin Cities
