Classification has long been an important research topic for statisticians. Nowadays, scientists are further challenged by classification problems for high-dimensional datasets in fields ranging from genomics and economics to machine learning. For such massive datasets, classical classification techniques may be inefficient or even infeasible, and new techniques are highly sought after.
My dissertation tackles high-dimensional classification problems through variable selection. In particular, three methods are proposed and studied: direct sparse discriminant analysis, semiparametric sparse discriminant analysis, and the Kolmogorov filter.
In proposing direct sparse discriminant analysis (DSDA), I first point out a drawback of many current methods: they ignore the correlation structure among predictors. DSDA is then proposed to extend the well-known linear discriminant analysis to high dimensions while fully respecting the correlation structure. The method is efficient and consistent, with excellent numerical performance. I also study its connection to other popular proposals for linear discriminant analysis in high dimensions, including the L1-Fisher's discriminant analysis and the sparse optimal scoring.
Semiparametric sparse discriminant analysis (SeSDA) extends DSDA by relaxing the normality assumption, which is fundamental to any method that relies on the linear discriminant analysis model. SeSDA is more robust than DSDA while preserving its good properties. Along with the development of SeSDA, a new concentration inequality is obtained that provides theoretical justification for methods based on Gaussian copulas.
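To illustrate the Gaussian copula idea behind relaxing normality, the sketch below maps one predictor to approximate normal scores through its ranks, so that any strictly increasing transformation of the data gives the same result. It is only an illustration under my own assumptions: the function name `normal_scores` is hypothetical, ties are not handled, and the rank-based estimator shown here is not necessarily the one used in SeSDA.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map a sample to approximate standard normal scores via its ranks.

    Hypothetical helper for illustration; assumes no tied values.
    """
    n = len(values)
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    standard_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined.
        scores[i] = standard_normal.inv_cdf(rank / (n + 1))
    return scores
```

Because only ranks enter the computation, applying the function to a heavily skewed predictor and to its logarithm yields identical scores, which is the kind of robustness to non-normal marginals that the semiparametric model aims for.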
Moreover, the Kolmogorov filter is proposed as a fully nonparametric method that performs variable selection for high-dimensional classification. It requires minimal assumptions on the distribution of the predictors, and is supported by both theoretical results and numerical examples. Finally, potential future work on variable selection in classification is discussed.
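As a rough illustration of nonparametric screening in the spirit of the Kolmogorov filter, the sketch below scores each predictor by the two-sample Kolmogorov-Smirnov distance between its class-conditional empirical distributions and keeps the highest-scoring predictors. The function names, the 0/1 label coding, and the fixed cutoff `n_keep` are assumptions made for this example; the abstract does not specify these details.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max over x of |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    # The largest gap between two empirical CDFs occurs at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def kolmogorov_filter(rows, labels, n_keep):
    """Rank predictors by KS distance between classes 0 and 1; keep the top n_keep.

    rows: list of observations, each a list of predictor values.
    labels: list of class labels coded 0 or 1 (an assumption of this sketch).
    Returns (indices of kept predictors, KS score of every predictor).
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        class0 = [r[j] for r, y in zip(rows, labels) if y == 0]
        class1 = [r[j] for r, y in zip(rows, labels) if y == 1]
        scores.append(ks_statistic(class0, class1))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep], scores
```

Each predictor is scored on its own and the score depends only on the ordering of the observations, so the screening step makes no parametric assumption about the predictors and scales linearly in their number.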