Three Topics in Statistical Learning

2022-07
Type

Thesis or Dissertation

Abstract

This dissertation discusses three topics in statistical learning. First, we propose an empirical Bayes method for spike inference, the process of inferring the times of neuron activity from the noisy output of calcium imaging, a popular tool for monitoring neuronal activity. We show that our method significantly outperforms a state-of-the-art method developed by Jewell et al. (2020) at comparable computation time. We also demonstrate that the improvement is partly due to better modeling and estimation of the prior distribution on spikes, whose importance had not been recognized in the spike inference literature. Second, we consider high-dimensional linear regression with error-contaminated covariates. Several sparse regularized regression methods have recently been developed under the assumption that the covariance matrix of the measurement errors is known or can be accurately estimated, knowledge that is often unavailable in practice. Moreover, existing work has focused on estimating the regression coefficients rather than on prediction. We overcome these limitations by adopting a different type of sparsity assumption: we represent the conditional mean as a linear combination of principal components and assume that the corresponding coefficient vector lies in an $\ell_q$ ball. Under this assumption, we show that an $\ell_1$-penalized method applied to error-contaminated data attains the clean-data minimax-optimal prediction rate, without requiring any knowledge of the measurement-error covariance matrix. Our theoretical result also reveals an interesting blessing of dimensionality: the impact of measurement errors on prediction performance diminishes as the number of covariates increases. Finally, we discuss minimax estimation with imbalanced binary data. In a wide range of binary prediction and estimation tasks, the two classes have severely imbalanced sample sizes, which greatly hinders the performance of standard machine learning methods. Despite a vast collection of methods aiming to achieve better performance on imbalanced data, the theoretical limit of estimation with imbalanced data has remained unknown. This chapter provides insight into the imbalanced classification problem by establishing the minimax risk of log-odds function estimation. Our minimax bounds reveal a notion of effective sample size. We further construct a sampling technique and prove that a minimax-rate-optimal method for balanced data, combined with this sampling technique, achieves minimax-rate performance on imbalanced data.
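The second topic's idea — fit the conditional mean through principal components of the noisy design with an $\ell_1$ penalty on the component coefficients — can be illustrated with a toy sketch. Everything below (dimensions, noise levels, the simple coordinate-descent solver, all variable names) is an illustrative assumption, not the dissertation's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate clean covariates with correlation, then contaminate with
# additive measurement error (the setting described in the abstract).
n, p = 200, 50
X_clean = rng.normal(size=(n, p)) @ rng.normal(size=(p, p)) / np.sqrt(p)
beta = np.zeros(p)
beta[:5] = 1.0
y = X_clean @ beta + 0.1 * rng.normal(size=n)
W = X_clean + 0.3 * rng.normal(size=(n, p))  # observed, error-contaminated

# Principal-component scores of the noisy design (top k components).
k = 10
Wc = W - W.mean(axis=0)
U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
scores = U[:, :k] * s[:k]  # n x k matrix of PC scores

def lasso_cd(Z, y, lam, n_iter=200):
    """Lasso via coordinate descent with soft-thresholding (toy solver)."""
    theta = np.zeros(Z.shape[1])
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(Z.shape[1]):
            r = y - Z @ theta + Z[:, j] * theta[j]  # partial residual
            rho = Z[:, j] @ r
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return theta

# l1-penalized regression on the PC scores; predict with the fitted scores.
theta = lasso_cd(scores, y - y.mean(), lam=5.0)
y_hat = y.mean() + scores @ theta
print("in-sample MSE:", np.mean((y - y_hat) ** 2))
```

Note that the sketch never estimates the measurement-error covariance: the principal components are taken directly from the contaminated design, which mirrors the abstract's point that no such knowledge is required.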

Description

University of Minnesota Ph.D. dissertation. July 2022. Major: Statistics. Advisor: Hui Zou. 1 computer file (PDF); v, 95 pages.

Suggested citation

Song, Yang. (2022). Three Topics in Statistical Learning. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/241707.