Three Topics in Statistical Learning
2022-07
Authors
Song, Yang
Type
Thesis or Dissertation
Abstract
This dissertation discusses three topics in statistical learning. First, we propose an empirical Bayes method for spike inference, the process of inferring neuron activity times from the noisy output of calcium imaging, a popular tool for monitoring neuron activity. We show that the proposed method significantly outperforms a state-of-the-art method developed by Jewell et al. (2020) at comparable computational cost. We also demonstrate that the performance improvement stems in part from better modeling and estimation of the prior distribution on spikes, whose importance has not been recognized in the spike inference literature. Second, we consider high-dimensional linear regression with error-contaminated covariates. Several sparse regularized regression methods have recently been developed under the assumption that the covariance matrix of the measurement errors is either known or can be accurately estimated, an assumption that often fails in practice. Moreover, the existing work has focused on estimating the regression coefficients rather than on prediction. We overcome these limitations by adopting a different sparsity assumption for high-dimensional linear regression: we represent the conditional mean as a linear combination of principal components. Assuming the corresponding coefficient vector lies in an $\ell_q$ ball, we show that an $\ell_1$-penalized method applied to error-contaminated data attains the clean-data minimax-optimal prediction rate, without any knowledge of the covariance matrix of the measurement errors. Our theoretical result also reveals an interesting blessing of dimensionality: the impact of measurement errors on prediction performance diminishes as the number of covariates grows. Finally, we discuss minimax estimation with imbalanced binary data.
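The second topic can be illustrated with a minimal sketch (not the dissertation's code; all names, dimensions, and tuning values below are illustrative assumptions): fit an $\ell_1$-penalized regression on the leading principal-component scores of the noisy covariates, never estimating the measurement-error covariance.

```python
# Sketch: l1-penalized regression on principal component scores of
# error-contaminated covariates. Illustrative only; dimensions and
# penalty levels are arbitrary choices, not the dissertation's.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 500                       # high-dimensional setting: p > n

# Clean covariates with an approximate low-rank (factor) structure.
factors = rng.normal(size=(n, 5))
loadings = rng.normal(size=(5, p))
X_clean = factors @ loadings + 0.1 * rng.normal(size=(n, p))

# Linear conditional mean with a sparse coefficient vector.
beta = np.zeros(p)
beta[:10] = 1.0
y = X_clean @ beta / np.sqrt(p) + rng.normal(size=n)

# Observed covariates are contaminated by additive measurement error
# with an unknown covariance; the procedure never estimates it.
W = X_clean + rng.normal(scale=0.5, size=(n, p))

# l1-penalized regression on the leading PC scores of the noisy data.
pca = PCA(n_components=20).fit(W)
scores = pca.transform(W)
model = Lasso(alpha=0.05).fit(scores, y)
preds = model.predict(scores)
print("in-sample MSE:", np.mean((preds - y) ** 2))
```

The point of the sketch is structural: the lasso penalty acts on the coefficient vector in the principal-component basis, which is where the $\ell_q$-ball sparsity assumption lives, so no plug-in correction for the error covariance is needed.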
In a wide range of binary prediction and estimation tasks, the data exhibit a significant imbalance between the sample sizes of the two classes, which greatly hinders the performance of standard machine learning methods. Despite a vast collection of methods aiming for better performance on imbalanced data, the theoretical limit of estimation with imbalanced data remains unknown. This chapter provides insight into the imbalanced classification problem by establishing the minimax risk of log-odds function estimation. Our minimax bounds reveal a notion of effective sample size. We further construct a sampling technique and prove that a minimax-rate optimal method for balanced data, combined with this sampling technique, achieves minimax-rate performance on imbalanced data.
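As a hedged illustration of the idea (this is a generic undersampling sketch, not the dissertation's exact sampling scheme or its theoretical construction), one can subsample the majority class so that a method designed for balanced data becomes applicable; the minority-class count then plays the role of an effective sample size.

```python
# Sketch: undersample the majority class to obtain a balanced dataset.
# Generic illustration only; the dissertation's sampling technique and
# theory are not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(10_000) < 0.02).astype(int)    # roughly 2% positives
X = rng.normal(size=(y.size, 3)) + y[:, None]  # class-shifted features

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Draw as many majority examples as there are minority examples.
neg_sub = rng.choice(neg, size=pos.size, replace=False)
idx = np.concatenate([pos, neg_sub])

X_bal, y_bal = X[idx], y[idx]
print("balanced class counts:", np.bincount(y_bal))
```

After balancing, any estimator tuned for balanced data can be run on `(X_bal, y_bal)`; the total balanced sample size is twice the minority count, matching the intuition that the rare class limits what is statistically achievable.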
Description
University of Minnesota Ph.D. dissertation. July 2022. Major: Statistics. Advisor: Hui Zou. 1 computer file (PDF); v, 95 pages.
Suggested citation
Song, Yang. (2022). Three Topics in Statistical Learning. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/241707.