Title: Three Topics in Statistical Learning
Author: Song, Yang
Type: Thesis or Dissertation
Issued: 2022-07
Available: 2022-09-26
URI: https://hdl.handle.net/11299/241707
Description: University of Minnesota Ph.D. dissertation. July 2022. Major: Statistics. Advisor: Hui Zou. 1 computer file (PDF); v, 95 pages.
Language: en

Abstract:
This dissertation discusses three topics in statistical learning.

First, we propose an empirical Bayes method for spike inference, the task of inferring neurons' spike times from the noisy output of calcium imaging, a popular tool for monitoring neuron activity. We show that it significantly outperforms a state-of-the-art method developed by Jewell et al. (2020) at a comparable computational cost. We also demonstrate that the performance gain is partly due to our better modeling and estimation of the prior distribution on spikes, whose importance has not been recognized in the spike inference literature.

Second, we consider high-dimensional linear regression with error-contaminated covariates. Several sparse regularized regression methods have recently been developed under the assumption that the covariance matrix of the measurement errors is either known or can be accurately estimated, which is rarely the case in practice. Moreover, the existing work focuses only on estimating the regression coefficients, not on prediction. We overcome these limitations by adopting a different sparsity assumption: we represent the conditional mean as a linear combination of principal components and assume the corresponding coefficient vector resides in an $\ell_q$ ball. Under this assumption, we show that an $\ell_1$-penalized method applied to error-contaminated data attains the clean-data minimax-optimal prediction rate, without requiring any knowledge of the covariance matrix of the measurement errors. Our theoretical result also reveals an interesting blessing of dimensionality: the impact of measurement errors on prediction performance diminishes as the number of covariates increases.

Finally, we discuss minimax estimation with imbalanced binary data. In a wide range of binary prediction and estimation tasks, the data exhibit a significant imbalance between the sample sizes of the two classes, which greatly hinders the performance of standard machine learning methods. Despite a vast collection of methods aiming to achieve better performance on imbalanced data, the theoretical limit of estimation with imbalanced data remains unknown. This chapter provides insight into the imbalanced classification problem by establishing the minimax risk of log-odds function estimation. Our minimax bounds reveal a notion of effective sample size. We further construct a sampling technique and prove that a minimax-rate-optimal method for balanced data, combined with this sampling technique, achieves minimax-rate performance on imbalanced data.
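
To make the second topic concrete, here is a minimal Python sketch of the general idea described in the abstract: project the noisy covariates onto their leading principal components and fit an $\ell_1$-penalized linear model on the resulting scores. The data-generating process, the number of retained components, and the penalty level below are all illustrative assumptions, not the dissertation's actual construction or tuning.

```python
# Illustrative sketch (not the dissertation's code): l1-penalized regression
# on principal-component scores of error-contaminated covariates.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 500

# Latent clean design with correlated columns (assumed setup).
A = rng.standard_normal((p, p)) / np.sqrt(p)
X = rng.standard_normal((n, p)) @ A
W = X + 0.3 * rng.standard_normal((n, p))   # observed, error-contaminated

beta = np.zeros(p)
beta[:5] = 1.0                              # sparse true coefficients
y = X @ beta + 0.1 * rng.standard_normal(n)

# Project the noisy covariates onto their leading principal components,
# then fit an l1-penalized linear model on the scores.
pca = PCA(n_components=50).fit(W)
Z = pca.transform(W)
fit = Lasso(alpha=0.05).fit(Z, y)
y_hat = fit.predict(Z)
```

Note that the fit uses only the contaminated matrix W; no estimate of the measurement-error covariance enters the procedure, which mirrors the abstract's claim that such knowledge is not required.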
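
The abstract does not specify the third chapter's sampling technique, so the following is only a generic stand-in: the simplest such device, down-sampling the majority class so that a balanced-data estimator sees roughly equal class sizes. The function name and all parameters are hypothetical.

```python
# Generic down-sampling stand-in (not the dissertation's specific scheme).
import numpy as np

def downsample_majority(X, y, seed=0):
    """Class-balanced subsample of (X, y) with labels in {0, 1}: keep all
    minority points and an equal-size random draw from the majority class."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    minority, majority = (idx0, idx1) if len(idx0) <= len(idx1) else (idx1, idx0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

# Example: 1000 majority vs. 50 minority points -> 100 balanced samples.
X = np.random.default_rng(1).standard_normal((1050, 3))
y = np.r_[np.zeros(1000, dtype=int), np.ones(50, dtype=int)]
Xb, yb = downsample_majority(X, y)
```

In the spirit of the abstract, one would feed (Xb, yb) to any minimax-rate-optimal method for balanced data; the dissertation proves that its (more refined) sampling construction preserves minimax-rate performance on the original imbalanced problem.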