Gu, Qilong
Stochastic Algorithms for Large Scale Machine Learning
Thesis or Dissertation
2020-09-22 / 2020-04
https://hdl.handle.net/11299/216391
University of Minnesota Ph.D. dissertation. April 2020. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); xi, 162 pages.

Over the last few years, advances in scalable algorithms, novel learning models, and machine learning software frameworks have led to breakthroughs in computer vision, natural language processing, and recommendation systems. As we enter the era of big data, large scale machine learning has become increasingly important for training large models on large datasets. However, even with modern distributed systems, the amount of data used today cannot be fully loaded into memory or manipulated directly. Stochastic optimization algorithms reduce the amount of model and data that must be handled in each step of the learning process, and therefore play a key role in building large scale machine learning systems. Modern machine learning faces two kinds of big data: (1) high-dimensional data, where the number of features is much larger than the sample size; and (2) data with a massive sample size. The two settings require different treatments. For (1), the challenge lies in exploiting the intrinsic structure of the model. Norm regularized estimators such as the Lasso are widely used to find low-complexity parameters with high statistical accuracy, and more complicated structures can be built from simple ones such as sparsity and low rank, yielding models that are more robust than singly structured ones. For (2), computation requires carefully designed algorithms that use only a small portion of the data in each operation. Stochastic gradient descent (SGD) and its variants are well known for their practical effectiveness, and prior information can be incorporated to improve their efficiency while maintaining convergence guarantees. In this thesis, we target a unified understanding of stochastic algorithms in both settings.

For high-dimensional data, we consider estimation with superposition structure and with locally low-rank (LLR) structure, both of which are built from simple structures such as sparsity and low-rankness. A parameter is superposition structured if it can be written as a sum of multiple component parameters, each with its own structure. We present a novel estimator that allows the sum of any number of component parameters to be accurately estimated, characterize the sample complexity of the estimator, and give high-probability non-asymptotic bounds on the componentwise estimation error. A matrix is locally low-rank if it contains submatrices that are low rank. We consider a convex relaxation of the LLR structure and propose an efficient algorithm based on dual projected gradient descent (D-PGD) for computing the associated proximal operator. We show that the proposed D-PGD algorithm has a geometric convergence rate, and we present several practical ways to further speed up the computation.

For massive sample sizes, we develop several novel stochastic optimization algorithms for large empirical risk minimization problems. We study sketched iterative algorithms for the generalized linear model, in particular sketched projected gradient descent (S-PGD) and sketched stochastic variance reduced gradient (S-SVRG) for the structured generalized linear model, and show that these methods converge geometrically to the statistical error under suitable assumptions.
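As a rough illustration of the sketched-update idea, here is a minimal Python sketch for an assumed sparse least-squares setup. It is not the exact S-PGD algorithm analyzed in the thesis; the sketch size m, step size eta, the per-iteration Gaussian sketch, and the helper project_l1_ball are illustrative assumptions.

    import numpy as np

    def project_l1_ball(v, radius):
        """Euclidean projection of v onto the l1 ball {x : ||x||_1 <= radius}."""
        if np.abs(v).sum() <= radius:
            return v
        u = np.sort(np.abs(v))[::-1]
        cssv = np.cumsum(u) - radius
        rho = np.nonzero(u > cssv / np.arange(1, len(u) + 1))[0][-1]
        theta = cssv[rho] / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    def sketched_pgd(X, y, radius, m, eta=1e-3, iters=200, seed=0):
        """Toy sketched projected gradient descent for 0.5 * ||y - X w||^2
        subject to ||w||_1 <= radius. eta should be on the order of 1/||X||_2^2."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(iters):
            S = rng.standard_normal((m, n))              # fresh Gaussian sketch (a toy choice)
            SX, Sr = S @ X, S @ (X @ w - y)              # sketched data and residual
            grad = SX.T @ Sr / m                         # approximates X^T (X w - y)
            w = project_l1_ball(w - eta * grad, radius)  # enforce the sparse structure
        return w

With a sketch size m much smaller than n, each gradient step only touches an m-by-p compressed version of the data rather than the full n-by-p matrix.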
Second, we propose the Yet Another Adaptive SGD (YAA-SGD) algorithm, which is based on the behavior of the expected change in function value expressed in terms of the first and second moments of the stochastic gradients. YAA-SGD step sizes depend on the ratio of the first and second moments and do not need to be monotonically decreasing. Further, the algorithm provably converges at an $O(1/\sqrt{T})$ rate for non-convex problems. We also present empirical results supporting the effectiveness of the algorithm.
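One plausible instantiation of this moment-ratio idea is sketched below in Python. It is not necessarily the exact YAA-SGD update from the thesis; the exponential averaging with decay beta, the base step eta0, and the function name adaptive_moment_ratio_sgd are illustrative assumptions.

    import numpy as np

    def adaptive_moment_ratio_sgd(grad_fn, w0, eta0=0.5, beta=0.9,
                                  iters=1000, eps=1e-8, seed=0):
        """Adaptive SGD whose step size tracks the ratio of running estimates of
        the first moment (mean gradient) and second moment (mean squared norm)."""
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        m = np.zeros_like(w)   # running estimate of E[g]
        v = 0.0                # running estimate of E[||g||^2]
        for _ in range(iters):
            g = grad_fn(w, rng)                        # stochastic gradient at w
            m = beta * m + (1 - beta) * g
            v = beta * v + (1 - beta) * float(g @ g)
            eta = eta0 * float(m @ m) / (v + eps)      # moment ratio; need not be monotone
            w = w - eta * g
        return w

Here grad_fn(w, rng) would return a minibatch gradient of a possibly non-convex loss. When gradient noise dominates (v much larger than ||m||^2) the step shrinks, and when successive gradients agree the step can grow again, so the resulting schedule is not monotonically decreasing.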