Browsing by Subject "SGD"
Now showing 1 - 2 of 2
Item: Empirical Analysis of Optimization and Generalization of Deep Neural Networks (2022-03) Li, Xinyan

Deep neural networks (DNNs) have gained increasing attention and popularity in the past decade, mainly due to their tremendous success in numerous commercial, scientific, and societal tasks. Despite this success in practice, several aspects of their optimization dynamics and generalization are still not well understood. In practice, DNNs are usually heavily over-parameterized, with far more parameters than training samples, making it easy for them to memorize all of the training examples without learning; indeed, Zhang et al. have shown that DNNs can fit training data perfectly. Training DNNs also requires first-order optimization methods such as gradient descent (GD) and stochastic gradient descent (SGD) to solve a highly non-convex optimization problem. The fact that such heavily over-parameterized DNNs trained by simple GD/SGD are still able to learn and generalize well deeply puzzles the deep learning community. In this thesis, we explore the optimization dynamics and generalization behavior of over-parameterized DNNs trained by SGD from two unique directions.

First, we study the topology of the loss landscape of these DNNs through analysis of the Hessian of the training loss (with respect to the parameters). We empirically study the second-moment matrix $M_t$ constructed from the outer products of the stochastic gradients (SGs), as well as the Hessian of the loss $H_f(\theta_t)$. With the help of existing tools such as the Lanczos method and the R-operator, we can efficiently compute the eigenvalues and corresponding eigenvectors of both the Hessian and the second-moment matrix, which allows us to reveal the relationship between the Hessian of the loss and the second moment of the SGs. In addition, we discover a ``low-rank'' structure both in the eigenvalue spectrum of the Hessian and in the stochastic gradients themselves. These observations directly lead to a new PAC-Bayes generalization bound that accounts for the structure of the Hessian at minima obtained by SGD, as well as a novel noisy truncated stochastic gradient descent (NT-SGD) algorithm that aims to tackle the communication bottleneck in the large-scale distributed setting.

Next, we dive into the debate on whether the success of DNNs in practice can be sufficiently explained by their behavior in the infinite-width limit. On one hand, there is a rich literature on the infinite-width limit of DNNs. Such analyses simplify the learning dynamics of very wide neural networks to those of a linear model obtained from a first-order Taylor expansion around the initial parameters; as a result, training occurs in a ``lazy'' regime. On the other hand, both theoretical and empirical evidence has been presented pointing out the limitations of lazy training. Those results suggest that training DNNs with gradient descent actually occurs in a ``rich'' regime that captures much richer inductive biases, and that the behavior of such models cannot be fully described by their infinite-width kernel equivalents. As an empirical complement to recent work studying the transition from the lazy regime to the rich regime, we study the generalization and optimization behavior of commonly used DNNs, varying the width and, to some extent, the depth, and show what happens in the rich regime for typical DNNs used in practice. We also extensively study the smallest eigenvalues of the Neural Tangent Kernel, a crucial quantity appearing in many recent theoretical analyses of both the training and generalization of DNNs. We hope that our empirical study can provide fodder for new theoretical advances in understanding generalization and optimization in both the rich and lazy regimes.
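The matrix-free eigen-analysis mentioned in this abstract (the Lanczos method combined with the R-operator) can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not the thesis code: it estimates the top eigenvalues of the training-loss Hessian using Hessian-vector products computed by double backpropagation (Pearlmutter's trick) and SciPy's Lanczos-based `eigsh`. The function name, the model, loss, and data arguments are placeholders, and everything is assumed to live on CPU; the same matrix-free idea applies to the second-moment matrix $M_t$, since $M_t v$ only requires the inner products $g_i^\top v$.

```python
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

def hessian_top_eigs(model, loss_fn, inputs, targets, k=10):
    """Estimate the k largest eigenvalues/eigenvectors of the loss Hessian
    without ever forming the Hessian (illustrative sketch, CPU only)."""
    params = [p for p in model.parameters() if p.requires_grad]
    n = sum(p.numel() for p in params)

    loss = loss_fn(model(inputs), targets)
    # First-order gradients with create_graph=True so they can be differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(v):
        # Hessian-vector product H v via double backprop (R-operator / Pearlmutter trick).
        v = torch.as_tensor(np.asarray(v).ravel(), dtype=torch.float32)
        chunks, offset = [], 0
        for p in params:
            chunks.append(v[offset:offset + p.numel()].view_as(p))
            offset += p.numel()
        dot = sum((g * c).sum() for g, c in zip(grads, chunks))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]).detach().cpu().numpy().astype(np.float64)

    # eigsh runs (implicitly restarted) Lanczos on the matrix-free operator.
    op = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    eigvals, eigvecs = eigsh(op, k=k, which="LA")
    return eigvals, eigvecs
```

Because only matrix-vector products are needed, the $n \times n$ Hessian is never materialized, which is what makes spectral analysis of heavily over-parameterized networks tractable.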
Item: Matrix Completion via Nonconvex Factorization: Algorithms and Theory (2015-05) Sun, Ruoyu

Learning the low-rank structure of a data matrix is a powerful method for dealing with ``big data''. However, in many modern applications such as recommendation systems and sensor localization, it is impossible or too costly to obtain all of the data, resulting in a data matrix with most entries missing. A problem of great interest is then to infer the missing data from very few observations, usually under the assumption that the true data matrix is low rank; this is widely known as the matrix completion problem. The most popular solution for large-scale matrix completion is the matrix factorization (MF) formulation, which has achieved great empirical success. However, due to the nonconvexity introduced by the factorization model, little is known about whether and when algorithms for the MF formulation will produce a good solution. In this thesis, we study the nonconvex MF formulation for matrix completion, both algorithmically and theoretically.

First, we empirically analyze several standard algorithms for the MF formulation. We present a novel finding that the recovery ability of an algorithm mainly depends on whether it can control the row norms (or incoherence constants) of the iterates. Motivated by this finding, we propose a new formulation that adds either constraints or penalty-function-type regularizers to directly control the row norms. Simulation results show that algorithms for the new formulation can recover the matrix in a regime very close to the fundamental limit, outperforming the standard algorithms.

We then establish a theoretical guarantee that the new MF formulation correctly recovers the underlying low-rank matrix. In particular, we show that under conditions similar to those in previous works, many standard optimization algorithms converge to the global optima of the new MF formulation and recover the true low-rank matrix. This result is rather surprising, since we prove convergence to global optima for a nonconvex problem. To the best of our knowledge, our result is the first to provide an exact recovery guarantee for many standard algorithms. A major difference between our work and existing results is that we do not need resampling (i.e., using independent samples at each iteration) in either the algorithm or its analysis. Technically, we develop a novel perturbation analysis for matrix factorization which may be of independent interest.
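To make the MF formulation concrete, here is a minimal, hypothetical sketch (not the algorithms analyzed in the thesis): plain gradient descent on the squared error over observed entries, plus a squared-hinge penalty that keeps the row norms of the factors below a threshold, in the spirit of the regularized formulation described above. The function name, the threshold `alpha`, the penalty weight `lam`, the step size, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def mf_complete(M_obs, mask, r, alpha=2.0, lam=1.0, lr=0.01, iters=3000, seed=0):
    """Minimize sum over observed (i,j) of (M_ij - (U V^T)_ij)^2
    plus lam * sum_i max(||row_i|| - alpha, 0)^2 on the rows of U and V."""
    rng = np.random.default_rng(seed)
    n1, n2 = M_obs.shape
    U = rng.normal(scale=1.0 / np.sqrt(r), size=(n1, r))
    V = rng.normal(scale=1.0 / np.sqrt(r), size=(n2, r))
    for _ in range(iters):
        R = mask * (U @ V.T - M_obs)            # residual on observed entries only
        grad_U = 2 * R @ V
        grad_V = 2 * R.T @ U
        # Row-norm (incoherence) penalty: only rows whose norm exceeds alpha are pushed back.
        for X, g in ((U, grad_U), (V, grad_V)):
            norms = np.linalg.norm(X, axis=1, keepdims=True)
            excess = np.maximum(norms - alpha, 0.0)
            g += 2 * lam * excess * X / np.maximum(norms, 1e-12)
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

# Tiny usage example: recover a random rank-2 matrix from about 50% of its entries.
rng = np.random.default_rng(1)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
mask = (rng.random(M.shape) < 0.5).astype(float)
U, V = mf_complete(M * mask, mask, r=2)
print("relative error:", np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```

The penalty leaves rows with norm below the threshold untouched, so well-behaved (incoherent) iterates follow the unregularized dynamics while overly concentrated rows are pulled back, which is the row-norm control the abstract identifies as the key to recovery.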