Author: Li, Xinyan
Date available: 2022-08-29
Date issued: 2022-03
URI: https://hdl.handle.net/11299/241425
Description: University of Minnesota Ph.D. dissertation. March 2022. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); x, 163 pages.

Abstract:
Deep neural networks (DNNs) have gained increasing attention and popularity in the past decade, mainly due to their tremendous success in numerous commercial, scientific, and societal tasks. Despite this success in practice, several aspects of their optimization dynamics and generalization are still not well understood. DNNs are usually heavily over-parameterized, with far more parameters than training samples, which makes it easy for them to memorize all the training examples without learning; indeed, Zhang et al. have shown that DNNs can fit the training data perfectly. Training DNNs also requires first-order optimization methods such as gradient descent (GD) and stochastic gradient descent (SGD) to solve a highly non-convex optimization problem. The fact that such heavily over-parameterized DNNs trained by simple GD/SGD are still able to learn and generalize well deeply puzzles the deep learning community. In this thesis, we explore the optimization dynamics and generalization behavior of over-parameterized DNNs trained by SGD from two complementary directions.

First, we study the topology of the loss landscape of these DNNs through the Hessian of the training loss with respect to the parameters. We empirically study the second-moment matrix $M_t$ constructed from the outer products of the stochastic gradients (SGs), as well as the Hessian of the loss $H_f(\theta_t)$. With the help of existing tools such as the Lanczos method and the R-operator, we can efficiently compute the eigenvalues and corresponding eigenvectors of both the Hessian and the second-moment matrix, which allows us to reveal the relationship between the Hessian of the loss and the second moment of the SGs. In addition, we discover a ``low-rank'' structure both in the eigenvalue spectrum of the Hessian and in the stochastic gradients themselves. These observations directly lead to a new PAC-Bayes generalization bound that accounts for the structure of the Hessian at minima obtained by SGD, as well as a novel noisy truncated stochastic gradient descent (NT-SGD) algorithm designed to tackle the communication bottleneck in large-scale distributed settings.

Next, we turn to the debate on whether the behavior of DNNs in the infinite-width limit is sufficient to explain their success in practice. On one hand, there is a rich literature on understanding the infinite-width limit of DNNs. Such analyses approximate the learning dynamics of very wide neural networks by a linear model obtained from a first-order Taylor expansion around the initial parameters, so that training occurs in a ``lazy'' regime. On the other hand, both theoretical and empirical evidence has been presented pointing out the limitations of lazy training. Those results suggest that training DNNs with gradient descent actually occurs in a ``rich'' regime that captures much richer inductive biases, and that the behavior of such models cannot be fully described by their infinite-width kernel equivalents.
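The following is a minimal, self-contained sketch (not code from the dissertation) of the kind of Hessian and second-moment analysis described in the abstract above: Hessian-vector products are formed with the R-operator (forward-over-reverse autodiff in JAX), the extreme Hessian eigenvalues are obtained by passing those products to a Lanczos solver (here scipy's eigsh), and the gradient second-moment matrix is assembled from per-example gradients to exhibit its low-rank structure. The toy model, data, and sizes are assumptions made purely for illustration.

```python
# Illustrative sketch only: Hessian eigenvalues via R-operator + Lanczos,
# and the stochastic-gradient second-moment matrix. Toy model and data.
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
n, d_in, d_hid = 128, 10, 32          # n training points, tiny one-hidden-layer MLP
p = d_in * d_hid + d_hid              # total number of parameters (flattened)
X = jax.random.normal(k1, (n, d_in))
y = jax.random.normal(k2, (n,))
theta = 0.1 * jax.random.normal(k3, (p,))

def predict(t, x):
    W = t[: d_in * d_hid].reshape(d_in, d_hid)
    w = t[d_in * d_hid:]
    return jnp.tanh(x @ W) @ w

def loss(t):
    return 0.5 * jnp.mean((predict(t, X) - y) ** 2)

grad_fn = jax.grad(loss)

@jax.jit
def hvp(v):
    # R-operator: directional derivative of the gradient, i.e. H(theta) @ v,
    # without ever materializing the p x p Hessian.
    return jax.jvp(grad_fn, (theta,), (v,))[1]

# scipy's eigsh runs (implicitly restarted) Lanczos on the implicit operator.
H_op = LinearOperator((p, p), matvec=lambda v: np.asarray(hvp(jnp.asarray(v).ravel())))
top = eigsh(H_op, k=5, which="LA", return_eigenvectors=False)
print("largest Hessian eigenvalues:", np.sort(top)[::-1])

# Second-moment matrix of per-example gradients, M = (1/n) * sum_i g_i g_i^T.
per_example = jax.vmap(jax.grad(lambda t, x, yi: 0.5 * (predict(t, x) - yi) ** 2),
                       in_axes=(None, 0, 0))
G = np.asarray(per_example(theta, X, y))     # (n, p) stacked per-example gradients
M = G.T @ G / n                              # rank at most n, with n < p
evals_M = np.linalg.eigvalsh(M)
print("numerically nonzero eigenvalues of M:",
      int((evals_M > 1e-6 * evals_M.max()).sum()), "out of", p)
```

Because M is built from only n outer products, at most n of its p eigenvalues are nonzero, which is one simple way the low-rank structure mentioned above can show up in such an experiment.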
As an empirical complement to recent work studying the transition from the lazy regime to the rich regime, we study the generalization and optimization behavior of commonly used DNNs, focusing on varying the width and, to some extent, the depth, and show what happens in typical DNNs used in practice in the rich regime. We also extensively study the smallest eigenvalues of the Neural Tangent Kernel, a crucial quantity in many recent theoretical analyses of both the training and the generalization of DNNs. We hope our empirical study provides fodder for new theoretical advances in understanding generalization and optimization in both the rich and the lazy regime.

Language: en
Keywords: Deep learning; Generalization; Neural Tangent Kernel; Optimization; SGD
Title: Empirical Analysis of Optimization and Generalization of Deep Neural Networks
Type: Thesis or Dissertation
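The smallest NTK eigenvalue studied in the abstract can likewise be illustrated with a short, hedged sketch (again a toy network and random data, not the models or code from the dissertation): the empirical NTK Gram matrix K_ij = <grad_theta f(x_i), grad_theta f(x_j)> is built from per-example parameter gradients, and its extreme eigenvalues are read off directly.

```python
# Illustrative sketch only: empirical NTK Gram matrix and its smallest eigenvalue.
import jax
import jax.numpy as jnp
import numpy as np

k1, k2 = jax.random.split(jax.random.PRNGKey(1))
n, d_in, d_hid = 64, 10, 256
p = d_in * d_hid + d_hid
X = jax.random.normal(k1, (n, d_in))
theta = 0.1 * jax.random.normal(k2, (p,))   # arbitrary initialization, for illustration

def f(t, x):
    W = t[: d_in * d_hid].reshape(d_in, d_hid)
    w = t[d_in * d_hid:]
    return jnp.tanh(x @ W) @ w

# Per-example parameter gradients form the rows of the (n x p) Jacobian.
J = jax.vmap(jax.grad(f), in_axes=(None, 0))(theta, X)
K = np.asarray(J @ J.T)                     # empirical NTK Gram matrix, (n x n)
lam = np.linalg.eigvalsh(K)                 # ascending order
print("smallest / largest NTK eigenvalue:", lam[0], lam[-1])
```

Running this kind of computation while sweeping the width d_hid is one simple way to probe how the smallest NTK eigenvalue behaves as the network moves between the lazy and rich regimes discussed in the abstract.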