Empirical Analysis of Optimization and Generalization of Deep Neural Networks

Deep neural networks (DNNs) have gained increasing attention and popularity in the past decade. This is mainly due to their tremendous success in numerous commercial, scientific, and societal tasks. Despite the success of DNNs in practice, several aspects of the optimization dynamics and generalization are still not well understood. In practice, DNNs are usually heavily over-parameterized with far more parameters than training samples, making them easier to memorize all the training examples without learning. In fact, Zhang et al. have shown that DNNs indeed can fit training data perfectly. Training DNNs also requires first-order optimization methods such as gradient descent (GD) and stochastic gradient descent (SGD) to solve a highly non-convex optimization problem. The fact that such heavily over-parameterized DNNs trained by simple GD/SGD are still able to learn and generalize well deeply puzzles the deep learning community. In this thesis, we explore the optimization dynamics and generalization behavior of over-parameterized DNNs trained by SGD from two unique directions. First, we focus on studying the topology of the loss landscape of those DNNs through the analysis of the Hessian of the training loss (with respect to the parameters). We empirically study the second-moment matrix $M_t$ constructed by the outer product of the stochastic gradients (SGs), as well as the Hessian of the loss $H_f(\theta_t)$. With the help of existing tools such as the Lanczos method and the R-operator, we can compute the eigenvalues and the corresponding eigenvectors of both the Hessian matrix and the second-moment matrix efficiently. This allows us to reveal the relationship between the Hessian of the loss and the second moment of SGs. Besides, we discover the ``low-rank'' structure in both the eigenvalue-spectrum of the Hessian and in the stochastic gradients themselves. Such observations directly lead to the development of a new PAC-bayes generalization bound which considers the structure of the Hessian at minima obtained from SGD, as well as a novel noisy truncated stochastic gradient descent (NT-SGD) algorithm, aiming to tackle the communication bottleneck in the large-scale distributed setting. Next, we dive into the debate on whether it is sufficient to explain the success of DNNs in practice by their behavior in the infinite-width limit. On one hand, there has been a rich literature understanding of the infinite width limit of DNNs. Such analysis simplifies the learning dynamics of very wide neural networks by a linear model obtained from the first-order Taylor expansion around its initial parameters. As a result, the DNN training occurs in a ``lazy'' regime. On the other hand, both theoretical and empirical evidence has been presented, pointing out the limitations of lazy training. Those results suggest that training DNNs with gradient descent actually occurs in a ``rich'' regime which captures much richer inductive biases and the behavior of such models cannot be fully described by their infinite-width kernel equivalence. As an empirical complement of the recent work studying the transition from the lazy regime to the rich regime, we study generalization and optimization behaviors of commonly used DNNs, focusing on varying the width and to some extent the depth, and show what happens in typical DNNs used in practice in the rich regime. We also extensively study the smallest eigenvalues of the Neural Tangent Kernel, a crucial element that appeared in many recently theoretical analyses related to both the training and generalization of DNNs. Hopefully, our empirical study could provide fodder for new theoretical advances on understanding generalization and optimization in both rich and lazy regimes.

Keywords

Deep learning

Generalization

Neural Tangent Kernel

Optimization

SGD

Description

University of Minnesota Ph.D. dissertation. March 2022. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); x, 163 pages.

Collections

Dissertations

Suggested citation

Li, Xinyan. (2022). Empirical Analysis of Optimization and Generalization of Deep Neural Networks. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/241425.

Content distributed via the University Digital Conservancy may be subject to additional license and use restrictions applied by the depositor. By using these files, users agree to the Terms of Use. Materials in the UDC may contain content that is disturbing and/or harmful. For more information, please see our statement on harmful content in digital repositories.

University Digital Conservancy

Empirical Analysis of Optimization and Generalization of Deep Neural Networks

View/Download File

Persistent link to this item

Statistics

Journal Title

Journal ISSN

Volume Title

Title

Alternative title

Authors

Published Date

Publisher

Type

Abstract

Keywords

Description

Related to

Replaces

License

Collections

Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Other identifiers

Suggested citation

University Digital Conservancy

University of Minnesota Twin Cities

Empirical Analysis of Optimization and Generalization of Deep Neural Networks

View/Download File

Persistent link to this item

Statistics

Journal Title

Journal ISSN

Volume Title

Title

Alternative title

Authors

Published Date

Publisher

Type

Abstract

Keywords

Description

Related to

Replaces

License

Collections

Series/Report Number

Funding information

Isbn identifier

Doi identifier

Previously Published Citation

Other identifiers

Suggested citation