Wednesday, August 23, 2017
11:00 AM - 12:00 PM
Annenberg 105

Special Seminar in Computing and Mathematical Sciences

Modeling and Learning Deep Representations, in Theory and in Practice
Stefano Soatto, Professor of Computer Science, University of California, Los Angeles

A few things about Deep Learning I find puzzling: 1) How can deep neural networks, optimized by stochastic gradient descent (SGD) with no explicit notion of invariance, minimality, or disentanglement, nonetheless learn representations that exhibit those traits? 2) How do these networks, which can overfit random labels, nonetheless manage to generalize? 3) How can the "flatness" of minima of the training loss be related to generalization, when flatness is coordinate-dependent?
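
To make puzzle 3) concrete, here is a minimal numerical sketch (my own illustration, not material from the talk): a one-hidden-unit ReLU model computes exactly the same function under the rescaling (w1, w2) -> (a*w1, w2/a), so the training loss is unchanged, yet the Hessian of the loss, the usual proxy for flatness, is not.

    # Illustration (not from the talk) of why "flatness" is coordinate-dependent:
    # f(x) = w2 * relu(w1 * x) is unchanged under (w1, w2) -> (a*w1, w2/a),
    # but the Hessian of the loss at the rescaled point changes.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 1.0, size=50)          # positive inputs keep the ReLU active
    y = 2.0 * x + 0.05 * rng.normal(size=50)    # noisy linear targets

    def loss(w):
        w1, w2 = w
        pred = w2 * np.maximum(w1 * x, 0.0)
        return np.mean((pred - y) ** 2)

    def hessian(w, eps=1e-4):
        """Central-difference estimate of the 2x2 Hessian of the loss."""
        H = np.zeros((2, 2))
        for i in range(2):
            for j in range(2):
                e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
                H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                           - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps ** 2)
        return H

    w = np.array([1.0, 2.0])                    # a (near-)minimizer: f(x) = 2x
    for a in [1.0, 10.0]:
        w_a = np.array([a * w[0], w[1] / a])    # same function, different coordinates
        eigs = np.linalg.eigvalsh(hessian(w_a))
        print(f"a={a:5.1f}  loss={loss(w_a):.4f}  Hessian eigenvalues={eigs}")
    # The loss is identical for both values of a, but the curvature is not.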

To tackle these questions I will 1) describe a tight bound between the amount of information in the weights of the network and the total correlation (a measure of disentanglement), minimality, and invariance of the resulting representation; 2) show that if complexity is measured by the information in the parameters (not their number), deep networks follow the bias-variance tradeoff faithfully, and there is no need to "rethink" generalization; 3) show that the nuclear norm of the Hessian (a measure of flatness) bounds the information in the weights, which is the regularizer that guarantees the representation to be minimal, sufficient, invariant, and maximally disentangled. The resulting information-theoretic framework predicts a sharp phase transition between overfitting and underfitting for random labels, and quantifies, to within a fraction of a nat, the amount of information needed to overfit a given dataset with a given network. The theory has connections with variational Bayesian inference, the Information Bottleneck principle, and PAC-Bayes bounds.
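
To give one concrete reading of "information in the parameters", here is a hedged sketch in the variational-Bayesian spirit mentioned above (my own toy construction, not the speaker's code; the factorized Gaussian posterior and prior, the beta weight, and the linear-regression data are illustrative assumptions): each weight gets a Gaussian "posterior" N(mu, sigma^2) and a standard normal prior, and the objective is an expected data loss plus beta times the KL divergence between the two, measured in nats.

    # Hedged sketch (my own, not the speaker's code) of an "information in the weights"
    # regularizer in the variational-Bayes style:
    #     E_{w ~ q}[data loss] + beta * KL(q || p),
    # with q = N(mu, sigma^2) per weight and p = N(0, 1). The KL term plays the role
    # of the number of nats the weights store about the training data.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.5, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=100)

    def kl_gauss(mu, log_sigma):
        """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over weights (in nats)."""
        sigma2 = np.exp(2 * log_sigma)
        return 0.5 * np.sum(sigma2 + mu ** 2 - 1.0 - 2 * log_sigma)

    def objective(mu, log_sigma, beta=1e-2, n_samples=8):
        """Monte-Carlo estimate of the regularized loss via the reparameterization trick."""
        data_loss = 0.0
        for _ in range(n_samples):
            w = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
            data_loss += np.mean((X @ w - y) ** 2)
        return data_loss / n_samples + beta * kl_gauss(mu, log_sigma)

    # Compare a low-information posterior (broad, centered at zero) with a
    # high-information one (narrow, centered near the true weights).
    print("uninformative:", objective(np.zeros(3), np.zeros(3)))
    print("informative:  ", objective(w_true.copy(), np.full(3, -3.0)))
    # The second setting stores more nats in the weights (larger KL) but fits far
    # better; beta trades the two off, which is where the bias-variance story enters.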

Once a regularized loss function is in place, learning a representation amounts to solving a high-dimensional, non-convex optimization problem. In the second part of the talk, I will highlight some peculiarities of the geometry of the loss surface and describe Entropy-SGD, an algorithm designed to exploit them using insights from statistical physics. As it turns out, Entropy-SGD computes the solution of a viscous Hamilton-Jacobi partial differential equation (PDE), which leads to a stochastic optimal control version of SGD that is faster than the vanilla version. In the non-viscous limit, the PDE leads to the classical proximal point iteration via the Hopf-Lax formula. The analysis establishes connections between statistical physics, non-convex optimization, and the theory of PDEs. Moreover, Entropy-SGD includes as special cases distributed algorithms popular in Deep Learning, e.g., Elastic-SGD, which it outperforms, achieving state-of-the-art generalization error with optimal convergence rates and no extra hyper-parameters to tune.
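
As a rough illustration of the Entropy-SGD loop (my own one-dimensional toy, not the talk's implementation; the test function, step sizes, coupling gamma, and noise level are arbitrary choices): an inner Langevin chain, coupled to the current iterate, estimates the gradient of the local entropy as gamma times the gap between the iterate and the chain's running mean, and the outer step moves the iterate along that direction.

    # Toy numpy sketch of the Entropy-SGD update (my simplification; settings are
    # illustrative, not those discussed in the talk). The inner loop runs Langevin
    # dynamics coupled to the current iterate x; the outer loop moves x toward the
    # running mean of that inner chain.
    import numpy as np

    rng = np.random.default_rng(2)

    def f(x):       # a non-convex toy loss with several local minima
        return 0.05 * x**2 + np.sin(3 * x)

    def grad_f(x, eps=1e-5):
        return (f(x + eps) - f(x - eps)) / (2 * eps)

    def entropy_sgd(x0, outer_steps=200, inner_steps=20,
                    gamma=1.0, eta_inner=0.05, eta_outer=0.1, noise=0.1):
        x = x0
        for _ in range(outer_steps):
            xp, mu = x, x                         # inner chain and its running mean
            for k in range(1, inner_steps + 1):
                g = grad_f(xp) + gamma * (xp - x)                # loss + coupling to x
                xp = xp - eta_inner * g + noise * np.sqrt(eta_inner) * rng.normal()
                mu = (1 - 1 / k) * mu + (1 / k) * xp             # running average
            x = x - eta_outer * gamma * (x - mu)  # local-entropy gradient ~ gamma*(x - mu)
        return x

    x_star = entropy_sgd(x0=0.5)
    print(f"Entropy-SGD settles at x = {x_star:.3f}, f(x) = {f(x_star):.3f}")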

------

Joint work with Alessandro Achille, Pratik Chaudhari (UCLA Vision Lab), Adam Oberman (Montreal), Riccardo Zecchina, Carlo Baldassi (Milano), Anna Choromanska, Yann LeCun (NYU & FB), Stanley Osher (UCLA).

For more information, please contact Monica Nolasco by phone at 395.4140 or by email at [email protected].