A neural scaling law from the dimension of the data manifold

U Sharma, J Kaplan - arXiv preprint arXiv:2004.10802, 2020 - arxiv.org
When data is plentiful, the loss achieved by well-trained neural networks scales as a power law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
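The relation between the power law $L \propto N^{-\alpha}$ and the prediction $\alpha \approx 4/d$ can be illustrated with a small numerical sketch. The code below is not the authors' procedure: it fits $\alpha$ by ordinary least squares in log-log space on synthetic, noise-free losses generated for this illustration, and the names `fit_scaling_exponent`, `predicted_exponent`, `d_true`, and the prefactor `2.0` are assumptions made here, not values from the paper.

```python
import numpy as np


def fit_scaling_exponent(n_params, losses):
    """Fit log L = log c - alpha * log N by least squares; return (alpha, c)."""
    log_n = np.log(np.asarray(n_params, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return -float(slope), float(np.exp(intercept))


def predicted_exponent(d):
    """Exponent predicted by the abstract's relation alpha ~ 4/d."""
    return 4.0 / d


# Synthetic data (an assumption for illustration, not measurements from the paper):
# losses that follow an exact power law for a manifold of dimension d_true.
d_true = 8
n_params = np.array([1e3, 1e4, 1e5, 1e6])
losses = 2.0 * n_params ** (-predicted_exponent(d_true))

alpha_hat, c_hat = fit_scaling_exponent(n_params, losses)
d_implied = 4.0 / alpha_hat  # invert alpha ~ 4/d to get the implied dimension

print(f"fitted alpha = {alpha_hat:.3f}, implied d = {d_implied:.2f}")
```

With real measurements the fitted exponent would carry noise, and the fit would only be meaningful over the range of $N$ where the loss is limited by model size rather than by the amount of data.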