https://arxiv.org/pdf/2502.05172 loss scaling law:

$$ L(N_{act}, D, \hat{E}) = 35.91\hat{E}^{0.2285}N_{act}^{0.1889-0.0098\ln(\hat{E})} + 35.98\hat{E}^{-0.5529}D^{0.1775+0.0259\ln(\hat{E})} + 1.3637 $$
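quick python sketch of this fit so i can plug numbers in — coefficients copied verbatim from the equation above, and i'm treating $\hat{E}$ as just an input since the paper defines how to compute it from the expert count:

```python
import math

def moe_loss(n_act: float, d: float, e_hat: float) -> float:
    """Evaluate the fitted loss L(N_act, D, E_hat) with the coefficients above.

    n_act : number of active parameters
    d     : number of training tokens
    e_hat : the paper's transformed expert count, passed through as-is
    """
    n_term = 35.91 * e_hat**0.2285 * n_act ** (0.1889 - 0.0098 * math.log(e_hat))
    d_term = 35.98 * e_hat**-0.5529 * d ** (0.1775 + 0.0259 * math.log(e_hat))
    return n_term + d_term + 1.3637
```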

learning rate scaling formula:

$$ LR(N_{act/e}, E) = \exp\left(8.39 - 0.81\ln(N_{act/e}) - 0.25\ln(E)\right) $$

where $N_{act}$ is the number of active params, $D$ is the number of training tokens, $E$ is the number of experts, and $\hat{E}$ is the paper's transformed expert count ($N_{act/e}$ in the LR formula i read as active params per expert, but double-check that against the paper)
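and the LR rule in python — i'm assuming $N_{act/e}$ is a raw parameter count here (not in millions), since plugging raw counts in gives LRs in a sensible range:

```python
import math

def moe_lr(n_act_per_expert: float, num_experts: int) -> float:
    """Peak learning rate from the fitted rule above.

    n_act_per_expert : N_act/e, assumed here to be a raw parameter count
    num_experts      : E, the number of experts
    """
    return math.exp(
        8.39 - 0.81 * math.log(n_act_per_expert) - 0.25 * math.log(num_experts)
    )

# e.g. ~1e8 active params per expert with 8 experts
print(moe_lr(1e8, 8))  # lands in the ~1e-3 ballpark
```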

i like this statement tbh

Screenshot 2025-02-13 at 2.54.36 PM.png

compute optimality

most of the compute-optimal allocation depends on the number of experts, not so much on the dataset size or model size on their own. i mean, they're proportional to each other, but to get the minimum loss under a compute constraint i think we can trade them off against each other
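a tiny sketch of that trade-off: fix a training FLOP budget, assume the usual $C \approx 6 N_{act} D$ approximation (my assumption — the paper's compute accounting may differ), and sweep $N_{act}$ with $D$ pinned by the budget, taking the argmin of the fitted loss:

```python
import numpy as np

def compute_optimal_split(loss_fn, e_hat: float, flops: float, n_grid=None):
    """Grid-search the compute-optimal (N_act, D) under C ~= 6 * N_act * D.

    loss_fn : callable (n_act, d, e_hat) -> loss, e.g. moe_loss from above
    e_hat   : transformed expert count, passed through to loss_fn
    flops   : total training compute budget C
    """
    if n_grid is None:
        n_grid = np.logspace(7, 11, 400)   # candidate active-param counts, 1e7..1e11
    d_grid = flops / (6.0 * n_grid)        # tokens implied by the budget at each N_act
    losses = np.array([loss_fn(n, d, e_hat) for n, d in zip(n_grid, d_grid)])
    i = int(losses.argmin())
    return n_grid[i], d_grid[i], losses[i]
```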

Screenshot 2025-02-13 at 3.07.24 PM.png

these were their training budgets and the tests they ran

here $N^{opt}_{act}$ is the optimal number of active params and $D^{opt}$ is the optimal number of tokens in the dataset

Screenshot 2025-02-13 at 3.09.05 PM.png

we can infer that as we increase the number of experts, the optimal active param count gets smaller (obviously), but we also have to increase the dataset size accordingly to keep the loss good. there must be some ratio for the active params too, let's read on (this is chinchilla-style functional form btw, so the calculations should be easier)
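to eyeball that, re-run the compute-optimal sweep from above at a few expert counts and look at $D^{opt}/N^{opt}_{act}$ — again treating $\hat{E}$ as the raw expert count here, which may not match the paper's exact definition:

```python
# rough sanity check: how the compute-optimal split moves with expert count
for e in (1, 8, 32, 64):
    n_opt, d_opt, _ = compute_optimal_split(moe_loss, e_hat=e, flops=1e21)
    print(f"E={e:3d}  N_act_opt={n_opt:.2e}  D_opt={d_opt:.2e}  D/N={d_opt / n_opt:.1f}")
```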

nanoMOE ftw