https://arxiv.org/pdf/2502.05172 loss scaling law:
$$ L(N_{act}, D, \hat{E}) = 35.91\hat{E}^{0.2285}N_{act}^{-0.1889-0.0098\ln(\hat{E})} + 35.98\hat{E}^{-0.5529}D^{-0.1775+0.0259\ln(\hat{E})} + 1.3637 $$
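A quick Python sketch to plug numbers into this law. The coefficients are as transcribed above, the leading minus signs on the $N_{act}$ and $D$ exponents are my reading of the Chinchilla-style form, and the example inputs are made up, so treat the output as illustrative only:

```python
import math

def moe_loss(n_act: float, d: float, e_hat: float) -> float:
    """Evaluate the fitted loss law above for n_act active params,
    d training tokens and expert count / expansion factor e_hat."""
    # Active-parameter term: coefficient and exponent both depend on ln(E_hat).
    n_term = 35.91 * e_hat**0.2285 * n_act ** (-0.1889 - 0.0098 * math.log(e_hat))
    # Data term: coefficient decays with E_hat; exponent is -0.1775 + 0.0259*ln(E_hat).
    d_term = 35.98 * e_hat**-0.5529 * d ** (-0.1775 + 0.0259 * math.log(e_hat))
    # Irreducible loss.
    return n_term + d_term + 1.3637

# made-up example: ~1B active params, 100B tokens, 8 experts
print(moe_loss(1e9, 100e9, 8))
```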
learning rate scaling formula:
$$ LR(N_{act/e}, E) = \exp\left(8.39 - 0.81\ln(N_{act/e}) - 0.25\ln(E)\right) $$
where $N_{act/e}$ is the number of active parameters per expert and $E$ is the number of experts.
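Same thing as a tiny helper (I'm reading $N_{act/e}$ as active params per expert; the example config below is made up):

```python
import math

def moe_lr(n_act_per_expert: float, num_experts: int) -> float:
    """LR from the fitted rule: exp(8.39 - 0.81*ln(N_act/e) - 0.25*ln(E))."""
    return math.exp(8.39 - 0.81 * math.log(n_act_per_expert) - 0.25 * math.log(num_experts))

# made-up config: ~50M active params per expert, 8 experts -> LR on the order of 1e-3
print(moe_lr(50e6, 8))
```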
i like this statement tbh: most of the compute optimality depends on the number of experts, not really on the dataset size or model size individually. they scale together, but to get the minimum loss under a compute constraint i think we can trade them off against each other (see the sketch below).
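A rough sketch of that tradeoff, assuming the usual dense FLOP approximation $C \approx 6 N_{act} D$ (that approximation, the sweep grid, and the example budget/expert count are my own choices, not from the paper):

```python
import math

def moe_loss(n_act, d, e_hat):
    """Fitted loss law from above, repeated so this snippet runs on its own."""
    return (35.91 * e_hat**0.2285 * n_act ** (-0.1889 - 0.0098 * math.log(e_hat))
            + 35.98 * e_hat**-0.5529 * d ** (-0.1775 + 0.0259 * math.log(e_hat))
            + 1.3637)

def best_split(flops, e_hat):
    """Sweep N_act on a log grid, set D = C / (6 * N_act), and keep the
    split with the lowest predicted loss under the fixed budget."""
    best = None
    for i in range(161):
        n_act = 10 ** (7 + i * 5 / 160)   # 1e7 .. 1e12 active params
        d = flops / (6 * n_act)           # tokens implied by the fixed budget
        loss = moe_loss(n_act, d, e_hat)
        if best is None or loss < best[2]:
            best = (n_act, d, loss)
    return best

# made-up budget of 1e21 FLOPs with 8 experts
n_opt, d_opt, l_opt = best_split(1e21, 8)
print(f"N_act* ~ {n_opt:.2e}, D* ~ {d_opt:.2e}, predicted loss ~ {l_opt:.3f}")
```

The curve is U-shaped: too few active params blows up the first term, too few tokens blows up the second, and the minimum sits where the two marginal effects balance.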
these were their training budgets and the tests that they did
here $N^{opt}_{act}$ is the optimal number of active params and $D^{opt}$ is the optimal number of tokens in the dataset
we can infer that as we increase the number of experts, the optimal active param count gets smaller (obviously), but we also have to increase the dataset size accordingly to keep the loss good. there must be some ratio for the active params too, let's read on (this is chinchilla-functional btw, so the compute-optimal calculations should be easier; see the sketch below).
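Since it's the Chinchilla functional form $L = A N^{-\alpha} + B D^{-\beta} + c$, the compute-optimal split under $C \approx 6ND$ has a closed form. A small sketch of that calculation (the $6ND$ approximation and the example at $\hat{E}=1$, where the $\ln(\hat{E})$ terms vanish, are my own illustration):

```python
def chinchilla_optimum(A, alpha, B, beta, flops):
    """For L = A*N^-alpha + B*D^-beta + c with C ~= 6*N*D, substituting
    D = C/(6N) and setting dL/dN = 0 gives
        N* = (alpha*A / (beta*B))**(1/(alpha+beta)) * (C/6)**(beta/(alpha+beta))
    and D* = (C/6) / N*."""
    c6 = flops / 6
    n_opt = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) * c6 ** (beta / (alpha + beta))
    return n_opt, c6 / n_opt

# E_hat = 1: the ln(E_hat) terms drop out, leaving A=35.91, alpha=0.1889, B=35.98, beta=0.1775
n_opt, d_opt = chinchilla_optimum(35.91, 0.1889, 35.98, 0.1775, 1e21)
print(f"N_act* ~ {n_opt:.2e}, D* ~ {d_opt:.2e}, ratio D*/N* ~ {d_opt / n_opt:.1f}")
```

As $\hat{E}$ changes, the effective $A$, $\alpha$, $B$, $\beta$ read off the fitted law shift with $\ln(\hat{E})$, which is where the expert-count dependence of the optimal active-param/token ratio comes from.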
nanoMOE ftw