https://arxiv.org/pdf/2502.05172 loss scaling law:

$$ L(N_{act}, D, \hat{E}) = 35.91\hat{E}^{0.2285}N_{act}^{0.1889-0.0098\ln(\hat{E})} + 35.98\hat{E}^{-0.5529}D^{0.1775+0.0259\ln(\hat{E})} + 1.3637 $$
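quick python sketch of this fit so i can plug numbers in — coefficients copied verbatim from the equation above, and i'm treating $\hat{E}$ as just an input since the paper defines how to compute it from the expert count:

```python
import math

def moe_loss(n_act: float, d: float, e_hat: float) -> float:
    """Evaluate the fitted loss L(N_act, D, E_hat) with the coefficients above.

    n_act : number of active parameters
    d     : number of training tokens
    e_hat : the paper's transformed expert count, passed through as-is
    """
    n_term = 35.91 * e_hat**0.2285 * n_act ** (0.1889 - 0.0098 * math.log(e_hat))
    d_term = 35.98 * e_hat**-0.5529 * d ** (0.1775 + 0.0259 * math.log(e_hat))
    return n_term + d_term + 1.3637
```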

learning rate scaling formula:

$$ LR(N_{act/e}, E) = \exp\left(8.39 - 0.81\ln(N_{act/e}) - 0.25\ln(E)\right) $$

where $N_{act}$ is the number of active params, $D$ is the number of training tokens, $E$ is the number of experts, and $\hat{E}$ is the paper's transformed expert count ($N_{act/e}$ in the LR formula i read as active params per expert, but double-check that against the paper)
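and the LR rule in python — i'm assuming $N_{act/e}$ is a raw parameter count here (not in millions), since plugging raw counts in gives LRs in a sensible range:

```python
import math

def moe_lr(n_act_per_expert: float, num_experts: int) -> float:
    """Peak learning rate from the fitted rule above.

    n_act_per_expert : N_act/e, assumed here to be a raw parameter count
    num_experts      : E, the number of experts
    """
    return math.exp(
        8.39 - 0.81 * math.log(n_act_per_expert) - 0.25 * math.log(num_experts)
    )

# e.g. ~1e8 active params per expert with 8 experts
print(moe_lr(1e8, 8))  # lands in the ~1e-3 ballpark
```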

i like this statement tbh

Screenshot 2025-02-13 at 2.54.36 PM.png

compute optimality

most of the compute-optimal allocation depends on the number of experts, not so much on the dataset size or model size on their own. i mean, they're proportional to each other, but to get the minimum loss under a compute constraint i think we can trade them off against each other
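a tiny sketch of that trade-off: fix a training FLOP budget, assume the usual $C \approx 6 N_{act} D$ approximation (my assumption — the paper's compute accounting may differ), and sweep $N_{act}$ with $D$ pinned by the budget, taking the argmin of the fitted loss:

```python
import numpy as np

def compute_optimal_split(loss_fn, e_hat: float, flops: float, n_grid=None):
    """Grid-search the compute-optimal (N_act, D) under C ~= 6 * N_act * D.

    loss_fn : callable (n_act, d, e_hat) -> loss, e.g. moe_loss from above
    e_hat   : transformed expert count, passed through to loss_fn
    flops   : total training compute budget C
    """
    if n_grid is None:
        n_grid = np.logspace(7, 11, 400)   # candidate active-param counts, 1e7..1e11
    d_grid = flops / (6.0 * n_grid)        # tokens implied by the budget at each N_act
    losses = np.array([loss_fn(n, d, e_hat) for n, d in zip(n_grid, d_grid)])
    i = int(losses.argmin())
    return n_grid[i], d_grid[i], losses[i]
```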

Screenshot 2025-02-13 at 3.07.24 PM.png

these were their training budgets and the tests they ran

here $N^{opt}_{act}$ is the optimal number of active params and $D^{opt}$ is the optimal number of tokens in the dataset

Screenshot 2025-02-13 at 3.09.05 PM.png

we can infer that as we increase the number of experts, the optimal active param count gets smaller (obviously), but we also have to increase the dataset size accordingly to keep the loss good. there must be some ratio for the active params too, let's read on (this is chinchilla-style functional form btw, so the calculations should be easier)
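to eyeball that, re-run the compute-optimal sweep from above at a few expert counts and look at $D^{opt}/N^{opt}_{act}$ — again treating $\hat{E}$ as the raw expert count here, which may not match the paper's exact definition:

```python
# rough sanity check: how the compute-optimal split moves with expert count
for e in (1, 8, 32, 64):
    n_opt, d_opt, _ = compute_optimal_split(moe_loss, e_hat=e, flops=1e21)
    print(f"E={e:3d}  N_act_opt={n_opt:.2e}  D_opt={d_opt:.2e}  D/N={d_opt / n_opt:.1f}")
```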

nanoMOE ftw