Pros:
Pretraining is faster, and inference is faster than a dense model with the same total parameter count.

Cons:
High VRAM requirements, since all experts have to be loaded in memory.
Fine-tuning has historically been difficult → instruction-tuning looks promising.
 
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
 
Mixture of Experts (MoE) enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, an MoE model should reach the same quality as its dense counterpart much faster during pretraining.
 
A single FFN (feed-forward network) layer → sparse MoE layers (multiple networks, each of which is an FFN expert) → possibly hierarchical MoEs.
 
Gate network or router → decides which token is sent to which expert (a token can go to more than one).
The router is trained jointly with the rest of the network during pretraining.
 
Challenges:
  1. Overfitting during fine-tuning; the models struggled to generalize.
  2. Higher VRAM: Mixtral 8x7B has about 47B parameters, not 8 × 7B = 56B, because only the FFN layers are per-expert while everything else is shared. And since each token only passes through a couple of experts rather than all of them, inference is faster than a dense model of that total size.
 
 
A learned gating network G decides which expert(s) E each token goes to, which leads to uneven batch sizes across experts.
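In symbols, the standard sparse-MoE formulation (written here with a plain softmax gate for simplicity; the noisy top-k variant below refines G):

```latex
y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) = \mathrm{Softmax}(x \cdot W_g)
```

Once G is made sparse (top-k below), experts with G(x)_i = 0 are simply not computed, which is where the compute savings come from.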
 
 
Noisy Top-k gating
 
 
 
For expert i: H(x)_i = (x · W_g)_i + StandardNormal() · Softplus((x · W_noise)_i), where W_g are the parameters learned by the gate and W_noise scales the added noise. The gate then keeps the top-k entries and normalizes them: G(x) = Softmax(KeepTopK(H(x), k)).
 
Why top-k instead of just the single highest-scoring expert? Routing each token to more than one expert gives the gate a comparative learning signal and teaches it to route different tokens to different experts.
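A minimal PyTorch sketch of noisy top-k gating plus the expert combination above (class and variable names are mine, not from any particular library; no load balancing or capacity handling yet):

```python
import torch
import torch.nn.functional as F
from torch import nn

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating (Shazeer et al., 2017), minimal version."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.w_gate(x)                                      # (x · W_g)
        if self.training:
            # Noise scaled by Softplus(x · W_noise), added during training only
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))

        # KeepTopK: keep the k largest logits per token, set the rest to -inf
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        sparse_logits = torch.full_like(logits, float("-inf"))
        sparse_logits.scatter_(-1, topk_idx, topk_vals)

        # Softmax over the kept logits -> sparse gate weights G(x)
        return F.softmax(sparse_logits, dim=-1)                      # (num_tokens, num_experts)

class SparseMoE(nn.Module):
    """y = sum_i G(x)_i * E_i(x), computing only the experts each token was routed to."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = NoisyTopKGate(d_model, num_experts, k)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten (batch, seq, d_model) beforehand
        g = self.gate(x)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = g[:, i] > 0                     # tokens whose top-k included expert i
            if routed.any():
                y[routed] += g[routed, i].unsqueeze(-1) * expert(x[routed])
        return y
```

The per-expert Python loop is only for readability; real implementations batch the dispatch, cap each expert at its capacity, and add the balancing losses discussed below.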
 
We can't let the model rely on only a few popular experts → add an auxiliary load-balancing loss → all experts receive roughly equal training.
This also includes setting a threshold (expert capacity) on how many tokens a single expert can take.
 
Expert capacity: if both chosen experts are already at capacity, the token overflows and is either passed straight to the next layer through the residual connection or dropped entirely. Important!
 
 
Parallel computation: the MoE FFN experts can be distributed across machines and run in parallel.
 
 
 
 
MoEs struggled with training instability and fine-tuning → Switch Transformers.
 
2 tokens + 4 experts → each token is routed to a single expert (top-1 routing).
 
Increasing the capacity will lead to more expensive inter-device communication, so it’s a trade-off to keep in mind. In particular, Switch Transformers perform well at low capacity factors (1-1.25)
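For reference, the capacity itself is usually defined as in the Switch Transformers paper:

```latex
\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}
```

For example, with 4,096 tokens per batch, 8 experts, and a capacity factor of 1.25, each expert processes at most 640 tokens per batch; anything beyond that overflows as described above.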
 
Switch Transformers also include an auxiliary load-balancing loss, added to the total model loss, to encourage uniform routing across experts.
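A sketch of this load-balancing loss in the Switch Transformer form, N · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: (num_tokens, num_experts) softmax output of the router
    expert_index: (num_tokens,) index of the expert each token was dispatched to
    """
    num_experts = router_probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i (hard counts)
    dispatch_mask = F.one_hot(expert_index, num_experts).float()
    f = dispatch_mask.mean(dim=0)
    # P_i: mean router probability assigned to expert i (soft assignment)
    p = router_probs.mean(dim=0)
    # Minimized when both distributions are uniform; scaled by a small
    # coefficient (on the order of 1e-2) before being added to the total loss.
    return num_experts * torch.sum(f * p)
```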
 
Other MoE work has used top-2 routing and much larger capacity factors, and has also explored the capacity factor as a knob one can change between training and evaluation depending on how much compute one wants to spend.
 
Router z-loss, introduced in ST-MoE, significantly improves training stability without quality degradation by penalizing large logits entering the gating network. Since this loss encourages absolute magnitude of values to be smaller, roundoff errors are reduced, which can be quite impactful for exponential functions such as the gating.
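A sketch of that z-loss, L_z = (1/B) Σ_i (log Σ_j exp(x_ij))², applied to the router logits and added to the total loss with a small coefficient (on the order of 1e-3 in ST-MoE):

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE router z-loss: mean squared logsumexp of the router logits.

    Penalizing large logits keeps the exponentials inside the gating softmax
    numerically well-behaved (fewer roundoff issues in low precision).
    router_logits: (num_tokens, num_experts)
    """
    z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    return (z ** 2).mean()
```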
 
ST-MoE: Designing Stable and Transferable Sparse Expert Models https://arxiv.org/abs/2202.08906
 
Experts specialize in the encoder (less so in the decoder).
No specialization by language in a multilingual setting.
 
Adding more experts gives diminishing returns in the efficiency boost.
 
Higher probability of overfitting, so use higher dropout and stronger regularization for the expert layers.
Token dropping can also act as regularization, so lowering the capacity factor to drop overflow tokens outright may actually help.
 
MoEs perform better at knowledge-heavy tasks such as QA; do they also do better on larger fine-tuning tasks?
 
 
One could experiment with freezing all non-expert weights. That is, we'll only update the MoE layers. This leads to a huge performance drop. We could try the opposite: freezing only the parameters in MoE layers, which worked almost as well as updating all parameters. This can help speed up and reduce memory for fine-tuning. This can be somewhat counter-intuitive as 80% of the parameters are in the MoE layers (in the ST-MoE project). Their hypothesis for that architecture is that, as expert layers only occur every 1/4 layers, and each token sees at most two experts per layer, updating the MoE parameters affects much fewer layers than updating other parameters.
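A sketch of the second setup, freezing only the MoE parameters and fine-tuning everything else; the substring match on parameter names is a hypothetical convention, so inspect model.named_parameters() for the real naming:

```python
from torch import nn

def freeze_moe_params(model: nn.Module, moe_keyword: str = "expert") -> None:
    """Freeze parameters whose names look like MoE/expert weights; train the rest.

    NOTE: `moe_keyword` is a guess at a naming convention, not a real API;
    check model.named_parameters() to find the actual expert module names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = moe_keyword not in name

# Then hand only the trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```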
 
 
MoEs Meets Instruction Tuning (July 2023) runs experiments with:
  • Single task fine-tuning
  • Multi-task instruction-tuning
  • Multi-task instruction-tuning followed by single-task fine-tuning
 
MoEs might benefit much more from instruction tuning than dense models
 
Let’s do a brief review of parallelism:
  • Data parallelism: the same weights are replicated across all cores, and the data is partitioned across cores.
  • Model parallelism: the model is partitioned across cores, and the data is replicated across cores.
  • Model and data parallelism: we can partition the model and the data across cores. Note that different cores process different batches of data.
  • Expert parallelism: experts are placed on different workers. If combined with data parallelism, each core holds a different expert and the data is partitioned across all cores (see the sketch after this list).
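A toy single-process simulation of the expert-parallel data flow (the routing is random here just to show how tokens move between cores; in a real system the exchange is an all-to-all collective and the results are sent back the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
num_cores = 4                      # one expert per core (expert parallelism)
d_model, tokens_per_core = 8, 6

# Data parallelism: each core holds its own shard of the token batch ...
shards = [rng.normal(size=(tokens_per_core, d_model)) for _ in range(num_cores)]
# ... expert parallelism: each core holds the weights of exactly one expert.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_cores)]

# A router (random here, standing in for a learned gate) picks an expert per token.
routes = [rng.integers(0, num_cores, size=tokens_per_core) for _ in range(num_cores)]

# "All-to-all": every core sends each of its tokens to the core owning that expert.
received = [
    np.concatenate([shards[src][routes[src] == dst] for src in range(num_cores)])
    for dst in range(num_cores)
]

# Each core applies its own expert only to the tokens it received.
outputs = [tokens @ expert_weights[core] for core, tokens in enumerate(received)]

for core, tokens in enumerate(received):
    print(f"core {core} processed {len(tokens)} tokens with its expert")
```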
 