Pros:
Pretraining is faster, and inference is faster than a dense model with the same total parameter count.

Cons:
High VRAM requirements, since all experts have to be loaded in memory.
Fine-tuning has historically been difficult → instruction-tuning looks promising.
 
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
 
Mixture of Experts (MoE) enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, an MoE model should reach the same quality as its dense counterpart much faster during pretraining.
 
A single FFN (feed-forward network) layer → sparse MoE layers (multiple networks, each of which is an FFN expert) → possibly hierarchical MoEs.
 
Gate network or router → decides which token is sent to which expert (a token can go to more than one).
The router is trained jointly with the rest of the network during pretraining.
 
Challenges:
  1. Overfitting during fine-tuning; the models struggled to generalize.
  2. Higher VRAM: Mixtral 8x7B has about 47B parameters, not 8 × 7B = 56B, because only the FFN layers are per-expert while everything else is shared. And since each token only passes through a couple of experts rather than all of them, inference is faster than a dense model of that total size.
 
 
A learned gating network G decides which expert(s) E each token goes to, which leads to uneven batch sizes across experts.
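In symbols, the standard sparse-MoE formulation (written here with a plain softmax gate for simplicity; the noisy top-k variant below refines G):

```latex
y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) = \mathrm{Softmax}(x \cdot W_g)
```

Once G is made sparse (top-k below), experts with G(x)_i = 0 are simply not computed, which is where the compute savings come from.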
 
 
Noisy Top-k gating
 
 
 
For expert i: H(x)_i = (x · W_g)_i + StandardNormal() · Softplus((x · W_noise)_i), where W_g are the parameters learned by the gate and W_noise scales the added noise. The gate then keeps the top-k entries and normalizes them: G(x) = Softmax(KeepTopK(H(x), k)).
 
Why top-k instead of just the single highest-scoring expert? Routing each token to more than one expert gives the gate a comparative learning signal and teaches it to route different tokens to different experts.
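A minimal PyTorch sketch of noisy top-k gating plus the expert combination above (class and variable names are mine, not from any particular library; no load balancing or capacity handling yet):

```python
import torch
import torch.nn.functional as F
from torch import nn

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating (Shazeer et al., 2017), minimal version."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.w_gate(x)                                      # (x · W_g)
        if self.training:
            # Noise scaled by Softplus(x · W_noise), added during training only
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))

        # KeepTopK: keep the k largest logits per token, set the rest to -inf
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        sparse_logits = torch.full_like(logits, float("-inf"))
        sparse_logits.scatter_(-1, topk_idx, topk_vals)

        # Softmax over the kept logits -> sparse gate weights G(x)
        return F.softmax(sparse_logits, dim=-1)                      # (num_tokens, num_experts)

class SparseMoE(nn.Module):
    """y = sum_i G(x)_i * E_i(x), computing only the experts each token was routed to."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = NoisyTopKGate(d_model, num_experts, k)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten (batch, seq, d_model) beforehand
        g = self.gate(x)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            routed = g[:, i] > 0                     # tokens whose top-k included expert i
            if routed.any():
                y[routed] += g[routed, i].unsqueeze(-1) * expert(x[routed])
        return y
```

The per-expert Python loop is only for readability; real implementations batch the dispatch, cap each expert at its capacity, and add the balancing losses discussed below.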
 
We can't let the model rely on only a few popular experts → add an auxiliary load-balancing loss → all experts receive roughly equal training.
This also includes setting a threshold (expert capacity) on how many tokens a single expert can take.
 
Expert capacity: if both chosen experts are already at capacity, the token overflows and is either passed straight to the next layer through the residual connection or dropped entirely. Important!
 
 
Parallel computation: the MoE FFN experts can be distributed across machines and run in parallel.
 
 
 
 
MoEs struggled with training instability and fine-tuning → Switch Transformers.
 
2 tokens + 4 experts → each token is routed to a single expert (top-1 routing).
 
Increasing the capacity will lead to more expensive inter-device communication, so it’s a trade-off to keep in mind. In particular, Switch Transformers perform well at low capacity factors (1-1.25)
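For reference, the capacity itself is usually defined as in the Switch Transformers paper:

```latex
\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}
```

For example, with 4,096 tokens per batch, 8 experts, and a capacity factor of 1.25, each expert processes at most 640 tokens per batch; anything beyond that overflows as described above.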
 
Switch Transformers also include an auxiliary load-balancing loss, added to the total model loss, to encourage uniform routing across experts.
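A sketch of this load-balancing loss in the Switch Transformer form, N · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: (num_tokens, num_experts) softmax output of the router
    expert_index: (num_tokens,) index of the expert each token was dispatched to
    """
    num_experts = router_probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i (hard counts)
    dispatch_mask = F.one_hot(expert_index, num_experts).float()
    f = dispatch_mask.mean(dim=0)
    # P_i: mean router probability assigned to expert i (soft assignment)
    p = router_probs.mean(dim=0)
    # Minimized when both distributions are uniform; scaled by a small
    # coefficient (on the order of 1e-2) before being added to the total loss.
    return num_experts * torch.sum(f * p)
```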
 
Other MoE work has used top-2 routing and much larger capacity factors, and has also explored the capacity factor as a knob one can change between training and evaluation depending on how much compute one wants to spend.
 
Router z-loss, introduced in ST-MoE, significantly improves training stability without quality degradation by penalizing large logits entering the gating network. Since this loss encourages absolute magnitude of values to be smaller, roundoff errors are reduced, which can be quite impactful for exponential functions such as the gating.
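A sketch of that z-loss, L_z = (1/B) Σ_i (log Σ_j exp(x_ij))², applied to the router logits and added to the total loss with a small coefficient (on the order of 1e-3 in ST-MoE):

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE router z-loss: mean squared logsumexp of the router logits.

    Penalizing large logits keeps the exponentials inside the gating softmax
    numerically well-behaved (fewer roundoff issues in low precision).
    router_logits: (num_tokens, num_experts)
    """
    z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    return (z ** 2).mean()
```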
 
ST-MoE: Designing Stable and Transferable Sparse Expert Models https://arxiv.org/abs/2202.08906
 
Experts specialize in the encoder (less so in the decoder).
No specialization by language in a multilingual setting.
 
Adding more experts gives diminishing returns in the efficiency boost.
 
Higher probability of overfitting, so use higher dropout and stronger regularization for the expert layers.
Token dropping can also act as regularization, so lowering the capacity factor to drop overflow tokens outright may actually help.
 
MoEs perform better at knowledge-heavy tasks such as QA; do they also do better on larger fine-tuning tasks?
 
 
One could experiment with freezing all non-expert weights. That is, we'll only update the MoE layers. This leads to a huge performance drop. We could try the opposite: freezing only the parameters in MoE layers, which worked almost as well as updating all parameters. This can help speed up and reduce memory for fine-tuning. This can be somewhat counter-intuitive as 80% of the parameters are in the MoE layers (in the ST-MoE project). Their hypothesis for that architecture is that, as expert layers only occur every 1/4 layers, and each token sees at most two experts per layer, updating the MoE parameters affects much fewer layers than updating other parameters.
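A sketch of the second setup, freezing only the MoE parameters and fine-tuning everything else; the substring match on parameter names is a hypothetical convention, so inspect model.named_parameters() for the real naming:

```python
from torch import nn

def freeze_moe_params(model: nn.Module, moe_keyword: str = "expert") -> None:
    """Freeze parameters whose names look like MoE/expert weights; train the rest.

    NOTE: `moe_keyword` is a guess at a naming convention, not a real API;
    check model.named_parameters() to find the actual expert module names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = moe_keyword not in name

# Then hand only the trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```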
 
 
MoEs Meets Instruction Tuning (July 2023) runs experiments with:
  • Single task fine-tuning
  • Multi-task instruction-tuning
  • Multi-task instruction-tuning followed by single-task fine-tuning
 
MoEs might benefit much more from instruction tuning than dense models
 
Let’s do a brief review of parallelism:
  • Data parallelism: the same weights are replicated across all cores, and the data is partitioned across cores.
  • Model parallelism: the model is partitioned across cores, and the data is replicated across cores.
  • Model and data parallelism: we can partition the model and the data across cores. Note that different cores process different batches of data.
  • Expert parallelism: experts are placed on different workers. If combined with data parallelism, each core holds a different expert and the data is partitioned across all cores (see the sketch after this list).
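A toy single-process simulation of the expert-parallel data flow (the routing is random here just to show how tokens move between cores; in a real system the exchange is an all-to-all collective and the results are sent back the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
num_cores = 4                      # one expert per core (expert parallelism)
d_model, tokens_per_core = 8, 6

# Data parallelism: each core holds its own shard of the token batch ...
shards = [rng.normal(size=(tokens_per_core, d_model)) for _ in range(num_cores)]
# ... expert parallelism: each core holds the weights of exactly one expert.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_cores)]

# A router (random here, standing in for a learned gate) picks an expert per token.
routes = [rng.integers(0, num_cores, size=tokens_per_core) for _ in range(num_cores)]

# "All-to-all": every core sends each of its tokens to the core owning that expert.
received = [
    np.concatenate([shards[src][routes[src] == dst] for src in range(num_cores)])
    for dst in range(num_cores)
]

# Each core applies its own expert only to the tokens it received.
outputs = [tokens @ expert_weights[core] for core, tokens in enumerate(received)]

for core, tokens in enumerate(received):
    print(f"core {core} processed {len(tokens)} tokens with its expert")
```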
 