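Model	#Experts	Top-k	Notes
Mixtral 8x22B	8	2	Larger experts; inference remains sparse
DeepSeek-V3 (671B A37B)	256+1	8+1	256 routed + 1 shared expert; open-source MoE from China, efficient inference at scale
Qwen3-235B-A22B	128	8	Open-sourced by Alibaba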
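
To make the "sparse" claim concrete, the short sketch below computes what share of the experts runs for each token, using only the counts in the table. It assumes that "+1" denotes a shared expert that is always active; the dictionary layout and the helper name active_fraction are illustrative choices, not part of any library.

```python
# (routed experts, shared experts, routed experts chosen per token),
# taken from the table above; "+1" is read as one always-on shared expert.
models = {
    "Mixtral 8x22B": (8, 0, 2),
    "DeepSeek-V3": (256, 1, 8),
    "Qwen3-235B-A22B": (128, 0, 8),
}


def active_fraction(routed, shared, top_k):
    """Share of all experts that run for a single token."""
    return (top_k + shared) / (routed + shared)


fractions = {name: active_fraction(*cfg) for name, cfg in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of experts active per token")
```

Note that this is a ratio of expert counts, not of parameters: attention and embedding weights are used by every token regardless of routing.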

MoE code example: Gating Network

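import torch
import torch.nn as nn

class TopKGating(nn.Module):
    """Route each token to its k highest-scoring experts."""

    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        # One linear layer produces a score (logit) per expert.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        scores = self.router(x)                         # [batch, num_experts]
        # Keep the k best experts per token.
        topk_scores, topk_idx = torch.topk(scores, self.k, dim=-1)
        # Normalize over the selected experts only, so the k weights sum to 1.
        probs = torch.softmax(topk_scores, dim=-1)
        # Scatter the weights into a dense gate; unselected experts stay at 0.
        gate = torch.zeros_like(scores)
        gate.scatter_(-1, topk_idx, probs)
        return gate, topk_idx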

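The router scores every expert for each input; top-k keeps only the k best, so activation is sparse.
The probabilities in gate weight the expert outputs; topk_idx determines which experts are called.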
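
The sketch below shows one way the gate weights and topk_idx could be used to combine expert outputs. It is a minimal illustration, not a production layer: the SparseMoE class, the two-layer GELU experts, and the dimension names are assumptions, and the per-expert Python loop favors readability over speed. The TopKGating class is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn


class TopKGating(nn.Module):
    """Same router as in the example above, repeated for self-containment."""

    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        scores = self.router(x)
        topk_scores, topk_idx = torch.topk(scores, self.k, dim=-1)
        probs = torch.softmax(topk_scores, dim=-1)
        gate = torch.zeros_like(scores)
        gate.scatter_(-1, topk_idx, probs)
        return gate, topk_idx


class SparseMoE(nn.Module):
    """Hypothetical MoE layer: each expert is a small two-layer FFN."""

    def __init__(self, hidden_dim, ffn_dim, num_experts, k=2):
        super().__init__()
        self.gating = TopKGating(hidden_dim, num_experts, k)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: [tokens, hidden_dim]
        gate, topk_idx = self.gating(x)         # gate: [tokens, num_experts]
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Indices of tokens that picked expert e among their top-k.
            token_ids = (topk_idx == e).any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue                        # expert e is idle for this batch
            weight = gate[token_ids, e].unsqueeze(-1)
            # Only the selected tokens pass through expert e.
            out.index_add_(0, token_ids, weight * expert(x[token_ids]))
        return out


torch.manual_seed(0)
moe = SparseMoE(hidden_dim=16, ffn_dim=32, num_experts=4, k=2)
x = torch.randn(5, 16)
y = moe(x)
gate, topk_idx = moe.gating(x)
print(y.shape, topk_idx)
```

Each token passes through only k of the experts, so compute per token scales with k rather than with the total number of experts.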

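Format	Mantissa bits	Exponent bits	Approx. range (min normal ~ max)	Typical use
FP32	23	8	1.18e-38 ~ 3.4e38	Full-precision training, critical evaluation
BF16	7	8	1.18e-38 ~ 3.4e38	Large-scale training; keeps FP32's dynamic range
FP16	10	5	6.10e-5 ~ 6.55e4	Mixed-precision training, inference
FP8 (E4M3 / E5M2)	3 / 2	4 / 5	1.56e-2 ~ 4.48e2 / 6.10e-5 ~ 5.73e4	Efficient inference/training on recent GPUs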
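
The ranges of these formats follow directly from the bit widths. The sketch below derives the smallest normal value and the largest finite value in plain Python. The helper name fp_range and its ieee_special flag are illustrative; the flag assumes that E4M3 follows the common FP8 convention in which the top exponent code still encodes finite numbers and only the all-ones mantissa pattern is reserved for NaN.

```python
def fp_range(exp_bits, man_bits, ieee_special=True):
    """Return (min normal, max finite) for a binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    if ieee_special:
        # IEEE-style: the all-ones exponent code is reserved for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2.0 - 2.0 ** (-man_bits)
    else:
        # Assumed E4M3 convention: the top exponent code is usable, and only
        # the all-ones mantissa pattern is given up for NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2.0 - 2.0 ** (1 - man_bits)
    return min_normal, max_frac * 2.0 ** max_exp


formats = {
    "FP32": fp_range(8, 23),
    "BF16": fp_range(8, 7),
    "FP16": fp_range(5, 10),
    "FP8 E4M3": fp_range(4, 3, ieee_special=False),
    "FP8 E5M2": fp_range(5, 2),
}
for name, (lo, hi) in formats.items():
    print(f"{name:9s} min normal = {lo:.3e}   max = {hi:.3e}")
```

BF16 shares FP32's 8 exponent bits and therefore its range, while FP16 spends more bits on the mantissa and gets finer precision over a much narrower range.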

Developing Intelligent LLM Applications

Learning path for LLM architectures

The classic Transformer architecture

Mixture-of-Experts (MoE)

Example of mixture of experts

Mixture of experts in NLP models

Sparsity in MoE

Example MoE architecture

MoE and Transformers

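Number of experts in mainstream MoE models

How to determine the experts in MoE

Feed-forward network (FFN)

Standard FFN implementation

MoE code example: Gating Network

Training MoE

Low-rank adaptation (LoRA)

The basic idea of LoRA

LoRA inference

LoRA implementation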

Floating-point representation

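Comparison of floating-point formats

From precision to quantization

Mixed precision illustrated

The core idea of quantization

Quantization illustrated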