A one-dimensional input and its output after normalization

The "effect": faster training convergence, more "regular" inputs, reduced overfitting, and better generalization

Normalization: "adjusting the data distribution"
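
As a concrete illustration of "adjusting the distribution" (a toy example of my own, using LayerNorm as a stand-in for normalization in general): whatever mean and scale the features come in with, each normalized row ends up with roughly zero mean and unit variance.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4) * 5 + 3        # features with mean ~3 and std ~5
y = nn.LayerNorm(4)(x)               # normalize over the last dimension

print(y.mean(-1))                    # per-row mean ~0
print(y.std(-1, unbiased=False))     # per-row std ~1
```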

The currently popular flavor of LayerNorm: RMSNorm
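
For reference, the standard RMSNorm definition (not shown on the original slide) drops LayerNorm's mean-centering and rescales by the root mean square of the features:

$$
\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma
$$

where $d$ is the hidden size, $\epsilon$ is a small constant, and $\gamma$ is a learnable per-feature scale.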
How can we implement RMSNorm ourselves with torch operators?
Hand-rolled
import torch
variance_epsilon = 1e-6  # small constant for numerical stability (value assumed here)
input = input.to(torch.float32)  # compute in float32 for stability
variance = input.pow(2).mean(-1, keepdim=True)  # mean of squares over the last dim
hidden_states = input * torch.rsqrt(variance + variance_epsilon)  # scale by 1 / RMS
Calling it directly
layerNorm = nn.RMSNorm([4])  # built-in RMSNorm (available since PyTorch 2.4), normalized over the last dim of size 4
hidden_states = layerNorm(input)
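
A quick sanity check (a minimal sketch added here, with a toy 2x4 input; since nn.RMSNorm's elementwise weight defaults to ones, the two versions should agree up to numerical precision):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4)

# hand-rolled: scale by the reciprocal root mean square of the last dimension
eps = 1e-6
manual = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# built-in (PyTorch >= 2.4); the learnable weight is initialized to ones
rms_norm = nn.RMSNorm([4], eps=eps)
builtin = rms_norm(x)

print(torch.allclose(manual, builtin, atol=1e-5))  # expect True
```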
The 2D implementation of RoPE
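
To make the 2D case concrete, here is a small sketch of my own (the function name rope_2d and the sample angle are not from the slide): a 2-dimensional query/key vector at position m is simply rotated by the angle m * theta.

```python
import torch

def rope_2d(x, m, theta):
    """Rotate the 2D vector x by the position-dependent angle m * theta."""
    angle = torch.tensor(m * theta)
    cos, sin = torch.cos(angle), torch.sin(angle)
    rot = torch.stack([
        torch.stack([cos, -sin]),
        torch.stack([sin,  cos]),
    ])
    return rot @ x

q = torch.tensor([1.0, 0.0])
print(rope_2d(q, m=2, theta=0.5))  # q rotated by 1.0 radian
```

Because rotations compose additively, the inner product of a rotated query at position $m$ with a rotated key at position $n$ depends only on the relative offset $m-n$, which is the point of RoPE.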

The n-dimensional implementation of RoPE

Goal: implement RoPE (the corresponding rotation matrix)
Building the RoPE matrix:

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Applies the rotary position embedding to the query and key tensors."""
    cos = cos.unsqueeze(unsqueeze_dim)  # add a head dim so cos/sin broadcast over heads
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
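
The cos/sin tables passed in above come from the rotary frequencies. A minimal sketch of how they can be built and used (the names head_dim, seq_len and the base 10000 follow the usual RoPE convention; this is my own condensed version, not a verbatim excerpt from transformers):

```python
import torch

head_dim, seq_len, base = 64, 16, 10000.0

# one frequency per pair of dimensions: base^(-2i / head_dim)
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(seq_len).float()

freqs = torch.outer(positions, inv_freq)       # (seq_len, head_dim // 2)
emb = torch.cat((freqs, freqs), dim=-1)        # (seq_len, head_dim)
cos, sin = emb.cos()[None], emb.sin()[None]    # add a batch dim for broadcasting

# q, k: (batch, num_heads, seq_len, head_dim); reuse apply_rotary_pos_emb from above
q = torch.randn(1, 8, seq_len, head_dim)
k = torch.randn(1, 8, seq_len, head_dim)
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
```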

(mlp): LlamaMLP(
  (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
  (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
  (act_fn): SiLU()
)


Excerpted from transformers (src/transformers/models/llama/modeling_llama.py)
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
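
Putting the module printout and that forward line together, a minimal sketch of the SwiGLU-style MLP (hidden sizes 2048 and 8192 are taken from the printout above; the class below is my own reconstruction rather than the exact transformers source):

```python
import torch
import torch.nn as nn

class SwiGLUMLP(nn.Module):
    def __init__(self, hidden_size=2048, intermediate_size=8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # the gate branch goes through SiLU and gates the up branch elementwise
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP()
out = mlp(torch.randn(1, 5, 2048))   # (batch, seq_len, hidden) -> same shape
```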

* SiLU: $x\cdot\sigma(x)$, both smooth and gate-like; it is the activation commonly used inside SwiGLU
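
A tiny check of that definition (added here; torch.nn.functional.silu should match $x\cdot\sigma(x)$ elementwise):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))  # expect True
```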