AI-Generated Minecraft (open-oasis): A Light Code Read - 2 (DiT)

repeat tiles shift along the batch dimension until it matches x.shape[0] (the train batch), keeping the dims after it unmodified

control how much of one tensor (x) should pass through based on another tensor (g)

selectively turn parts of the input tensor “on” and “off”

gate is prepared the same way as shift and scale

return g * x
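The modulate and gate behaviour described above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's code: the helper name `expand_like` is hypothetical, `np.tile` stands in for the tensor `repeat` call, and the `x * (1 + scale) + shift` form is assumed from the usual DiT convention.

```python
import numpy as np

def expand_like(p, x):
    """Hypothetical helper: tile p along dim 0 to match x's batch size,
    then insert singleton dims before the last axis until ranks match."""
    reps = x.shape[0] // p.shape[0]
    p = np.tile(p, (reps,) + (1,) * (p.ndim - 1))  # dims after the batch dim stay unmodified
    while p.ndim < x.ndim:
        p = np.expand_dims(p, axis=-2)
    return p

def modulate(x, shift, scale):
    # Assumed DiT-style modulation: scale around 1, then shift.
    return x * (1 + expand_like(scale, x)) + expand_like(shift, x)

def gate(x, g):
    # g decides how much of x passes through (0 = off, 1 = unchanged).
    return expand_like(g, x) * x

b, t, h, w, d = 2, 3, 4, 4, 8
x = np.ones((b, t, h, w, d))
shift = np.full((b, t, d), 0.5)
scale = np.ones((b, t, d))
g = np.zeros((b, t, d))

y = modulate(x, shift, scale)  # 1 * (1 + 1) + 0.5 = 2.5 everywhere
z = gate(y, g)                 # a zero gate switches the branch off
```

With per-frame parameters of shape (b, t, d), the batch repeat count is 1 and only the broadcasting over h and w matters.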

the hidden dim gets one last mod, then is projected

the final output dim is patch_size * patch_size * out_channels

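Nothing much to say here: the features first get the spatial emb and attend within each image for a global summary, then go through the mlp

Then x attends to itself again with the temporal emb added. **Note that only q and k get the emb

**Also note that modulate is applied before every self-attend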
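One common scheme in which only q and k carry positional information is a rotary-style embedding. The sketch below assumes that scheme purely for illustration; the function names `rope` and `temporal_attend` are hypothetical, and the attention is a plain single-head softmax over the time axis.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (even, odd) feature pair of x by an angle proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs[None, :]         # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def temporal_attend(q, k, v):
    T, d = q.shape
    pos = np.arange(T, dtype=np.float64)
    q, k = rope(q, pos), rope(k, pos)           # v is left untouched
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out = temporal_attend(q, k, v)                  # (T, d)
```

Because the rotation is applied to both q and k, the attention score between two frames depends only on their relative offset, which is why v does not need the embedding.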

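Where does the information stored in c come from? Why does it carry enough information to generate something of size 6 * inner_feature_dim?

c is really just the timestep information, so it repeatedly conditions x in every block

In addition, c may include the external cond

Also, every attend or mlp layer in here has a gate, so each layer controls how much of the newly added data gets through; the gates are likewise produced from c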

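Two sets of shift, scale, gate

gate controls the residual; shift and scale act approximately like an activation

Applied before every attend and mlp

In final_layer it is likewise applied before the final mlp, ensuring every layer is followed by one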

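(adaptive activation function) This shows up in the linear layer that generates the mod: its weight is initialized to 0

The mod is learned adaptively

The same idea is also used in squeeze-and-excitation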
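A small NumPy sketch of how one condition vector c can yield two sets of shift, scale, gate, and what the zero initialization buys. The sub-layers are stand-in matrix multiplies (`W_attn`, `W_mlp` are hypothetical), the bias is also zeroed as an assumption, and the chunk order is illustrative rather than taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # inner feature dim (hypothetical size)

def silu(v):
    return v / (1.0 + np.exp(-v))

def layer_norm(v, eps=1e-6):
    mu = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps)

# Linear layer mapping c -> 6 * d, zero-initialised (bias zeroed by assumption).
W_mod = np.zeros((d, 6 * d))
b_mod = np.zeros(6 * d)

# Stand-ins for the attention and MLP sub-layers.
W_attn = rng.normal(size=(d, d))
W_mlp = rng.normal(size=(d, d))

def block(x, c):
    mod = silu(c) @ W_mod + b_mod
    shift1, scale1, gate1, shift2, scale2, gate2 = np.split(mod, 6, axis=-1)
    # Set 1: modulate before "attend", gate the residual branch.
    x = x + gate1 * ((layer_norm(x) * (1 + scale1) + shift1) @ W_attn)
    # Set 2: modulate before "mlp", gate the residual branch.
    x = x + gate2 * ((layer_norm(x) * (1 + scale2) + shift2) @ W_mlp)
    return x

x = rng.normal(size=(2, 5, d))  # (batch, tokens, d)
c = rng.normal(size=(2, 1, d))  # one condition vector per sample
out = block(x, c)               # equals x while W_mod is still all zeros
```

With the mod layer at zero, both gates are 0 and the block starts as the identity; training then learns how far to open each branch as a function of c.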

x is first patch_embed-ed to emb_dim: (b, t, h, w, d)

t goes through t_embed to become the condition: (b, t) → (N, D) → (b, t, D)

c += external_cond
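The (b, t) → (N, D) → (b, t, D) path for the timestep condition might look like the sketch below. It assumes a standard sinusoidal embedding and leaves out any MLP that follows it; `sinusoidal_embedding` and the zero `external_cond` are placeholders for illustration.

```python
import numpy as np

def sinusoidal_embedding(timesteps, dim, max_period=10000.0):
    """Map a flat vector of N timesteps to (N, dim) cos/sin features."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = timesteps[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

b, t, D = 2, 4, 16
timesteps = np.arange(b * t, dtype=np.float64).reshape(b, t)  # one noise level per frame

flat = timesteps.reshape(-1)          # (N,), N = b * t
emb = sinusoidal_embedding(flat, D)   # (N, D)
c = emb.reshape(b, t, D)              # (b, t, D)

external_cond = np.zeros((b, t, D))   # placeholder, e.g. embedded actions
c = c + external_cond
```

Flattening first lets the embedding treat every frame's timestep independently, and the reshape restores the per-frame layout that each block expects.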

x and c pass through SpatioTemporalDiTBlock

x itself has no spatial_emb added (for the positions in the patch layout)

Are positions only marked on q inside spatial_axial_attend (s_attend)?

((b, t) h, w, d) → ((b, t) h, w, p_s, p_s, c) → ((b, t) c, H, W)
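The shape chain above is the usual unpatchify step. Here is a minimal NumPy sketch, assuming the last dim is laid out as (p_s, p_s, c) in row-major order; the function name `unpatchify` and the sizes are illustrative.

```python
import numpy as np

def unpatchify(x, patch_size, out_channels):
    """(n, h, w, p*p*c) -> (n, c, h*p, w*p), with n = b * t."""
    n, h, w, d = x.shape
    p = patch_size
    assert d == p * p * out_channels
    x = x.reshape(n, h, w, p, p, out_channels)  # split the feature dim into a patch
    x = x.transpose(0, 5, 1, 3, 2, 4)           # (n, c, h, p, w, p)
    return x.reshape(n, out_channels, h * p, w * p)

b, t, h, w, p, ch = 2, 3, 2, 2, 2, 3
x = np.arange(b * t * h * w * p * p * ch, dtype=np.float64)
x = x.reshape(b * t, h, w, p * p * ch)
img = unpatchify(x, patch_size=p, out_channels=ch)  # (6, 3, 4, 4)
```

The transpose interleaves each patch's rows with the patch grid rows (and likewise for columns), so pixel (i * p + a, j * p + q) of channel k comes from patch (i, j), feature index (a * p + q) * c + k.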