Models is the most important part, so I plan to go through it carefully.
The plan is to spend a day each on rendering and models, and also take a quick look at DINO's architecture.
block.py
1. Basic attention block
LayerNorm1 + Self-Attention + LayerNorm2 +
MLP:
(2 × (Linear + Dropout): inner_dim → mlp_ratio × inner_dim, then back to inner_dim)
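A minimal sketch of this block, assuming pre-norm wiring and a GELU between the two Linear layers (names like BasicBlock and inner_dim are mine, not necessarily those in block.py):

import torch.nn as nn

class BasicBlock(nn.Module):
    # Hypothetical sketch: LayerNorm -> self-attention -> residual,
    # then LayerNorm -> MLP (inner_dim -> mlp_ratio * inner_dim -> inner_dim) -> residual.
    def __init__(self, inner_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(inner_dim)
        self.self_attn = nn.MultiheadAttention(inner_dim, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(inner_dim)
        self.mlp = nn.Sequential(
            nn.Linear(inner_dim, int(inner_dim * mlp_ratio)), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(int(inner_dim * mlp_ratio), inner_dim), nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (N, L, inner_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.norm2(x))                          # residual around MLP
        return x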
2. Cross attention block (takes in cond)
As before, every attention or MLP sub-layer is wrapped in a residual connection.
LayerNorm1 + Cross-Attention + LayerNorm2 + Self-Attention + LayerNorm3 +
MLP:
(2 × (Linear + Dropout): inner_dim → mlp_ratio × inner_dim, then back to inner_dim)
Note how cross-attention is defined: it takes kdim and vdim, both set to the condition's dimension.
Finally, a summary of the arguments nn.MultiheadAttention needs:
embed_dim, num_heads, kdim (opt), vdim (opt),
dropout, bias, batch_first
** the k and v dims can actually differ from embed_dim here
attn(q, k, v, need_weights=...)
The forward call returns a tuple, so the attention output is taken as self_attend(...)[0].
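A small usage sketch of these arguments in a cross-attention setting (the dims are made up; the point is that kdim/vdim let the keys and values live in the condition's space, and that the forward call returns a tuple):

import torch
import torch.nn as nn

inner_dim, cond_dim, num_heads = 256, 768, 8
cross_attn = nn.MultiheadAttention(
    embed_dim=inner_dim, num_heads=num_heads,
    kdim=cond_dim, vdim=cond_dim,       # k/v are in the condition's dim
    dropout=0.0, bias=True, batch_first=True)

x = torch.randn(2, 100, inner_dim)      # queries: the tokens being decoded
cond = torch.randn(2, 50, cond_dim)     # keys/values: the condition (e.g. image features)
out = cross_attn(x, cond, cond, need_weights=False)[0]  # (output, weights) -> take [0]
print(out.shape)                        # torch.Size([2, 100, 256])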
3. Condition Modulation Block
Takes both condition and modulation (i.e. adjustment) vectors.
The modulation acts on the normalization:
LayerNorm is replaced with ModLN,
which generates a shift & scale from the mod vector,
so the modulation is applied inside the norm.
Originally implemented in DiT:
Scalable Diffusion Models with Transformers (DiT)
We train latent diffusion models, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches
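A rough sketch of the ModLN idea (adaLN-style modulation as in DiT), assuming the shift & scale come from a small SiLU + Linear head on the mod vector; the layer names and the (1 + scale) form are assumptions:

import torch.nn as nn

class ModLN(nn.Module):
    # Hypothetical sketch: LayerNorm without learned affine params,
    # modulated by shift & scale predicted from the mod vector.
    def __init__(self, inner_dim, mod_dim):
        super().__init__()
        self.norm = nn.LayerNorm(inner_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(mod_dim, inner_dim * 2))

    def forward(self, x, mod):  # x: (N, L, inner_dim), mod: (N, mod_dim)
        shift, scale = self.mlp(mod).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)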
TransformerDecoder
Processes the input (with condition and modulation).
assert a, "b": raises an AssertionError with message b
logger.debug(): log the current state
logger.info()
Create all blocks with the same block type and run them one by one in the forward pass;
all block instances are created (with dim and cond_dim) via functools.partial.
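A sketch of how such a decoder might assemble identical blocks with functools.partial and run them sequentially (block_type and the per-block call signature are assumptions):

from functools import partial
import logging
import torch.nn as nn

logger = logging.getLogger(__name__)

class TransformerDecoder(nn.Module):
    def __init__(self, block_type, num_layers, inner_dim, cond_dim, mod_dim):
        super().__init__()
        assert num_layers > 0, "num_layers must be positive"  # assert cond, "msg"
        # one partially-applied constructor shared by every layer
        make_block = partial(block_type, inner_dim=inner_dim,
                             cond_dim=cond_dim, mod_dim=mod_dim)
        self.layers = nn.ModuleList([make_block() for _ in range(num_layers)])

    def forward(self, x, cond, mod):
        logger.debug("decoder input: %s", tuple(x.shape))  # log current state
        for layer in self.layers:
            x = layer(x, cond, mod)  # each block sees tokens + condition + modulation
        return x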
LRM model
1. Get encoder
Camera embedding
Convert the raw camera dim (why 12 + 4? presumably the flattened 3×4 extrinsics plus the 4 intrinsics fx, fy, cx, cy) to camera_embed,
from DiT, without pos_embedding
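Presumably the camera embedder is just a small MLP from the 16 raw values to camera_embed_dim; a sketch with an assumed Linear-SiLU-Linear stack:

import torch.nn as nn

class CameraEmbedder(nn.Module):
    # Hypothetical sketch: raw camera vector (N, 16) -> (N, camera_embed_dim)
    def __init__(self, raw_dim=16, embed_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(raw_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, camera):  # camera: (N, raw_dim)
        return self.mlp(camera)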
Pos embedding
self.pos_embed = nn.Parameter(
    torch.randn(1, 3 * triplane_low_res**2, transformer_dim)
    * (1.0 / transformer_dim) ** 0.5
)
(1, pixel count per plane × 3 planes, transformer_dim)
triplane_low_res → H, W of each triplane image
camera_dim → transformer_dim
To summarize:
- image_feature = encoder(image) (DINO) → encoder_feat_dim
- camera_embedding = camera_embedder(camera)
D_cam_raw(16) → camera_embed_dim
- Transformer
pos_embed → (cond_MHA + modulation_norm)
(N, L = 3*triplane_low_res^2, transformer_dim)
→ (N, L, transformer_dim)
- Plane generation
x = tokens.view(N, 3, H, W, transformer_dim)
reshape back to the triplane image dims
Upsampling
triplane_dim = the channel depth of each of the 3 plane images (e.g. 64)
kernel size & stride = 2 → doubles the image size
planes → (N, 3, D, H*, W*), with H*, W* = 2H, 2W (a shape sketch follows after this summary)
- Synthesize (Render)
Output → rendered images (N, M (camera pose count), C_img (3?), H, W)
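The token-to-plane bookkeeping can be checked with a few lines; the dims below (32, 1024, 80) are illustrative, and the ConvTranspose2d upsampler with kernel size & stride 2 is my reading of the note above:

import torch
import torch.nn as nn

N, H, W = 2, 32, 32                       # H = W = triplane_low_res
transformer_dim, triplane_dim = 1024, 80  # illustrative values

tokens = torch.randn(N, 3 * H * W, transformer_dim)                 # transformer output (N, L, D)
x = tokens.view(N, 3, H, W, transformer_dim)                        # (N, 3, H, W, transformer_dim)
x = x.permute(0, 1, 4, 2, 3).reshape(N * 3, transformer_dim, H, W)  # (3N, transformer_dim, H, W)
upsampler = nn.ConvTranspose2d(transformer_dim, triplane_dim, kernel_size=2, stride=2)
x = upsampler(x)                                                    # (3N, triplane_dim, 2H, 2W)
planes = x.reshape(N, 3, triplane_dim, 2 * H, 2 * W)                # (N, 3, triplane_dim, 2H, 2W)
print(planes.shape)                                                 # torch.Size([2, 3, 80, 64, 64])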
Summary of Dimensions at Each Stage
- Input Image:
- Shape:
[N, 3, H_img, W_img]
- Image Features:
- After Encoder:
[N, encoder_feat_dim]
- Camera Embeddings:
- After Camera Embedder:
[N, camera_embed_dim]
- Transformer Input:
- Positional Embeddings Repeated:
[N, L, transformer_dim]
- Transformer Output (Tokens):
- Shape:
[N, L, transformer_dim]
- Tokens Reshaped for Planes:
- Before Upsampling:
[N, 3, H, W, transformer_dim]
- After Reordering:
[3*N, transformer_dim, H, W]
- Upsampled Planes:
- After Upsampler:
[3*N, triplane_dim, 2*H, 2*W]
- Final Planes:
- Reshaped Back:
[N, 3, triplane_dim, 2*H, 2*W]
- Rendered Images:
- Shape:
[N, M, C_img, H_render, W_render]
Role of Camera and Image Dimensions
- Image Dimensions:
- The input images are encoded into a compact feature representation, capturing the visual content.
- The feature vector is used to condition the transformer, influencing the generation of triplanar features.
- Camera Dimensions:
- Raw camera parameters are embedded into a feature space.
- These embeddings modulate the transformer, allowing it to account for viewpoint information during plane generation.
- Transformer Integration:
- The transformer combines positional embeddings, image features, and camera embeddings to produce tokens that represent triplanar features.
- The sequence length L is directly related to the resolution of the triplanes.
- Plane Generation:
- Tokens are reshaped and upsampled to create high-resolution triplanar feature planes.
- These planes encapsulate 3D information and are used for rendering novel views.
- Author:ran2323
- URL:https://www.blueif.me//article/13f71a79-6e22-802c-b347-d72d78b5463f
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!