Models is the most important part, so I plan to go through it carefully.
The plan is to spend a day each on rendering and models, and also take a quick look at DINO's architecture.
block.py
1. Basic attention block
LayerNorm1 + Self-Attention + LayerNorm2 +
MLP:
(2 × (Linear + Dropout): inner_dim → mlp_ratio × inner_dim, then back to inner_dim)
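A minimal sketch of this block, assuming pre-norm wiring and a GELU between the two Linear layers (names like BasicBlock and inner_dim are mine, not necessarily those in block.py):

import torch.nn as nn

class BasicBlock(nn.Module):
    # Hypothetical sketch: LayerNorm -> self-attention -> residual,
    # then LayerNorm -> MLP (inner_dim -> mlp_ratio * inner_dim -> inner_dim) -> residual.
    def __init__(self, inner_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(inner_dim)
        self.self_attn = nn.MultiheadAttention(inner_dim, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(inner_dim)
        self.mlp = nn.Sequential(
            nn.Linear(inner_dim, int(inner_dim * mlp_ratio)), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(int(inner_dim * mlp_ratio), inner_dim), nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (N, L, inner_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.norm2(x))                          # residual around MLP
        return x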
2. Cross attention block (takes in cond)
As before, every attention or MLP sub-layer is wrapped in a residual connection.
LayerNorm1 + Cross-Attention + LayerNorm2 + Self-Attention + LayerNorm3 +
MLP:
(2 × (Linear + Dropout): inner_dim → mlp_ratio × inner_dim, then back to inner_dim)
Note how cross-attention is defined: it takes kdim and vdim, both set to the condition's dimension.
Finally, a summary of the arguments nn.MultiheadAttention needs:
embed_dim, num_heads, kdim (opt), vdim (opt),
dropout, bias, batch_first
** the k and v dims can actually differ from embed_dim here
attn(q, k, v, need_weights=...)
The forward call returns a tuple, so the attention output is taken as self_attend(...)[0].
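A small usage sketch of these arguments in a cross-attention setting (the dims are made up; the point is that kdim/vdim let the keys and values live in the condition's space, and that the forward call returns a tuple):

import torch
import torch.nn as nn

inner_dim, cond_dim, num_heads = 256, 768, 8
cross_attn = nn.MultiheadAttention(
    embed_dim=inner_dim, num_heads=num_heads,
    kdim=cond_dim, vdim=cond_dim,       # k/v are in the condition's dim
    dropout=0.0, bias=True, batch_first=True)

x = torch.randn(2, 100, inner_dim)      # queries: the tokens being decoded
cond = torch.randn(2, 50, cond_dim)     # keys/values: the condition (e.g. image features)
out = cross_attn(x, cond, cond, need_weights=False)[0]  # (output, weights) -> take [0]
print(out.shape)                        # torch.Size([2, 100, 256])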
3. Condition Modulation Block
Takes both condition and modulation (i.e. adjustment) vectors.
The modulation acts on the normalization:
LayerNorm is replaced with ModLN,
which generates a shift & scale from the mod vector,
so the modulation is applied inside the norm.
Originally implemented in DiT:
Scalable Diffusion Models with Transformers (DiT)
We train latent diffusion models, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches
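A rough sketch of the ModLN idea (adaLN-style modulation as in DiT), assuming the shift & scale come from a small SiLU + Linear head on the mod vector; the layer names and the (1 + scale) form are assumptions:

import torch.nn as nn

class ModLN(nn.Module):
    # Hypothetical sketch: LayerNorm without learned affine params,
    # modulated by shift & scale predicted from the mod vector.
    def __init__(self, inner_dim, mod_dim):
        super().__init__()
        self.norm = nn.LayerNorm(inner_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(mod_dim, inner_dim * 2))

    def forward(self, x, mod):  # x: (N, L, inner_dim), mod: (N, mod_dim)
        shift, scale = self.mlp(mod).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)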
TransformerDecoder
Processes the input (with condition and modulation).
assert a, "b": raises an AssertionError with message b
logger.debug(): log the current state
logger.info()
Create all blocks with the same block type and run them one by one in the forward pass;
all block instances are created (with dim and cond_dim) via functools.partial.
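A sketch of how such a decoder might assemble identical blocks with functools.partial and run them sequentially (block_type and the per-block call signature are assumptions):

from functools import partial
import logging
import torch.nn as nn

logger = logging.getLogger(__name__)

class TransformerDecoder(nn.Module):
    def __init__(self, block_type, num_layers, inner_dim, cond_dim, mod_dim):
        super().__init__()
        assert num_layers > 0, "num_layers must be positive"  # assert cond, "msg"
        # one partially-applied constructor shared by every layer
        make_block = partial(block_type, inner_dim=inner_dim,
                             cond_dim=cond_dim, mod_dim=mod_dim)
        self.layers = nn.ModuleList([make_block() for _ in range(num_layers)])

    def forward(self, x, cond, mod):
        logger.debug("decoder input: %s", tuple(x.shape))  # log current state
        for layer in self.layers:
            x = layer(x, cond, mod)  # each block sees tokens + condition + modulation
        return x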
LRM model
1. Get encoder
Camera embedding
Convert the raw camera dim (why 12 + 4? presumably the flattened 3×4 extrinsics plus the 4 intrinsics fx, fy, cx, cy) to camera_embed,
from DiT, without pos_embedding
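Presumably the camera embedder is just a small MLP from the 16 raw values to camera_embed_dim; a sketch with an assumed Linear-SiLU-Linear stack:

import torch.nn as nn

class CameraEmbedder(nn.Module):
    # Hypothetical sketch: raw camera vector (N, 16) -> (N, camera_embed_dim)
    def __init__(self, raw_dim=16, embed_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(raw_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, camera):  # camera: (N, raw_dim)
        return self.mlp(camera)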
Pos embedding
self.pos_embed = nn.Parameter(
    torch.randn(1, 3 * triplane_low_res**2, transformer_dim)
    * (1.0 / transformer_dim) ** 0.5
)
(1, pixel count per plane × 3 planes, transformer_dim)
triplane_low_res → H, W of each triplane image
camera_dim → transformer_dim
To summarize:
- image_feature = encoder(image) (DINO) → encoder_feat_dim
- camera_embedding = camera_embedder(camera)
D_cam_raw(16) → camera_embed_dim
- Transformer
pos_embed → (cond_MHA + modulation_norm)
(N, L = 3*triplane_low_res^2, transformer_dim)
→ (N, L, transformer_dim)
- Plane generation
x = tokens.view(N, 3, H, W, transformer_dim)
reshape back to the triplane image dims
Upsampling
triplane_dim = the channel depth of each of the 3 plane images (e.g. 64)
kernel size & stride = 2 → doubles the image size
planes → (N, 3, D, H*, W*), with H*, W* = 2H, 2W (a shape sketch follows after this summary)
- Synthesize (Render)
Output → rendered images (N, M (camera pose count), C_img (3?), H, W)
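The token-to-plane bookkeeping can be checked with a few lines; the dims below (32, 1024, 80) are illustrative, and the ConvTranspose2d upsampler with kernel size & stride 2 is my reading of the note above:

import torch
import torch.nn as nn

N, H, W = 2, 32, 32                       # H = W = triplane_low_res
transformer_dim, triplane_dim = 1024, 80  # illustrative values

tokens = torch.randn(N, 3 * H * W, transformer_dim)                 # transformer output (N, L, D)
x = tokens.view(N, 3, H, W, transformer_dim)                        # (N, 3, H, W, transformer_dim)
x = x.permute(0, 1, 4, 2, 3).reshape(N * 3, transformer_dim, H, W)  # (3N, transformer_dim, H, W)
upsampler = nn.ConvTranspose2d(transformer_dim, triplane_dim, kernel_size=2, stride=2)
x = upsampler(x)                                                    # (3N, triplane_dim, 2H, 2W)
planes = x.reshape(N, 3, triplane_dim, 2 * H, 2 * W)                # (N, 3, triplane_dim, 2H, 2W)
print(planes.shape)                                                 # torch.Size([2, 3, 80, 64, 64])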
Summary of Dimensions at Each Stage
- Input Image:
- Shape:
[N, 3, H_img, W_img]
- Image Features:
- After Encoder:
[N, encoder_feat_dim]
- Camera Embeddings:
- After Camera Embedder:
[N, camera_embed_dim]
- Transformer Input:
- Positional Embeddings Repeated:
[N, L, transformer_dim]
- Transformer Output (Tokens):
- Shape:
[N, L, transformer_dim]
- Tokens Reshaped for Planes:
- Before Upsampling:
[N, 3, H, W, transformer_dim]
- After Reordering:
[3*N, transformer_dim, H, W]
- Upsampled Planes:
- After Upsampler:
[3*N, triplane_dim, 2*H, 2*W]
- Final Planes:
- Reshaped Back:
[N, 3, triplane_dim, 2*H, 2*W]
- Rendered Images:
- Shape:
[N, M, C_img, H_render, W_render]
Role of Camera and Image Dimensions
- Image Dimensions:
- The input images are encoded into a compact feature representation, capturing the visual content.
- The feature vector is used to condition the transformer, influencing the generation of triplanar features.
- Camera Dimensions:
- Raw camera parameters are embedded into a feature space.
- These embeddings modulate the transformer, allowing it to account for viewpoint information during plane generation.
- Transformer Integration:
- The transformer combines positional embeddings, image features, and camera embeddings to produce tokens that represent triplanar features.
- The sequence length L is directly related to the resolution of the triplanes.
- Plane Generation:
- Tokens are reshaped and upsampled to create high-resolution triplanar feature planes.
- These planes encapsulate 3D information and are used for rendering novel views.
- Author:ran2323
- URL:https://www.blueif.me//article/13f71a79-6e22-802c-b347-d72d78b5463f
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!