Models is the most important part, so I plan to go through it carefully.
The plan is to spend one day each on rendering and models, and also take a look at the DINO architecture along the way.
 

block.py

 

1. Basic attention block

LayerNorm1 + Self-Attention + LayerNorm2 +
MLP:
2 × (Linear (inner_dim → mlp_ratio× inner_dim, then project back) + Dropout)
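The pre-norm block above can be sketched roughly like this (a minimal sketch, not the repo's exact code; the name `BasicBlock` and the `mlp_ratio=4.0` default are assumptions):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Pre-norm self-attention block: x + Attn(LN1(x)), then x + MLP(LN2(x))."""
    def __init__(self, inner_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(inner_dim)
        self.self_attn = nn.MultiheadAttention(
            inner_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(inner_dim)
        hidden = int(inner_dim * mlp_ratio)
        # MLP: expand by mlp_ratio, project back; Dropout after each Linear
        self.mlp = nn.Sequential(
            nn.Linear(inner_dim, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, inner_dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        # MultiheadAttention returns (attn_output, attn_weights); keep [0]
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 64)           # (N, L, inner_dim)
out = BasicBlock(inner_dim=64, num_heads=4)(x)
```

Note the residual connections: each sub-layer adds its output back onto the input, so shapes are preserved end to end.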
 

2. Cross-attention block (takes in cond)

Also, every attention and MLP sub-layer has a residual connection around it.
LayerNorm1 + Cross-Attention + LayerNorm2 + Self-Attention + LayerNorm3 +
MLP:
2 × (Linear (inner_dim → mlp_ratio× inner_dim, then project back) + Dropout)
 
 
Note how the cross-attention is defined: it passes kdim and vdim, both taken from the condition.
To summarize the parameters nn.MultiheadAttention takes:
embed_dim, num_heads, kdim (optional), vdim (optional),
dropout, bias, batch_first
**The k and v dims here can actually differ from embed_dim.
 
attn(q, k, v, need_weights)
 
Return type → a (attn_output, attn_weights) tuple, so the block takes self_attn(...)[0]
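A quick demo that `kdim`/`vdim` may differ from `embed_dim` (the query dim), which is exactly what cross-attending to a condition needs; the dims here are made up for illustration:

```python
import torch
import torch.nn as nn

# query lives in the model dim (64); key/value come from a condition of dim 32
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4,
                             kdim=32, vdim=32, batch_first=True)

q = torch.randn(2, 10, 64)    # (N, L_q, embed_dim)
cond = torch.randn(2, 7, 32)  # (N, L_cond, kdim == vdim)

# forward returns (attn_output, attn_weights)
out, weights = attn(q, cond, cond, need_weights=True)
# out: (2, 10, 64) -- same shape as the query
# weights: (2, 10, 7) -- averaged over heads by default
```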

3. Condition Modulation Block

Takes both condition and modulation (i.e. adjustment) vectors.

 
The mod vector acts on the norm:
LayerNorm is replaced with ModLN,
which generates shift & scale from mod,
so the normalization itself is modulated.
 
Originally implemented in DiT:
Scalable Diffusion Models with Transformers (DiT)
We train latent diffusion models, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches
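A minimal sketch of this modulated norm in the adaLN style of DiT (the name `ModLN` comes from the notes above; the SiLU+Linear regressor and the `(1 + scale)` form are assumptions based on DiT, not the repo's exact code):

```python
import torch
import torch.nn as nn

class ModLN(nn.Module):
    """LayerNorm modulated by a condition vector:
    shift & scale are regressed from `mod` and applied after a plain LayerNorm."""
    def __init__(self, inner_dim, mod_dim):
        super().__init__()
        # elementwise_affine=False: the affine part comes from `mod` instead
        self.norm = nn.LayerNorm(inner_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(mod_dim, inner_dim * 2))

    def forward(self, x, mod):
        # mod: (N, mod_dim) -> shift, scale: (N, inner_dim) each
        shift, scale = self.mlp(mod).chunk(2, dim=-1)
        # broadcast over the sequence dim
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 16, 64)   # (N, L, inner_dim)
mod = torch.randn(2, 32)     # (N, mod_dim)
y = ModLN(64, 32)(x, mod)
```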
 
 

TransformerDecoder

 
Processes the input (with condition and modulation).
assert a, "b" raises an AssertionError with message "b" when a is falsy
 
logger.debug() reports the current state (debug level)
logger.info() (info level)
 
All blocks are created with the same block type and run one by one in the forward pass.
 
All block instances are created (with dim and cond_dim) using functools.partial.
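The partial-based construction can be sketched like this (the block here is a trivial placeholder standing in for the real condition-modulation block; all names and dims are hypothetical):

```python
import torch
import torch.nn as nn
from functools import partial

class ConditionModulationBlock(nn.Module):
    """Placeholder for the real block; only the constructor signature matters here."""
    def __init__(self, inner_dim, cond_dim, mod_dim, num_heads):
        super().__init__()
        self.proj = nn.Linear(inner_dim, inner_dim)
    def forward(self, x, cond, mod):
        return self.proj(x)

# Freeze the shared hyperparameters once with functools.partial,
# then instantiate identical blocks in a loop.
make_block = partial(ConditionModulationBlock,
                     inner_dim=64, cond_dim=128, mod_dim=32, num_heads=4)
layers = nn.ModuleList([make_block() for _ in range(12)])

def forward(x, cond, mod):
    # feed the tokens through the blocks one by one
    for layer in layers:
        x = layer(x, cond, mod)
    return x

x = torch.randn(2, 8, 64)
out = forward(x, cond=None, mod=None)
```

partial keeps the per-model configuration in one place, so the loop body stays a plain no-argument call.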
 
 

LRM model

 

1. Get encoder

 

Camera embedding

Converts the raw camera dim (why 12+4? my guess: the flattened 3×4 extrinsic matrix plus 4 intrinsics) to camera_embed
 
borrowed from DiT, but without pos_embedding
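A minimal sketch of such a camera embedder as a small MLP; the 12+4 split, the SiLU activation, and `camera_embed_dim=128` are assumptions, not the repo's exact values:

```python
import torch
import torch.nn as nn

D_cam_raw = 16          # assumed: 12 (flattened 3x4 extrinsics) + 4 (intrinsics)
camera_embed_dim = 128  # hypothetical value

# simple 2-layer MLP: raw camera vector -> embedding
camera_embedder = nn.Sequential(
    nn.Linear(D_cam_raw, camera_embed_dim),
    nn.SiLU(),
    nn.Linear(camera_embed_dim, camera_embed_dim),
)

cam = torch.randn(2, D_cam_raw)     # (N, D_cam_raw)
emb = camera_embedder(cam)          # (N, camera_embed_dim)
```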
 
 

Pos embedding

self.pos_embed = nn.Parameter(
    torch.randn(1, 3 * triplane_low_res**2, transformer_dim)
    * (1. / transformer_dim) ** 0.5
)
 
shape: (1, per-plane pixel count * 3 (planes), transformer_dim)
 
triplane_low_res → H, W of each triplane image
 
 
camera_dim → transformer_dim
 
To summarize:
 
  1. image_feature = encoder(image) (DINO) → encoder_feat_dim
 
  2. camera_embedding = camera_embedder(camera)
D_cam_raw(16) → camera_embed_dim
 
  3. Transformer
 
pos_embed → (cond_MHA + modulation_norm)
(N, L = 3*width^2, transformer_dim)
 
→ (N, L, transformer_dim)
 
 
  4. Plane generation
x = tokens.view(N, 3, H, W, transformer_dim)
back to the triplane image dims
 
 
Upsampling
triplane_dim = the depth dim of each of the 3 plane images (e.g. 64)
kernel size & stride = 2 → doubles the image size
 
planes → (N, 3, D, H*, W*), where H*, W* = 2H, 2W
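The doubling step can be sketched with a transposed convolution; with kernel_size == stride == 2 (no overlap, no padding) the output is exactly 2H × 2W. The dims here are hypothetical:

```python
import torch
import torch.nn as nn

transformer_dim, triplane_dim = 64, 32  # hypothetical dims

# kernel_size == stride == 2 exactly doubles H and W
upsampler = nn.ConvTranspose2d(transformer_dim, triplane_dim,
                               kernel_size=2, stride=2)

N, H, W = 2, 32, 32
# the 3 planes are folded into the batch dim for the conv
x = torch.randn(N * 3, transformer_dim, H, W)
y = upsampler(x)                                 # (3N, triplane_dim, 2H, 2W)
planes = y.view(N, 3, triplane_dim, 2 * H, 2 * W)
```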
 
  5. Synthesize (Render)
 
 
Output → rendered images (N, M (camera pose count), C_img (3?), H, W)
 
 

Summary of Dimensions at Each Stage

  1. Input Image:
      • Shape: [N, 3, H_img, W_img]
  2. Image Features:
      • After Encoder: [N, encoder_feat_dim]
  3. Camera Embeddings:
      • After Camera Embedder: [N, camera_embed_dim]
  4. Transformer Input:
      • Positional Embeddings Repeated: [N, L, transformer_dim]
  5. Transformer Output (Tokens):
      • Shape: [N, L, transformer_dim]
  6. Tokens Reshaped for Planes:
      • Before Upsampling: [N, 3, H, W, transformer_dim]
      • After Reordering: [3*N, transformer_dim, H, W]
  7. Upsampled Planes:
      • After Upsampler: [3*N, triplane_dim, 2*H, 2*W]
  8. Final Planes:
      • Reshaped Back: [N, 3, triplane_dim, 2*H, 2*W]
  9. Rendered Images:
      • Shape: [N, M, C_img, H_render, W_render]
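The stage-by-stage shapes above can be checked with pure shape bookkeeping (all dims hypothetical):

```python
# Pure shape arithmetic mirroring the stage list; no tensors needed.
N = 2
triplane_low_res, triplane_dim, transformer_dim = 32, 32, 64

H = W = triplane_low_res
L = 3 * H * W                       # one token per triplane "pixel"

tokens    = (N, L, transformer_dim)            # transformer output
reshaped  = (N, 3, H, W, transformer_dim)      # tokens.view(...)
reordered = (3 * N, transformer_dim, H, W)     # planes folded into batch for conv
upsampled = (3 * N, triplane_dim, 2 * H, 2 * W)
final     = (N, 3, triplane_dim, 2 * H, 2 * W)

# token count must match the flattened plane grid
assert tokens[1] == reshaped[1] * reshaped[2] * reshaped[3]
```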
 
 

Role of Camera and Image Dimensions

  • Image Dimensions:
    • The input images are encoded into a compact feature representation, capturing the visual content.
    • The feature vector is used to condition the transformer, influencing the generation of triplanar features.
  • Camera Dimensions:
    • Raw camera parameters are embedded into a feature space.
    • These embeddings modulate the transformer, allowing it to account for viewpoint information during plane generation.
  • Transformer Integration:
    • The transformer combines positional embeddings, image features, and camera embeddings to produce tokens that represent triplanar features.
    • The sequence length L is directly related to the resolution of the triplanes.
  • Plane Generation:
    • Tokens are reshaped and upsampled to create high-resolution triplanar feature planes.
    • These planes encapsulate 3D information and are used for rendering novel views.
 
 
 
 