Guide to training a LoRA for Wan 2.1 with musubi-tuner (main branch)

0. Why musubi-tuner?

Pros	Cons
Large VRAM savings from fp8 + block-swap (24 GB is enough to train on 720×1280 images)	Long CLI with many options that are not yet well tested
Supports every Wan 2.1 task (T2V, I2V, Fun-Control)	You must build the environment yourself (no stable GUI yet)
Flexible dataset builder (images, videos, control videos)	Documentation is scattered; expect to read both the README and the Issues

1. Install musubi-tuner and its dependencies

git clone https://github.com/kohya-ss/musubi-tuner.git
cd musubi-tuner
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt          # target: Torch 2.4 + CUDA 12; install PyTorch first if this step does not pull it in

If you have less than 16 GB of VRAM, also install bitsandbytes and use --fp8_base together with --optimizer_type adamw8bit to save memory.

2. Download the Wan 2.1 checkpoints

Weights	Path	Required?
DiT (`*.safetensors`)	Hugging Face Comfy-Org/Wan_2.1_ComfyUI_repackaged	✔
VAE	`Wan_2.1_VAE.pth` or `wan_2.1_vae.safetensors`	✔
T5 encoder	`models_t5_umt5-xxl-enc-bf16.pth`	✔
CLIP (I2V)	`models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth`	Required for I2V only

Save them in the models/wan directory.
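
Before starting a long run it can help to confirm that the files from the table are actually in place. The following is a minimal sketch, not part of musubi-tuner: the helper name `missing_weights` is hypothetical, and it assumes that any `.safetensors` file in the folder other than the VAE is the DiT checkpoint.

```python
from pathlib import Path

# File names taken from the table above; the VAE may ship under either name.
VAE_NAMES = ("Wan_2.1_VAE.pth", "wan_2.1_vae.safetensors")
T5_NAME = "models_t5_umt5-xxl-enc-bf16.pth"
CLIP_NAME = "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"


def missing_weights(model_dir: str, i2v: bool = False) -> list[str]:
    """Return a label for every required weight file that is absent."""
    root = Path(model_dir)
    missing = []
    # Assumption: the DiT file name varies by task, so accept any
    # .safetensors file that is not the VAE.
    dit_files = [p for p in root.glob("*.safetensors") if p.name not in VAE_NAMES]
    if not dit_files:
        missing.append("DiT (*.safetensors)")
    if not any((root / name).is_file() for name in VAE_NAMES):
        missing.append("VAE")
    if not (root / T5_NAME).is_file():
        missing.append("T5 encoder")
    if i2v and not (root / CLIP_NAME).is_file():
        missing.append("CLIP (I2V)")
    return missing


if __name__ == "__main__":
    print(missing_weights("models/wan", i2v=True) or "all weights found")
```

Pass `i2v=True` only when training an I2V task, since the CLIP weights are not needed otherwise.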

3. Prepare the data and the TOML file

3.1 Frame rate

Recommendation: use 16 fps video when training Wan 2.1.
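
Source clips rarely arrive at 16 fps, so one option is to resample them with ffmpeg beforehand. The sketch below only builds the command line. It assumes ffmpeg is on the PATH, the input folder `/data/raw` is a placeholder, and `resample_command` is a hypothetical helper name.

```python
import shlex
from pathlib import Path


def resample_command(src: str, dst_dir: str, fps: int = 16) -> list[str]:
    """Build an ffmpeg command that re-encodes `src` at a constant frame rate."""
    dst = Path(dst_dir) / Path(src).name
    # -vf fps=N drops or duplicates frames to reach the target rate;
    # -an removes the audio track, which this sketch assumes is not needed.
    return ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", "-an", str(dst)]


if __name__ == "__main__":
    # /data/raw is a placeholder for wherever the original clips live.
    for clip in sorted(Path("/data/raw").glob("*.mp4")):
        print(shlex.join(resample_command(str(clip), "/data/videos")))
```

Printing the commands first makes it easy to review them before passing each list to `subprocess.run`.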

3.2 A simple video dataset example

[general]
resolution = [720, 1280]      # W, H (both must be divisible by 16)
batch_size = 1
enable_bucket = true          # auto-bucket clips whose sizes differ

[[datasets]]
video_directory   = "/data/videos"
caption_extension = ".txt"    # one .txt caption per .mp4, same base name
cache_directory   = "/cache/wan"
target_frames     = [1, 25, 49, 65]   # each value must be 4n + 1 (65 = 4×16 + 1)
frame_extraction  = "head"

3.3 Dataset with control video (Fun-Control training)

[[datasets]]
video_directory   = "/data/v"          # video.mp4
control_directory = "/data/edge"       # matching edge map
cache_directory   = "/cache/wan_ctrl"
target_frames     = [1, 33, 65]
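
A control dataset is only useful if every training clip has a counterpart. The sketch below lists the clips that lack one. It assumes the control file shares the video's base name, and `unpaired_stems` is a hypothetical helper.

```python
from pathlib import Path


def unpaired_stems(video_dir: str, other_dir: str, other_ext: str) -> list[str]:
    """List base names of .mp4 files that lack `<name><other_ext>` in other_dir."""
    other = Path(other_dir)
    return sorted(
        clip.stem
        for clip in Path(video_dir).glob("*.mp4")
        if not (other / f"{clip.stem}{other_ext}").is_file()
    )


if __name__ == "__main__":
    # Control videos: assumed to reuse the training clip's file name.
    print("missing control:", unpaired_stems("/data/v", "/data/edge", ".mp4"))
    # The same helper also finds clips without a caption file.
    print("missing caption:", unpaired_stems("/data/v", "/data/v", ".txt"))
```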

4. Preprocessing (run both caching steps before training)

4.1 Cache latents

python wan_cache_latents.py \
  --dataset_config dataset.toml --vae models/wan/wan_2.1_vae.safetensors \
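  --fp8_vae        # if VRAM is tight; confirm with --help that your checkout has this flag

For I2V training, the CLIP weights are also expected at this step: add --clip models/wan/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth (check the README of your checkout).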

4.2 Cache text-encoder outputs

python wan_cache_text_encoder_outputs.py \
  --dataset_config dataset.toml \
  --t5 models/wan/models_t5_umt5-xxl-enc-bf16.pth \
  --batch_size 16  --fp8_t5        # adjust to your VRAM
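
When the dataset changes often, the two caching steps can be scripted so that neither is forgotten. This is a thin wrapper sketch that uses only flags already shown in this section. The function names are hypothetical, and it is meant to be run from the musubi-tuner checkout with the virtual environment active.

```python
import subprocess
import sys


def cache_commands(config: str, vae: str, t5: str, batch_size: int = 16,
                   low_vram: bool = False) -> list[list[str]]:
    """Return the two caching commands from this section as argument lists."""
    latents = [sys.executable, "wan_cache_latents.py",
               "--dataset_config", config, "--vae", vae]
    text = [sys.executable, "wan_cache_text_encoder_outputs.py",
            "--dataset_config", config, "--t5", t5,
            "--batch_size", str(batch_size)]
    if low_vram:
        text.append("--fp8_t5")  # flag as used in section 4.2
    return [latents, text]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        # check=True raises instead of letting training start on a partial cache.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(cache_commands(
        "dataset.toml",
        "models/wan/wan_2.1_vae.safetensors",
        "models/wan/models_t5_umt5-xxl-enc-bf16.pth",
    ))
```

Using `sys.executable` keeps both steps on the interpreter of the active virtual environment.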

5. LoRA training command

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 \
  wan_train_network.py \
  --task i2v-14B \
  --dit  models/wan/wan2.1_i2v_720p_14B_bf16.safetensors \
  --dataset_config dataset.toml \
  --network_module networks.lora_wan \
  --network_dim 32 \
  --optimizer_type adamw8bit --learning_rate 2e-4 \
  --gradient_checkpointing \
  --fp8_base --sdpa \
  --timestep_sampling shift --discrete_flow_shift 3.0 \
  --max_train_epochs 10 --save_every_n_epochs 1 \
  --output_dir lora_out --output_name my_wan_lora

Key parameters:

Flag	Suggestion
`--network_dim`	16–32 for a light style; raise it to retain more detail
`--fp8_base`	runs the DiT in FP8, cutting VRAM by roughly 40%
`--gradient_checkpointing`	lowers VRAM at a cost of about 10% speed
`--timestep_sampling` + `--discrete_flow_shift`	needs testing; start with `shift` / `3.0`, then tune
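
To build some intuition for `--discrete_flow_shift`, it helps to see how a shift value remaps timesteps. The sketch below uses the remapping common in flow-matching trainers, t' = shift·t / (1 + (shift − 1)·t), with t = 1 meaning pure noise. Whether musubi-tuner applies exactly this form is an assumption to verify in its source.

```python
def shift_timestep(t: float, shift: float) -> float:
    """Remap t in [0, 1] with shift*t / (1 + (shift - 1)*t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


if __name__ == "__main__":
    for shift in (1.0, 3.0, 5.0):
        row = [round(shift_timestep(t, shift), 3) for t in (0.1, 0.25, 0.5, 0.75, 0.9)]
        print(f"shift={shift}: {row}")
```

With shift 3.0 the midpoint 0.5 moves to 0.75, so more training steps land at high noise levels. A shift of 1.0 leaves the distribution unchanged.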

6. Monitoring and troubleshooting

As a rough guide, the loss should fall to about 0.4 after 1,000 steps.
Use --sample_prompts sample.txt --sample_every_n_epochs 1 to generate test clips.
Out of memory? Add --blocks_to_swap 28 or lower batch_size.
The LoRA bleeds into unrelated prompts → lower its strength to about 0.6 at inference (for example --lora_multiplier 0.6 with wan_generate_video.py).
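
Per-step loss is noisy because each step draws different timesteps, so a smoothed curve is easier to judge than raw values. This is a minimal sketch of a bias-corrected exponential moving average. Where the loss values come from (for instance, an export of the training logs) is left open, and `smooth` is a hypothetical helper.

```python
def smooth(losses: list[float], beta: float = 0.98) -> list[float]:
    """Exponential moving average with bias correction, one value per step."""
    out, avg = [], 0.0
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Divide by (1 - beta^step) so early values are not pulled toward zero.
        out.append(avg / (1.0 - beta ** step))
    return out


if __name__ == "__main__":
    raw = [0.9, 0.7, 0.8, 0.5, 0.6, 0.4, 0.5, 0.35]  # made-up values for illustration
    print([round(v, 3) for v in smooth(raw, beta=0.9)])
```

Compare the smoothed value at two checkpoints rather than reading single steps. A flat or rising smoothed curve is a better signal to revisit the learning rate or dataset than any one spike.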

7. Export and use the LoRA

Convert the format (only if another tool, such as ComfyUI, needs it):

python convert_lora.py \
  --input lora_out/my_wan_lora.safetensors \
  --output lora_out/my_wan_lora_converted.safetensors \
  --target other        # flag names per the musubi-tuner README; verify with --help

Inference (musubi-tuner's own script loads the LoRA file as trained):

python wan_generate_video.py --fp8 --task i2v-14B \
  --lora_weight lora_out/my_wan_lora.safetensors --lora_multiplier 1.0 \
  ... (other inference flags as usual)

8. Conclusion

With musubi-tuner, fine-tuning a LoRA for Wan 2.1 becomes feasible on a 24 GB GPU (and even on 12 GB with block-swap). The most important points are:

Standardize the dataset at 16 fps and keep the TOML syntactically valid.
Control VRAM with fp8, gradient checkpointing, and block swapping.
Experiment with the flow shift and timestep_sampling values to find the best setting for your style.

Good luck creating distinctive videos that carry your own signature!