VideoGrain is a zero-shot method for class-level, instance-level, and part-level video editing.
- Multi-grained Video Editing
  - class-level: editing objects within the same class (previous SOTA methods are limited to this level)
  - instance-level: editing each individual instance into a distinct object
  - part-level: adding new objects or modifying existing attributes at the part level
- Training-Free
  - does not require any training or fine-tuning
- One-Prompt Multi-region Control & In-depth Study of Cross-/Self-Attention
  - modulating cross-attention for multi-region control (visualizations available)
  - modulating self-attention for feature decoupling (cluster visualizations available)
Demo videos (row 1: class level, instance level, part level, animal instances; row 2: animal instances, human instances, part-level modification).
- [2025/2/25] Our VideoGrain was posted and recommended by Gradio on LinkedIn and Twitter, and also recommended by AK.
- [2025/2/25] Our VideoGrain was submitted by AK to HuggingFace daily papers and ranked #1 paper of the day.
- [2025/2/24] We released our paper on arXiv, along with the code and the full data on Google Drive.
- [2025/1/23] Our paper has been accepted to ICLR 2025! Welcome to watch this repository for the latest updates.
Our method is tested with CUDA 12.1, fp16 (via accelerate), and xformers on a single L40 GPU.
# Step 1: Create and activate Conda environment
conda create -n videograin python==3.10
conda activate videograin
# Step 2: Install PyTorch, CUDA and Xformers
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install --pre -U xformers==0.0.27
# Step 3: Install additional dependencies with pip
pip install -r requirements.txt
`xformers` is recommended to save memory and running time.
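Optionally, you can sanity-check the installation before downloading any weights (a minimal check, assuming the steps above succeeded):

```bash
# verify that PyTorch sees the GPU and xformers imports cleanly
python -c "import torch, xformers; print(torch.__version__, xformers.__version__, torch.cuda.is_available())"
```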
You can download all the base model checkpoints using the following bash command:
## download sd 1.5, controlnet depth/pose v10/v11
bash download_all.sh
Click for ControlNet annotator weights (if you cannot access Hugging Face)
You can download all the annotator checkpoints (such as DW-Pose, depth_zoe, depth_midas, and OpenPose, about 4 GB in total) from Baidu or Google Drive, then extract them into `./annotator/ckpts`.
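For example, assuming the downloaded archive is named `annotator_ckpts.tar.gz` (the actual file name and format on the drive may differ):

```bash
# extract the annotator weights into ./annotator/ckpts (archive name is an assumption)
mkdir -p ./annotator/ckpts
tar -zxvf annotator_ckpts.tar.gz -C ./annotator/ckpts
```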
We provide all the video data and layout masks used by VideoGrain at the following link. Please download and unzip the data, then put it in the `./data` root directory.
gdown --fuzzy 'https://drive.google.com/file/d/1dzdvLnXWeMFR3CE2Ew0Bs06vyFSvnGXA/view?usp=drive_link'
tar -zxvf videograin_data.tar.gz
Prepare video frames: if the input video is an mp4 file, use the following command to split it into frames:
python image_util/sample_video2frames.py --video_path 'your video path' --output_dir './data/video_name/video_name'
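For example, for a hypothetical video `./videos/run_two_man.mp4` (the output folder name should match the video name used in your data config):

```bash
# split a hypothetical run_two_man.mp4 into frames under ./data/run_two_man/run_two_man
python image_util/sample_video2frames.py --video_path './videos/run_two_man.mp4' --output_dir './data/run_two_man/run_two_man'
```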
Prepare layout masks: we segment videos with our ReLER lab's SAM-Track. We suggest using the `app.py` in SAM-Track in gradio mode to manually select which regions of the video you want to edit. We also provide a script, `image_util/process_webui_mask.py`, to convert masks from the SAM-Track format to the VideoGrain format.
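A rough sketch of that workflow; the SAM-Track repository URL is our assumption (the ReLER release of Segment-and-Track-Anything), and you should check `image_util/process_webui_mask.py` for the exact arguments it expects:

```bash
# 1. get SAM-Track and launch its Gradio app to click the regions you want to edit
git clone https://github.com/z-x-yang/Segment-and-Track-Anything.git
cd Segment-and-Track-Anything && python app.py

# 2. back in the VideoGrain root, run image_util/process_webui_mask.py to convert the
#    exported masks into VideoGrain's layout-mask format (check the script for its
#    arguments and output layout before running it)
```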
You can reproduce the instance-level + part-level results in our teaser by running:
bash test.sh
#or
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml
For the other instance-/part-/class-level results on the VideoGrain project page and in the teaser, we provide all the data (video frames and layout masks) and the corresponding configs to reproduce them; check the results in Multi-Grained Video Editing below.
The result is saved at `./result`. (Click for the directory structure.)
result
├── run_two_man
│   ├── control                   # control condition
│   ├── infer_samples
│   ├── input                     # the input video frames
│   ├── masked_video.mp4          # check whether the edit regions are accurately covered
│   ├── sample
│   ├── step_0                    # result image folder
│   ├── step_0.mp4                # result video
│   ├── source_video.mp4          # the input video
│   ├── visualization_denoise     # cross-attention weights
│   └── sd_study                  # clustered inversion features
VideoGrain is a training-free framework. To run VideoGrain on your own video, modify `./config/demo_config.yaml` to your needs (a config sketch follows this list):

- Replace the pretrained model path and ControlNet path in your config. You can change `control_type` to `dwpose`, `depth_zoe`, or `depth` (MiDaS).
- Prepare your video frames and layout masks (edit regions) with SAM-Track or SAM2 and point the dataset config at them.
- Change the `prompt`, and extract each `local prompt` from the editing prompt. The local prompt order should match the layout mask order.
- You can change the flatten resolution with 1->64, 2->16, 4->8 (commonly, flattening at resolution 64 works best).
- To ensure temporal consistency, you can set `use_pnp: True` and `inject_step: 5/10`. (Note: PnP with more than 10 steps hurts multi-region editing.)
- If you want to visualize the cross-attention weights, set `vis_cross_attn: True`.
- If you want to cluster the DDIM inversion spatio-temporal video features, set `cluster_inversion_feature: True`.
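A minimal sketch of such a config, written to a new file so the released `demo_config.yaml` stays untouched. Key names for the model paths and local prompts are assumptions paraphrased from the list above; treat `./config/demo_config.yaml` and the released configs as the authoritative schema.

```bash
# skeleton config (some key names are assumed; compare against ./config/demo_config.yaml)
cat > ./config/my_edit.yaml << 'EOF'
pretrained_model_path: ./ckpt/stable-diffusion-v1-5    # assumed key name / placeholder path
controlnet_path: ./ckpt/sd-controlnet-depth            # assumed key name / placeholder path
control_type: dwpose                                   # or depth_zoe / depth (MiDaS)
prompt: "a Spiderman and a polar bear are running"     # example global editing prompt
local_prompts: ["a Spiderman", "a polar bear"]         # assumed key name; order must match the layout masks
use_pnp: True
inject_step: 5                                         # >10 steps hurts multi-region editing
vis_cross_attn: False                                  # True to dump cross-attention weight visualizations
cluster_inversion_feature: False                       # True to cluster DDIM-inversion features
EOF
```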
bash test.sh
#or
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config /path/to/the/config
You can get the multi-grained definition results (the same video edited at class, instance, and part level) using the following commands:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/running_two_man/man2spider.yaml    # class-level
# --config config/instance_level/running_two_man/4cls_spider_polar.yaml    # instance-level
# --config config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml    # part-level
Results: source video, class level, instance level, part level.
You can get instance-level video editing results, using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/running_two_man/running_3cls_iron_spider.yaml
You can get part-level video editing results, using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/modification/man_text_message/blue_shirt.yaml
Results: source video, blue shirt, black suit; source video, ginger head, ginger body.
Results: source video, superman, superman + cap; source video, superman, superman + sunglasses.
You can get class-level video editing results, using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/wolf/wolf.yaml
Results: input, pig, husky, bear, tiger.
Results: input, Iron Man, Batman + snow court + iced wall; input, Porsche.
You can edit a single instance on its own (left only, right only) or both jointly, using the following commands:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/soely_edit/only_left.yaml
#--config config/instance_level/soely_edit/only_right.yaml
#--config config/instance_level/soely_edit/joint_edit.yaml
Results: source video, left→Iron Man, right→Spiderman, joint edit.
You can visualize the cross-attention weights during editing, using the following command:
# set vis_cross_attn: True in your config
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/running_two_man/3cls_spider_polar_vis_weight.yaml
Results: source video; edited video (left→spiderman, right→polar bear, trees→cherry blossoms); spiderman weight; bear weight; cherry weight.
If you think this project is helpful, please feel free to leave a star ⭐ and cite our paper:
@article{yang2025videograin,
title={VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing},
author={Yang, Xiangpeng and Zhu, Linchao and Fan, Hehe and Yang, Yi},
journal={arXiv preprint arXiv:2502.17258},
year={2025}
}
Xiangpeng Yang (@knightyxp), email: knightyxp@gmail.com / Xiangpeng.Yang@student.uts.edu.au
- This code builds on diffusers and FateZero. Thanks for open-sourcing!
- We would like to thank AK (@_akhaliq) and the Gradio team for the recommendation!