
VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing (ICLR 2025)

Links: arXiv · HuggingFace Daily Papers Top 1 · Project page · Full Data · Demo Video

Introduction

VideoGrain is a zero-shot method for class-level, instance-level, and part-level video editing.

  • Multi-grained video editing
    • Class-level: editing objects within the same class (previous SOTA methods are limited to this level)
    • Instance-level: editing each individual instance into a distinct object
    • Part-level: adding new objects or modifying existing attributes at the part level
  • Training-free
    • Does not require any training or fine-tuning
  • One-prompt multi-region control and a deep investigation of cross-/self-attention
    • Modulating cross-attention for multi-region control (visualizations available)
    • Modulating self-attention for feature decoupling (clustering visualizations available)
Teaser results: class-level, instance-level, and part-level editing (animal instances, human instances, part-level modification)

πŸ“€ Demo Video

videograin.mp4

πŸ“£ News

  • [2025/2/25] VideoGrain was posted and recommended by Gradio on LinkedIn and Twitter, and recommended by AK.
  • [2025/2/25] VideoGrain was submitted by AK to HuggingFace Daily Papers and ranked as the #1 paper of the day.
  • [2025/2/24] We released our paper on arXiv, along with the code and the full data on Google Drive.
  • [2025/1/23] Our paper was accepted to ICLR 2025! Watch 👀 this repository for the latest updates.

🍻 Setup Environment

Our method is tested with CUDA 12.1, fp16 precision via accelerate, and xformers on a single L40 GPU.

# Step 1: Create and activate Conda environment
conda create -n videograin python==3.10 
conda activate videograin

# Step 2: Install PyTorch, CUDA and Xformers
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install --pre -U xformers==0.0.27
# Step 3: Install additional dependencies with pip
pip install -r requirements.txt

xformers is recommended to reduce memory usage and runtime.
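As an optional sanity check (not part of the official setup), you can verify that PyTorch sees the GPU and that xformers imports correctly:

python -c "import torch, xformers; print(torch.__version__, torch.cuda.is_available(), xformers.__version__)"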

You may download all the base model checkpoints using the following bash command

## download sd 1.5, controlnet depth/pose v10/v11
bash download_all.sh
ControlNet annotator weights (if you cannot access Hugging Face):

You can download all the annotator checkpoints (DW-Pose, depth_zoe, depth_midas, OpenPose, etc., about 4 GB in total) from Baidu or Google Drive, then extract them into ./annotator/ckpts.
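For reference, a minimal extraction sketch, assuming the downloaded archive is named annotator_ckpts.tar.gz (the name is hypothetical; substitute the file you actually downloaded):

# hypothetical archive name -- substitute the file downloaded from Baidu/Google Drive
mkdir -p ./annotator/ckpts
tar -zxvf annotator_ckpts.tar.gz -C ./annotator/ckpts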

⚑️ Prepare all the data

Full VideoGrain Data

We provide all the video data and layout masks used in VideoGrain at the link below. Please download and unzip the data, then put it under the ./data root directory.

gdown --fuzzy https://drive.google.com/file/d/1dzdvLnXWeMFR3CE2Ew0Bs06vyFSvnGXA/view?usp=drive_link    # --fuzzy lets gdown parse the file id from a share link
tar -zxvf videograin_data.tar.gz

Customize Your Own Data

Prepare video frames: if the input video is an mp4 file, use the following command to split it into frames:

python image_util/sample_video2frames.py --video_path 'your video path' --output_dir './data/video_name/video_name'

Prepare layout masks: we segment videos with our ReLER lab's SAM-Track. We suggest using app.py in SAM-Track (gradio mode) to manually select which regions of the video you want to edit. We also provide a script, image_util/process_webui_mask.py, to convert masks from the SAM-Track output path to the VideoGrain path.
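For example, an end-to-end preparation for a hypothetical clip my_video.mp4 could look like this (paths and names are illustrative; check image_util/process_webui_mask.py for its exact arguments):

# 1. split the mp4 into frames (same command as above, with an illustrative video name)
python image_util/sample_video2frames.py --video_path './my_video.mp4' --output_dir './data/my_video/my_video'
# 2. run SAM-Track (app.py, gradio mode) on the video and select the regions you want to edit
# 3. convert the SAM-Track masks into the layout-mask layout VideoGrain expects
#    (see image_util/process_webui_mask.py for its expected input/output arguments)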

πŸ”₯πŸ”₯πŸ”₯ VideoGrain Editing

🎨 Inference

You can reproduce the instance- and part-level results in our teaser by running:

bash test.sh 
#or 
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml

For the other instance-, part-, and class-level results on the VideoGrain project page and in the teaser, we provide all the data (video frames and layout masks) and the corresponding configs; see 🚀 Multi-Grained Video Editing Results below.

The results are saved under ./result (directory structure below):
result
β”œβ”€β”€ run_two_man
β”‚   β”œβ”€β”€ control                         # control conditon 
β”‚   β”œβ”€β”€ infer_samples
β”‚           β”œβ”€β”€ input                   # the input video frames
β”‚           β”œβ”€β”€ masked_video.mp4        # check whether edit regions are accuratedly covered
β”‚   β”œβ”€β”€ sample
β”‚           β”œβ”€β”€ step_0                  # result image folder
β”‚           β”œβ”€β”€ step_0.mp4              # result video
β”‚           β”œβ”€β”€ source_video.mp4        # the input video
β”‚           β”œβ”€β”€ visualization_denoise   # cross attention weight
β”‚           β”œβ”€β”€ sd_study                # cluster inversion feature

Editing guidance for YOUR Video

πŸ”›prepare your config

VideoGrain is a training-free framework. To run VideoGrain on your own video, modify ./config/demo_config.yaml according to your needs (an illustrative config sketch follows the list below):

  1. Replace the pretrained model path and ControlNet path in your config. You can set control_type to dwpose, depth_zoe, or depth (midas).
  2. Prepare your video frames and layout masks (edit regions) with SAM-Track or SAM2, and point the dataset config at them.
  3. Change the prompt and extract each local prompt from the editing prompt. The local prompt order must match the layout mask order.
  4. You can change the flatten resolution with 1->64, 2->16, 4->8. (Flattening at resolution 64 usually works best.)
  5. To improve temporal consistency, you can set use_pnp: True and inject_step: 5 or 10. (Note: more than 10 PnP injection steps degrades multi-region editing.)
  6. To visualize the cross-attention weights, set vis_cross_attn: True.
  7. To cluster the DDIM inversion spatio-temporal video features, set cluster_inversion_feature: True.
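The sketch below is a minimal, illustrative config assembled from the options above; the key names and structure are assumptions, so treat ./config/demo_config.yaml as the authoritative template:

# illustrative sketch only -- mirror the real structure of ./config/demo_config.yaml
pretrained_model_path: ./ckpt/stable-diffusion-v1-5      # step 1 (hypothetical path)
controlnet_path: ./ckpt/sd-controlnet-depth              # step 1 (hypothetical path)
control_type: dwpose                                     # or depth_zoe / depth (midas)
# step 2: frames and layout masks prepared with SAM-Track or SAM2 (hypothetical key names)
video_path: ./data/run_two_man/run_two_man
layout_mask_dir: ./data/run_two_man/layout_masks
# step 3: global editing prompt plus local prompts, ordered like the layout masks (illustrative)
prompt: "a Spiderman and a polar bear are running on the road"
local_prompts: ["Spiderman", "polar bear"]
flatten_res: [1]                  # step 4: 1->64 (usually best), 2->16, 4->8 (hypothetical key name)
use_pnp: True                     # step 5: temporal consistency
inject_step: 5                    # keep at 10 or below for multi-region editing
vis_cross_attn: False             # step 6: True to visualize cross-attention weights
cluster_inversion_feature: False  # step 7: True to cluster DDIM inversion features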

😍Editing your video

bash test.sh 
#or 
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config  /path/to/the/config

πŸš€Multi-Grained Video Editing Results

🌈 Multi-Grained Definition

You can reproduce the multi-grained definition results using the following commands:

CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/running_two_man/man2spider.yaml    # class-level
                                                # --config config/instance_level/running_two_man/4cls_spider_polar.yaml    # instance-level
                                                # --config config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml    # part-level
Results: source video | class level | instance level | part level

πŸ’ƒ Instance-level Video Editing

You can reproduce the instance-level video editing results using the following command:

CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config  config/instance_level/running_two_man/running_3cls_iron_spider.yaml
Other instance-level examples (config names as shown in the gallery):
  • running_two_man/3cls_iron_spider.yaml
  • 2_monkeys/2cls_teddy_bear_koala.yaml
  • badminton/2cls_wonder_woman_spiderman.yaml
  • soap-box/soap-box.yaml
  • 2_cats/4cls_panda_vs_poddle.yaml
  • 2_cars/left_firetruck_right_bus.yaml

πŸ•Ί Part-level Video Editing

You can reproduce the part-level video editing results using the following command:

CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/modification/man_text_message/blue_shirt.yaml
Results: blue shirt → black suit; ginger head / ginger body; superman + cap; superman + sunglasses

πŸ₯³ Class-level Video Editing

You can reproduce the class-level video editing results using the following command:

CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/wolf/wolf.yaml
Results: pig → husky / bear / tiger; iron man → Batman (+ snow court + iced wall); Porsche

Solely edit specific subjects, keeping the background unchanged

You can edit only specific subjects while keeping the background unchanged, using the following commands:

CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/soely_edit/only_left.yaml
                                                #--config config/instance_level/soely_edit/only_right.yaml
                                                #--config config/instance_level/soely_edit/joint_edit.yaml
Results: source video; left → Iron Man; right → Spiderman; joint edit

πŸ” Visualize Cross Attention Weight

You can visualize the cross-attention weights during editing using the following command:

# set vis_cross_attn: True in your config
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/running_two_man/3cls_spider_polar_vis_weight.yaml
Results: source video; left → Spiderman, right → polar bear, trees → cherry blossoms (attention weight maps for spiderman, bear, and cherry)

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

@article{yang2025videograin,
  title={VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing},
  author={Yang, Xiangpeng and Zhu, Linchao and Fan, Hehe and Yang, Yi},
  journal={arXiv preprint arXiv:2502.17258},
  year={2025}
}

πŸ“ž Contact Authors

Xiangpeng Yang (@knightyxp), email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au

✨ Acknowledgements

⭐️ Star History

Star History Chart
