https://github.com/Picsart-AI-Research/Text2Video-Zero

Picsart-AI-Research / Text2Video-Zero Public

  * Notifications
  * Fork 64
  * Star 1.6k

Text-to-Image Diffusion Models are Zero-Shot Video Generators

Latest commit

18879e5, Mar 29, 2023, Roberto: more information about installation and app added

Git stats

  * 28 commits

Files

Name                              Latest commit message                              Commit time
__assets__                        Release of the entire code                         March 28, 2023 17:17
annotator                         Release of the entire code                         March 28, 2023 17:17
text_to_video                     dilation loaded                                    March 29, 2023 02:21
.gitignore                        Release of the entire code                         March 28, 2023 17:17
LICENSE                           license added                                      March 28, 2023 17:29
README.md                         more information about installation and app added  March 29, 2023 13:19
app.py                            more information about installation and app added  March 29, 2023 13:19
app_canny.py                      chunk size can now be set also in the gradio app   March 29, 2023 03:28
app_canny_db.py                   chunk size can now be set also in the gradio app   March 29, 2023 03:28
app_pix2pix_video.py              chunk size can now be set also in the gradio app   March 29, 2023 03:28
app_pose.py                       chunk size can now be set also in the gradio app   March 29, 2023 03:28
app_text_to_video.py              chunk size can now be set also in the gradio app   March 29, 2023 03:28
config.py                         Release of the entire code                         March 28, 2023 17:17
environment.yaml                  Release of the entire code                         March 28, 2023 17:17
gradio_utils.py                   new contribute section, cleanup of examples        March 29, 2023 09:37
model.py                          chunk size can now be set also in the gradio app   March 29, 2023 03:28
requirements.txt                  more information about installation and app added  March 29, 2023 13:19
share.py                          Release of the entire code                         March 28, 2023 17:17
text_to_video_generator_canny.py  Release of the entire code                         March 28, 2023 17:17
text_to_video_generator_pose.py   Release of the entire code                         March 28, 2023 17:17
utils.py                          Release of the entire code                         March 28, 2023 17:17

README.md

 Text2Video-Zero

This repository is the official implementation of Text2Video-Zero.

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto
Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Paper | Video | Hugging Face Spaces

[teaser_final]
Our method Text2Video-Zero enables zero-shot video generation using
(i) a textual prompt (see rows 1, 2), (ii) a prompt combined with
guidance from poses or edges (see lower right), and (iii) Video
Instruct-Pix2Pix, i.e., instruction-guided video editing (see lower
left). Results are temporally consistent and closely follow the
guidance and textual prompts.

 News

  * [03/23/2023] Paper Text2Video-Zero released!
  * [03/25/2023] The first version of our huggingface demo
    (containing zero-shot text-to-video generation and Video Instruct
    Pix2Pix) released!
  * [03/27/2023] The full version of our huggingface demo released!
    Now also included: text and pose conditional video generation,
    text and canny-edge conditional video generation, and text,
    canny-edge and dreambooth conditional video generation.
  * [03/28/2023] Code for all our generation methods released! We
    added a new low-memory setup. Minimum required GPU VRAM is
    currently 12 GB. It will be further reduced in the upcoming
    releases.
  * [03/29/2023] Improved Huggingface demo! (i) For text-to-video
    generation, any base model for stable diffusion hosted on
    huggingface can now be loaded (including dreambooth models!).
    (ii) The generated videos can have arbitrary length. (iii) We
    improved the quality of Video Instruct-Pix2Pix. (iv) We added two
    longer examples for Video Instruct-Pix2Pix.

 Contribute

We are on a journey to democratize AI and empower the creativity of
everyone, and we believe Text2Video-Zero is a great research
direction to unleash the zero-shot video generation and editing
capacity of the amazing text-to-image models!

To achieve this goal, all contributions are welcome. Please check out
these external implementations and extensions of Text2Video-Zero. We
thank the authors for their efforts and contributions:

  * https://github.com/JiauZhang/Text2Video-Zero
  * https://github.com/camenduru/text2video-zero-colab
  * https://github.com/SHI-Labs/Text2Video-Zero-sd-webui

 Setup

 1. Clone this repository and enter its directory:

git clone https://github.com/Picsart-AI-Research/Text2Video-Zero.git
cd Text2Video-Zero/

 2. Install the requirements using Python 3.9:

virtualenv --system-site-packages -p python3.9 venv
source venv/bin/activate
pip install -r requirements.txt

 Text-To-Video with Edge Guidance and Dreambooth

Integrate an SD1.4 Dreambooth model into ControlNet using this
procedure. Load the model into models/control_db/. Dreambooth models
can be obtained, for instance, from CIVITAI.

We provide prepared model files derived from CIVITAI for anime
(keyword 1girl), arcane style (keyword arcane style), avatar
(keyword avatar style), and gta-5 style (keyword gtav style).

 Inference API

To run inference, create an instance of the Model class:

import torch
from model import Model

model = Model(device = "cuda", dtype = torch.float16)

---------------------------------------------------------------------

 Text-To-Video

To directly call our text-to-video generator, run this Python command,
which stores the result in
./text2video_A_horse_galloping_on_a_street.mp4:

prompt = "A horse galloping on a street"
params = {"t0": 44, "t1": 47 , "motion_field_strength_x" : 12, "motion_field_strength_y" : 12, "video_length": 8}

out_path, fps = f"./text2video_{prompt.replace(' ','_')}.mp4", 4
model.process_text2video(prompt, fps = fps, path = out_path, **params)

 Hyperparameters (Optional)

You can define the following hyperparameters:

  * Motion field strength: motion_field_strength_x = $\delta_x$ and
    motion_field_strength_y = $\delta_y$ (see our paper, Sect.
    3.3.1). Default:
    motion_field_strength_x=motion_field_strength_y=12.
  * $T$ and $T'$ (see our paper, Sect. 3.3.1). Define values t0 and
    t1 in the range {0,...,50}. Default: t0=44, t1=47 (DDIM steps).
    Corresponds to timesteps 881 and 941, respectively.
  * Video length: Define the number of frames video_length to be
    generated. Default: video_length=8.
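
The correspondence between DDIM step indices and timesteps quoted above (44 and 47 map to 881 and 941) can be reproduced with a small sketch. It assumes 1000 training timesteps, 50 evenly spaced DDIM steps, and a step offset of 1, which is a common Stable Diffusion scheduler configuration. These values and the function name are assumptions for illustration and are not taken from the repository code.

```python
def ddim_step_to_timestep(step, num_inference_steps=50,
                          num_train_timesteps=1000, offset=1):
    """Map a DDIM step index to a diffusion timestep.

    Assumes evenly spaced steps plus a constant offset (illustrative only).
    """
    stride = num_train_timesteps // num_inference_steps  # 1000 // 50 = 20
    return step * stride + offset


print(ddim_step_to_timestep(44))  # 881, the default t0
print(ddim_step_to_timestep(47))  # 941, the default t1
```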

---------------------------------------------------------------------

 Text-To-Video with Pose Control

To directly call our text-to-video generator with pose control, run
this python command:

from pathlib import Path

prompt = 'an astronaut dancing in outer space'
motion_path = '__assets__/poses_skeleton_gifs/dance1_corr.mp4'
out_path = f"./text2video_pose_guidance_{prompt.replace(' ','_')}.gif"
model.process_controlnet_pose(motion_path, prompt=prompt, save_path=out_path)

---------------------------------------------------------------------

 Text-To-Video with Edge Control

To directly call our text-to-video generator with edge control, run
this python command:

prompt = 'oil painting of a deer, a high-quality, detailed, and professional photo'
video_path = '__assets__/canny_videos_mp4/deer.mp4'
out_path = f'./text2video_edge_guidance_{prompt}.mp4'
model.process_controlnet_canny(video_path, prompt=prompt, save_path=out_path)
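
The prompt above contains spaces, commas, and hyphens, all of which end up in the output file name. If that is inconvenient, the prompt can be turned into a file-system-friendly name first. The helper below is a hypothetical sketch and not part of the repository.

```python
import re


def prompt_to_filename(prompt, prefix="text2video_edge_guidance", ext="mp4"):
    """Build an output path from a prompt (hypothetical helper).

    Every run of non-alphanumeric characters becomes a single underscore.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "_", prompt).strip("_")
    return f"./{prefix}_{slug}.{ext}"


prompt = 'oil painting of a deer, a high-quality, detailed, and professional photo'
out_path = prompt_to_filename(prompt)
print(out_path)
```

The resulting out_path can be passed as save_path in the call shown above.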

 Hyperparameters

You can define the following hyperparameters for Canny edge
detection:

  * low threshold. Define value low_threshold in the range
    $(0, 255)$. Default: low_threshold=100.
  * high threshold. Define value high_threshold in the range
    $(0, 255)$. Default: high_threshold=200. Make sure that
    high_threshold > low_threshold.

You can pass these hyperparameters as arguments to
model.process_controlnet_canny.
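
As a minimal sketch, the thresholds can be checked against the constraints listed above before they are passed on. The helper canny_kwargs is hypothetical, and the range is read as the open interval (0, 255). Only the parameter names low_threshold and high_threshold come from this README.

```python
def canny_kwargs(low_threshold=100, high_threshold=200):
    """Validate Canny thresholds and return them as keyword arguments.

    Hypothetical helper; the constraints follow the list above.
    """
    if not (0 < low_threshold < 255 and 0 < high_threshold < 255):
        raise ValueError("thresholds must lie in the range (0, 255)")
    if high_threshold <= low_threshold:
        raise ValueError("high_threshold must be greater than low_threshold")
    return {"low_threshold": low_threshold, "high_threshold": high_threshold}


kwargs = canny_kwargs(low_threshold=50, high_threshold=150)
print(kwargs)

# With a Model instance created as in the Inference API section:
# model.process_controlnet_canny(video_path, prompt=prompt,
#                                save_path=out_path, **kwargs)
```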

---------------------------------------------------------------------

 Text-To-Video with Edge Guidance and Dreambooth specialization

Load a dreambooth model, then proceed as described in Text-To-Video
with Edge Control:

prompt = 'your prompt'
video_path = 'path/to/your/video'
dreambooth_model_path = 'path/to/your/dreambooth/model'
out_path = f'./text2video_edge_db_{prompt}.gif'
model.process_controlnet_canny_db(dreambooth_model_path, video_path, prompt=prompt, save_path=out_path)

The value video_path can be the path to an mp4 file. To use one of the
example videos provided, set video_path="woman1",
video_path="woman2", video_path="woman3", or video_path="man1".

The value dreambooth_model_path can either be a link to a diffusers
model file or the name of one of the dreambooth models provided. For
the latter, set dreambooth_model_path = "Anime DB",
dreambooth_model_path = "Avatar DB",
dreambooth_model_path = "GTA-5 DB", or
dreambooth_model_path = "Arcane DB". The corresponding
keywords are: 1girl (for Anime DB), arcane style (for Arcane DB),
avatar style (for Avatar DB), and gta-5 style (for GTA-5 DB).

If the model file is not in diffusers format, it must be converted.
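
The model names and keywords listed above can be kept in a small lookup table. The sketch below assumes that the keyword is meant to appear in the prompt, as is typical for Dreambooth trigger words. The helper add_keyword is hypothetical and not part of the repository.

```python
# Keywords as listed in the paragraph above.
DB_KEYWORDS = {
    "Anime DB": "1girl",
    "Arcane DB": "arcane style",
    "Avatar DB": "avatar style",
    "GTA-5 DB": "gta-5 style",
}


def add_keyword(prompt, dreambooth_model_path):
    """Append the model's keyword to the prompt if it is not already there.

    Hypothetical helper; unknown model paths leave the prompt unchanged.
    """
    keyword = DB_KEYWORDS.get(dreambooth_model_path)
    if keyword is None or keyword in prompt:
        return prompt
    return f"{prompt}, {keyword}"


print(add_keyword("a man walking down the street", "GTA-5 DB"))
```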

---------------------------------------------------------------------

 Video Instruct-Pix2Pix

To perform pix2pix video editing, run this python command:

prompt = 'make it Van Gogh Starry Night'
video_path = '__assets__/pix2pix video/camel.mp4'
out_path = f'./video_instruct_pix2pix_{prompt}.mp4'
model.process_pix2pix(video_path, prompt=prompt, save_path=out_path)

---------------------------------------------------------------------

 Low Memory Inference

Each of the interfaces introduced above can be run in a low-memory
setup. In the minimal setup, a GPU with 12 GB VRAM is sufficient.

To reduce memory usage, add chunk_size=k as an additional parameter
when calling one of the inference APIs defined above. The integer
value k must be in the range {2,...,video_length}. It defines the
number of frames that are processed at once (without any loss in
quality). The lower the value, the less memory is needed.
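
This README does not describe how the frames are split into chunks. The sketch below shows one plausible scheme, assuming that every chunk carries frame 0 along as an anchor so that later frames can still attend to it. Under that assumption a chunk of size k holds the anchor plus at most k - 1 other frames, which would fit the lower bound of 2. The function plan_chunks is hypothetical and is not the repository's implementation.

```python
def plan_chunks(video_length, chunk_size):
    """Split frame indices into chunks that each start with anchor frame 0.

    Illustrative assumption only; not taken from the repository code.
    """
    if not 2 <= chunk_size <= video_length:
        raise ValueError("chunk_size must be in {2,...,video_length}")
    step = chunk_size - 1  # frames per chunk besides the anchor
    chunks = []
    for start in range(0, video_length, step):
        end = min(start + step, video_length)
        chunks.append([0] + list(range(start, end)))
    return chunks


for chunk in plan_chunks(video_length=8, chunk_size=4):
    print(chunk)
```

In this scheme the anchor's output would be dropped from every chunk, so each frame is produced exactly once.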

When using the gradio app, set chunk_size in the Advanced options.

We plan to release a new version soon that further reduces memory
usage.

---------------------------------------------------------------------

 Ablation Study

To replicate the ablation study, add additional parameters when
calling the above defined inference APIs.

  * To deactivate cross-frame attention: Add use_cf_attn=False to the
    parameter list.
  * To deactivate enriching latent codes with motion dynamics: Add
    use_motion_field=False to the parameter list.

Note: Adding smooth_bg=True activates background smoothing. However,
our code does not include the salient object detector necessary to
run that code.
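
The ablation switches can be collected into named configurations and merged into the usual parameters. The sketch below is illustrative: the table, the helper, and the "neither" variant are assumptions. Only the parameter names use_cf_attn and use_motion_field come from this README.

```python
# Named ablation variants; "neither" is an extra combination added here.
ABLATIONS = {
    "full method": {},
    "no cross-frame attention": {"use_cf_attn": False},
    "no motion dynamics": {"use_motion_field": False},
    "neither": {"use_cf_attn": False, "use_motion_field": False},
}


def ablation_params(base_params, name):
    """Return a copy of base_params with the ablation switches applied."""
    return {**base_params, **ABLATIONS[name]}


base = {"t0": 44, "t1": 47, "video_length": 8}
for name in ABLATIONS:
    params = ablation_params(base, name)
    print(name, params)
    # With a Model instance created as in the Inference API section:
    # model.process_text2video(prompt, fps=4, path=out_path, **params)
```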

---------------------------------------------------------------------

 Inference using Gradio

From the project root folder, run this shell command:

python app.py

Then access the app locally with a browser.

To access the app remotely, run this shell command:

python app.py --public_access

For security information about public access, we refer to the
documentation of gradio
[https://gradio.app/sharing-your-app/#security-and-file-access].
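
For readers who want to wire a similar switch into their own Gradio script, the sketch below shows how a --public_access flag could be parsed and forwarded to Gradio's share option. It is an assumption about one possible setup and does not reproduce the repository's app.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --public_access defaults to False."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--public_access", action="store_true",
                        help="create a publicly reachable link")
    return parser.parse_args(argv)


args = parse_args([])  # pass no list (None) to read sys.argv instead
print(args.public_access)

# In a Gradio script, the flag could then be forwarded like this:
# demo.launch(share=args.public_access)
```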

 Results

 Text-To-Video

Generated videos for the following prompts:

  * "A cat is running on the grass"
  * "A panda is playing guitar on times square"
  * "A man is running in the snow"
  * "An astronaut is skiing down the hill"
  * "A panda surfing on a wakeboard"
  * "A bear dancing on times square"
  * "A man is riding a bicycle in the sunshine"
  * "A horse galloping on a street"
  * "A tiger walking alone down the street"
  * "A panda surfing on a wakeboard"
  * "A horse galloping on a street"
  * "A cute cat running in a beautiful meadow"
  * "A horse galloping on a street"
  * "A panda walking alone down the street"
  * "A dog is walking down the street"
  * "An astronaut is waving his hands on the moon"

 Text-To-Video with Pose Guidance

Generated videos, each shown with its pose guidance, for the
following prompts:

  * "A bear dancing on the concrete"
  * "An alien dancing under a flying saucer"
  * "A panda dancing in Antarctica"
  * "An astronaut dancing in the outer space"

 Text-To-Video with Edge Guidance

Generated videos, each shown with its edge guidance, for the
following prompts:

  * "White butterfly"
  * "Beautiful girl"
  * "A jellyfish"
  * "beautiful girl halloween style"
  * "Wild fox is walking"
  * "Oil painting of a beautiful girl close-up"
  * "A santa claus"
  * "A deer"

 Text-To-Video with Edge Guidance and Dreambooth specialization

Generated videos, each shown with its edge guidance, for the
following prompts:

  * "anime style"
  * "arcane style"
  * "gta-5 man"
  * "avatar style"

 Video Instruct Pix2Pix

Input and edited videos for the following instructions:

  * "Replace man with chimpanzee"
  * "Make it Van Gogh Starry Night style"
  * "Make it Picasso style"
  * "Make it Expressionism style"
  * "Make it night"
  * "Make it autumn"

 License

Our code is published under the CreativeML Open RAIL-M license. The
license provided in this repository applies to all additions and
contributions we make upon the original stable diffusion code. The
original stable diffusion code is under the CreativeML Open RAIL-M
license, which can be found here.

 BibTeX

If you use our work in your research, please cite our publication:

@article{text2video-zero,
    title={Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators},
    author={Khachatryan, Levon and Movsisyan, Andranik and Tadevosyan, Vahram and Henschel, Roberto and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
    journal={arXiv preprint arXiv:2303.13439},
    year={2023}
}


Stars

1.6k stars

Watchers

46 watching

Forks

64 forks

Releases

No releases published

Packages 0

No packages published

Contributors 4

  * @rob-hen rob-hen
  * @honghuis honghuis Humphrey Shi
  * @mickelliu mickelliu mickelliu
  * @levon-khachatryan levon-khachatryan

Languages

  * Python 100.0%
