https://github.com/Picsart-AI-Research/Text2Video-Zero
Text2Video-Zero

This repository is the official implementation of Text2Video-Zero.

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Paper | Video | Hugging Face Spaces

[teaser_final]
Our method Text2Video-Zero enables zero-shot video generation using (i) a textual prompt (see rows 1, 2), (ii) a prompt combined with guidance from poses or edges (see lower right), and (iii) Video Instruct-Pix2Pix, i.e., instruction-guided video editing (see lower left). Results are temporally consistent and follow closely the guidance and textual prompts.

News

* [03/23/2023] Paper Text2Video-Zero released!
* [03/25/2023] The first version of our huggingface demo (containing zero-shot text-to-video generation and Video Instruct Pix2Pix) released!
* [03/27/2023] The full version of our huggingface demo released! Now also included: text and pose conditional video generation, text and canny-edge conditional video generation, and text, canny-edge and dreambooth conditional video generation.
* [03/28/2023] Code for all our generation methods released! We added a new low-memory setup. Minimum required GPU VRAM is currently 12 GB. It will be further reduced in the upcoming releases.
* [03/29/2023] Improved Huggingface demo! (i) For text-to-video generation, any base model for stable diffusion hosted on huggingface can now be loaded (including dreambooth models!). (ii) The generated videos can have arbitrary length. (iii) We improved the quality of Video Instruct-Pix2Pix. (iv) We added two longer examples for Video Instruct-Pix2Pix.

Contribute

We are on a journey to democratize AI and empower the creativity of everyone, and we believe Text2Video-Zero is a great research direction to unleash the zero-shot video generation and editing capacity of the amazing text-to-image models!

To achieve this goal, all contributions are welcome. Please check out these external implementations and extensions of Text2Video-Zero. We thank the authors for their efforts and contributions:

* https://github.com/JiauZhang/Text2Video-Zero
* https://github.com/camenduru/text2video-zero-colab
* https://github.com/SHI-Labs/Text2Video-Zero-sd-webui

Setup

1. Clone this repository and enter it:

    git clone https://github.com/Picsart-AI-Research/Text2Video-Zero.git
    cd Text2Video-Zero/

2. Install the requirements using Python 3.9:

    virtualenv --system-site-packages -p python3.9 venv
    source venv/bin/activate
    pip install -r requirements.txt

Text-To-Video with Edge Guidance and Dreambooth

Integrate an SD1.4 Dreambooth model into ControlNet using this procedure. Load the model into models/control_db/. Dreambooth models can be obtained, for instance, from CIVITAI.
We provide prepared model files derived from CIVITAI for anime (keyword 1girl), arcane style (keyword arcane style), avatar (keyword avatar style), and gta-5 style (keyword gtav style).

Inference API

To run inference, create an instance of the Model class:

    import torch
    from model import Model

    # Load the model on the GPU in half precision
    model = Model(device="cuda", dtype=torch.float16)

---------------------------------------------------------------------
Text-To-Video

To directly call our text-to-video generator, run this Python command, which stores the result at out_path (here ./text2video_A_horse_galloping_on_a_street.mp4):

    prompt = "A horse galloping on a street"
    params = {"t0": 44, "t1": 47, "motion_field_strength_x": 12, "motion_field_strength_y": 12, "video_length": 8}

    # The output file name is derived from the prompt; fps sets the frame rate of the video
    out_path, fps = f"./text2video_{prompt.replace(' ','_')}.mp4", 4
    model.process_text2video(prompt, fps=fps, path=out_path, **params)

Hyperparameters (Optional)

You can define the following hyperparameters:

* Motion field strength: motion_field_strength_x = $\delta_x$ and motion_field_strength_y = $\delta_y$ (see our paper, Sect. 3.3.1). Default: motion_field_strength_x = motion_field_strength_y = 12.
* $T$ and $T'$ (see our paper, Sect. 3.3.1). Define values t0 and t1 in the range {0,...,50}. Default: t0=44, t1=47 (DDIM steps). These correspond to timesteps 881 and 941, respectively.
* Video length: Define the number of frames video_length to be generated. Default: video_length=8.
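The correspondence between DDIM steps and timesteps quoted in the hyperparameter list above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from this repository: it assumes 1000 training timesteps, 50 evenly spaced DDIM steps, and a scheduler offset of 1 (a common Stable Diffusion scheduler configuration), and the helper name `ddim_timestep` is hypothetical.

```python
# Sketch: map DDIM step indices (t0, t1) to diffusion timesteps.
# Assumptions (not taken from this repository): 1000 training timesteps,
# 50 evenly spaced DDIM steps, and a scheduler offset of 1.
NUM_TRAIN_TIMESTEPS = 1000
NUM_DDIM_STEPS = 50
STEPS_OFFSET = 1


def ddim_timestep(step_index: int) -> int:
    """Return the timestep for a DDIM step index (larger index = noisier latent)."""
    stride = NUM_TRAIN_TIMESTEPS // NUM_DDIM_STEPS  # 20 timesteps per DDIM step
    return step_index * stride + STEPS_OFFSET


if __name__ == "__main__":
    # Defaults from the hyperparameter list: t0=44, t1=47
    for name, index in {"t0": 44, "t1": 47}.items():
        print(name, index, ddim_timestep(index))
    # prints "t0 44 881" and then "t1 47 941"
```

Under these assumptions the defaults t0=44 and t1=47 land on timesteps 881 and 941, which matches the values stated above.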
---------------------------------------------------------------------
Text-To-Video with Pose Control

To directly call our text-to-video generator with pose control, run this Python command:

    from pathlib import Path

    prompt = 'an astronaut dancing in outer space'
    # Video containing the pose skeletons that guide the motion
    motion_path = '__assets__/poses_skeleton_gifs/dance1_corr.mp4'
    out_path = f"./text2video_pose_guidance_{prompt.replace(' ','_')}.gif"
    model.process_controlnet_pose(motion_path, prompt=prompt, save_path=out_path)

---------------------------------------------------------------------
Text-To-Video with Edge Control

To directly call our text-to-video generator with edge control, run this Python command:

    prompt = 'oil painting of a deer, a high-quality, detailed, and professional photo'
    # Video from which the Canny edges are extracted
    video_path = '__assets__/canny_videos_mp4/deer.mp4'
    out_path = f'./text2video_edge_guidance_{prompt}.mp4'
    model.process_controlnet_canny(video_path, prompt=prompt, save_path=out_path)

Hyperparameters

You can define the following hyperparameters for Canny edge detection:

* Low threshold: Define the value low_threshold in the range $(0, 255)$. Default: low_threshold=100.
* High threshold: Define the value high_threshold in the range $(0, 255)$. Default: high_threshold=200. Make sure that high_threshold > low_threshold.

You can pass these hyperparameters as arguments to model.process_controlnet_canny.

---------------------------------------------------------------------
Text-To-Video with Edge Guidance and Dreambooth specialization

Load a dreambooth model, then proceed as described in Text-To-Video with Edge Guidance:

    prompt = 'your prompt'
    video_path = 'path/to/your/video'
    dreambooth_model_path = 'path/to/your/dreambooth/model'
    out_path = f'./text2video_edge_db_{prompt}.gif'
    model.process_controlnet_canny_db(dreambooth_model_path, video_path, prompt=prompt, save_path=out_path)

The value video_path can be the path to an mp4 file.
To use one of the example videos provided, set video_path="woman1", video_path="woman2", video_path="woman3", or video_path="man1".

The value dreambooth_model_path can either be a link to a diffuser model file, or the name of one of the dreambooth models provided. To use a provided model, set dreambooth_model_path = "Anime DB", dreambooth_model_path = "Avatar DB", dreambooth_model_path = "GTA-5 DB", or dreambooth_model_path = "Arcane DB". The corresponding keywords are: 1girl (for Anime DB), arcane style (for Arcane DB), avatar style (for Avatar DB), and gta-5 style (for GTA-5 DB).

If the model file is not in diffuser format, it must be converted.

---------------------------------------------------------------------
Video Instruct-Pix2Pix

To perform pix2pix video editing, run this Python command:

    prompt = 'make it Van Gogh Starry Night'
    # Input video to be edited according to the instruction in the prompt
    video_path = '__assets__/pix2pix video/camel.mp4'
    out_path = f'./video_instruct_pix2pix_{prompt}.mp4'
    model.process_pix2pix(video_path, prompt=prompt, save_path=out_path)

---------------------------------------------------------------------
Low Memory Inference

Each of the interfaces introduced above can be run in a low-memory setup. In the minimal setup, a GPU with 12 GB VRAM is sufficient.

To reduce the memory usage, add chunk_size=k as an additional parameter when calling one of the inference APIs defined above. The integer value k must be in the range {2,...,video_length}. It defines the number of frames that are processed at once (without any loss in quality). The lower the value, the less memory is needed.

When using the gradio app, set chunk_size in the Advanced options.

We plan to release a new version soon that further reduces the memory usage.

---------------------------------------------------------------------
Ablation Study

To replicate the ablation study, add additional parameters when calling the inference APIs defined above.

* To deactivate cross-frame attention: Add use_cf_attn=False to the parameter list.
* To deactivate enriching latent codes with motion dynamics: Add use_motion_field=False to the parameter list.

Note: Adding smooth_bg=True activates background smoothing. However, our code does not include the salient object detector necessary to run that code.

---------------------------------------------------------------------
Inference using Gradio

From the project root folder, run this shell command:

    python app.py

Then access the app locally with a browser.

To access the app remotely, run this shell command:

    python app.py --public_access

For security information about public access, we refer to the gradio documentation [https://gradio.app/sharing-your-app/#security-and-file-access].

Results

Text-To-Video

* [cat_runnin] "A cat is running on the grass"
* [playing] "A panda is playing guitar on times square"
* [running] "A man is running in the snow"
* [skii] "An astronaut is skiing down the hill"
* [panda_surf] "A panda surfing on a wakeboard"
* [bear_danci] "A bear dancing on times square"
* [bicycle] "A man is riding a bicycle in the sunshine"
* [horse_gall] "A horse galloping on a street"
* [tiger_walk] "A tiger walking alone down the street"
* [panda_surf] "A panda surfing on a wakeboard"
* [horse_gall] "A horse galloping on a street"
* [cat_walkin] "A cute cat running in a beautiful meadow"
* [horse_gall] "A horse galloping on a street"
* [panda_walk] "A panda walking alone down the street"
* [dog_walkin] "A dog is walking down the street"
* [astronaut] "An astronaut is waving his hands on the moon"

Text-To-Video with Pose Guidance

[img_bot_le] [img_bot_ri] [img_top_le] [img_top_ri] [pose_bot_l] [pose_bot_r] [pose_top_l] [pose_top_r]

* "A bear dancing on the concrete"
* "An alien dancing under a flying saucer"
* "A panda dancing in Antarctica"
* "An astronaut dancing in the outer space"

Text-To-Video with Edge Guidance

[butterfly] [head][head_edge] [jelly] [mask][mask_edge] [butterfly_] [jelly_edge]

* "White butterfly"
* "Beautiful girl"
* "A jellyfish"
* "beautiful girl halloween style"
[fox][fix_edge] [head_2] [santa] [dear][dear_edge] [head_2_edg] [santa_edge]

* "Wild fox is walking"
* "Oil painting of a beautiful girl close-up"
* "A santa claus"
* "A deer"

Text-To-Video with Edge Guidance and Dreambooth specialization

[anime_styl] [arcane_sty] [gta-5_man_] [img_bot_ri] [anime_edge] [arcane_edg] [gta-5_man_] [edge_bot_r]

* "anime style"
* "arcane style"
* "gta-5 man"
* "avatar style"

Video Instruct Pix2Pix

[up_left][bot_left] [up_mid][bot_mid] [up_right][bot_right]

* "Replace man with chimpanzee"
* "Make it Van Gogh Starry Night style"
* "Make it Picasso style"

[up_left][bot_left] [up_mid][bot_mid] [up_right][bot_right]

* "Make it Expressionism style"
* "Make it night"
* "Make it autumn"

License

Our code is published under the CreativeML Open RAIL-M license. The license provided in this repository applies to all additions and contributions we make upon the original stable diffusion code. The original stable diffusion code is under the CreativeML Open RAIL-M license, which can be found here.

BibTeX

If you use our work in your research, please cite our publication:

    @article{text2video-zero,
        title={Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators},
        author={Khachatryan, Levon and Movsisyan, Andranik and Tadevosyan, Vahram and Henschel, Roberto and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
        journal={arXiv preprint arXiv:2303.13439},
        year={2023}
    }