# Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Mingwang Xu^1\* Hui Li^1\* Qingkun Su^1\* Hanlin Shang^1 Liwei Zhang^1 Ce Liu^3 Jingdong Wang^2 Yao Yao^4 Siyu Zhu^1

^1 Fudan University ^2 Baidu Inc ^3 ETH Zurich ^4 Nanjing University

Project page: fudan-generative-vision.github.io/hallo/

License: MIT

## Showcase

head.mp4

## Framework

abstract framework

## News

* 2024/06/15: Released the first version on GitHub.
* 2024/06/15: Released sample images and audio for inference testing on Huggingface.
## Installation

* System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
* Tested GPUs: A100

Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```

Install packages with pip:

```bash
pip install -r requirements.txt
pip install .
```

In addition, ffmpeg is also required:

```bash
apt-get install ffmpeg
```

## Inference

The inference entrypoint script is `scripts/inference.py`. Before testing your own cases, complete the following preparations:

1. Download all required pretrained models.
2. Prepare source image and driving audio pairs.
3. Run inference.

### Download pretrained models

You can easily get all pretrained models required by inference from our HuggingFace repo.

Clone the pretrained models into the `${PROJECT_ROOT}/pretrained_models` directory with the commands below (a Python alternative using `huggingface_hub` is sketched after this section):

```bash
git lfs install
git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models
```

Or you can download them separately from their source repos:

* hallo: our checkpoints, consisting of the denoising UNet, face locator, and image & audio projection.
* audio_separator: Kim_Vocal_2 MDX-Net vocal removal model. (Thanks to KimberleyJensen)
* insightface: 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/`. (Thanks to deepinsight)
* face landmarker: face detection & mesh model from mediapipe, placed into `pretrained_models/face_analysis/models`.
* motion module: motion module from AnimateDiff. (Thanks to guoyww)
* sd-vae-ft-mse: weights intended to be used with the diffusers library. (Thanks to stabilityai)
* StableDiffusion V1.5: initialized and fine-tuned from Stable-Diffusion-v1-2. (Thanks to runwayml)
* wav2vec: wav-audio-to-vector model from Facebook.

Finally, these pretrained models should be organized as follows (a short script to sanity-check this layout appears after the Run inference section):

```text
./pretrained_models/
|-- audio_separator/
|   `-- Kim_Vocal_2.onnx
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   |-- feature_extractor/
|   |   `-- preprocessor_config.json
|   |-- model_index.json
|   |-- unet/
|   |   |-- config.json
|   |   `-- diffusion_pytorch_model.safetensors
|   `-- v1-inference.yaml
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```

### Prepare inference data

Hallo has a few simple requirements for input data.

For the source image:

1. It should be cropped into a square.
2. The face should be the main focus, making up 50%-70% of the image.
3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).

For the driving audio:

1. It must be in WAV format.
2. It must be in English, since our training datasets are only in this language.
3. Ensure the vocals are clear; background music is acceptable.

We have provided some samples for your reference. (A preprocessing sketch that meets these requirements also follows the next section.)

### Run inference

Simply run `scripts/inference.py` and pass `source_image` and `driving_audio` as input:

```bash
python scripts/inference.py --source_image examples/source_images/1.jpg --driving_audio examples/driving_audios/1.wav
```

Animation results will be saved as `${PROJECT_ROOT}/.cache/output.mp4` by default. You can pass `--output` to specify the output file name. You can find more examples for inference in the `examples` folder.
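If you prefer a Python route over `git lfs`, the same HuggingFace repository from the download section above can be fetched with the `huggingface_hub` client. A minimal sketch, assuming `huggingface_hub` is installed:

```python
# Fetch the pretrained models with huggingface_hub instead of git lfs.
# Repo id and target directory match the git clone command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fudan-generative-ai/hallo",
    local_dir="pretrained_models",
)
```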
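To confirm the download ended up organized as in the tree above, a small check script can help. The list below mirrors a few key entries from that tree and is illustrative, not exhaustive:

```python
# Sanity-check a few key files from the pretrained_models tree above.
from pathlib import Path

EXPECTED = [
    "audio_separator/Kim_Vocal_2.onnx",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "motion_module/mm_sd_v15_v2.ckpt",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

root = Path("pretrained_models")
missing = [p for p in EXPECTED if not (root / p).exists()]
print("all key models present" if not missing else f"missing: {missing}")
```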
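The input requirements listed under "Prepare inference data" can usually be met with standard tooling. The sketch below, assuming Pillow and ffmpeg are installed, center-crops an image to a square and converts audio to mono 16 kHz WAV (a common input format for wav2vec2 models); the crop does not detect the face, so recrop manually if the face is off-center. File names are placeholders:

```python
# Naive preprocessing for Hallo inputs: square-crop the source image
# and convert the driving audio to WAV. Assumes Pillow + ffmpeg.
import subprocess
from PIL import Image

def square_crop(src: str, dst: str) -> None:
    # Center-crop to a square; does NOT detect or center the face.
    img = Image.open(src).convert("RGB")
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img.crop((left, top, left + side, top + side)).save(dst)

def to_wav(src: str, dst: str) -> None:
    # -ac 1 / -ar 16000: mono, 16 kHz, as wav2vec2 commonly expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
        check=True,
    )

square_crop("my_photo.png", "my_photo_square.jpg")  # placeholder names
to_wav("my_speech.mp3", "my_speech.wav")
```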
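Because inference is a single CLI call, processing several image/audio pairs is just a loop around the command above. A sketch, using the bundled example files and the optional weight flags documented below:

```python
# Batch inference: invoke scripts/inference.py once per input pair.
import subprocess
from pathlib import Path

pairs = [
    ("examples/source_images/1.jpg", "examples/driving_audios/1.wav"),
    # add more (source_image, driving_audio) pairs here
]

for image, audio in pairs:
    output = f".cache/{Path(image).stem}_{Path(audio).stem}.mp4"
    subprocess.run(
        [
            "python", "scripts/inference.py",
            "--source_image", image,
            "--driving_audio", audio,
            "--output", output,
            # optional, see the option list below:
            # "--pose_weight", "1.0", "--face_weight", "1.0", "--lip_weight", "1.0",
        ],
        check=True,
    )
```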
For more options:

```text
usage: inference.py [-h] [-c CONFIG] [--source_image SOURCE_IMAGE]
                    [--driving_audio DRIVING_AUDIO] [--output OUTPUT]
                    [--pose_weight POSE_WEIGHT] [--face_weight FACE_WEIGHT]
                    [--lip_weight LIP_WEIGHT]
                    [--face_expand_ratio FACE_EXPAND_RATIO]

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
  --source_image SOURCE_IMAGE
                        source image
  --driving_audio DRIVING_AUDIO
                        driving audio
  --output OUTPUT       output video file name
  --pose_weight POSE_WEIGHT
                        weight of pose
  --face_weight FACE_WEIGHT
                        weight of face
  --lip_weight LIP_WEIGHT
                        weight of lip
  --face_expand_ratio FACE_EXPAND_RATIO
                        face region
```

## Roadmap

| Milestone | ETA |
| --- | --- |
| Inference source code released on GitHub | 2024-06-15 |
| Pretrained models on Huggingface | 2024-06-15 |
| Training: data preparation and training scripts | 2024-06-25 |
| Optimize inference performance in Mandarin | TBD |

## Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@misc{xu2024hallo,
  title={Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation},
  author={Mingwang Xu and Hui Li and Qingkun Su and Hanlin Shang and Liwei Zhang and Ce Liu and Jingdong Wang and Yao Yao and Siyu Zhu},
  year={2024},
  eprint={2406.08801},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Opportunities available

Multiple research positions are open at the Generative Vision Lab, Fudan University:

* Research assistant
* Postdoctoral researcher
* PhD candidate
* Master's students

Interested individuals are encouraged to contact us at siyuzhu@fudan.edu.cn for further information.

## Social Risks and Mitigations

The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.