https://github.com/PKU-YuanGroup/Video-LLaVA

PKU-YuanGroup / Video-LLaVA Public


Video-LLaVA: Learning United Visual Representation by Alignment
Before Projection

arxiv.org/pdf/2311.10122.pdf

Latest commit: ba85761 by LinB203 on Nov 21, 2023 ("Update README.md"); 74 commits.



If you like our project, please give us a star on GitHub for the latest updates.

                                   


 I also have other video-language projects that may interest you.

    LanguageBind: Extending Video-Language Pretraining to N-modality
    by Language-based Semantic Alignment
    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang,
    Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang,
    Zhifeng Li, Wei Liu, Li Yuan

    Chat-UniVi: Unified Visual Representation Empowers Large Language
    Models with Image and Video Understanding
    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan

  News

  * [2023.11.20] Demo and code are available now! Watch this
    repository for the latest updates.

  Highlights

Video-LLaVA exhibits remarkable interactive capabilities between
images and videos, despite the absence of image-video pairs in the
dataset.

  Simple baseline, learning united visual representation by
alignment before projection

  * By binding unified visual representations to the language
    feature space, we enable an LLM to perform visual reasoning on
    both images and videos simultaneously.

  High performance, complementary learning with video and image

  * Extensive experiments demonstrate the complementarity of
    modalities, showcasing significant superiority when compared to
    models specifically designed for either images or videos.
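
The data flow behind "alignment before projection" can be pictured with a toy sketch. It assumes that image and video features already share one feature space before they reach the projection layer, so a single projection can serve both modalities. All names, dimensions, and random weights below (`FEATURE_DIM`, `LLM_DIM`, `shared_projection`, `fake_features`) are hypothetical and only illustrate the idea; this is not the repository's implementation.

```python
import random

FEATURE_DIM = 4   # hypothetical size of the shared (pre-aligned) visual feature space
LLM_DIM = 6       # hypothetical size of the LLM token embedding space

random.seed(0)
# One projection matrix (FEATURE_DIM x LLM_DIM) shared by both modalities.
PROJECTION = [[random.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(FEATURE_DIM)]


def shared_projection(tokens):
    """Map each visual token (length FEATURE_DIM) to the LLM embedding size."""
    return [
        [sum(tok[i] * PROJECTION[i][j] for i in range(FEATURE_DIM)) for j in range(LLM_DIM)]
        for tok in tokens
    ]


def fake_features(num_tokens):
    """Stand-in for an encoder whose outputs are assumed to be already aligned."""
    return [[random.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(num_tokens)]


image_tokens = fake_features(num_tokens=3)       # one image -> a few patch tokens
video_tokens = fake_features(num_tokens=2 * 3)   # 2 frames x 3 patch tokens each

# Both modalities pass through the same projection, so the LLM sees one kind of visual token.
image_embeds = shared_projection(image_tokens)
video_embeds = shared_projection(video_tokens)
print(len(image_embeds), len(image_embeds[0]))
print(len(video_embeds), len(video_embeds[0]))
```

The point of the sketch is only that no modality-specific projection is needed once the inputs are aligned; a video simply contributes more tokens than an image.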

[main]

  Demo

  * Gradio Web UI

We highly recommend trying our web demo, which incorporates all
features currently supported by Video-LLaVA. Launch it with the
following command. We also provide an online demo on Hugging Face
Spaces.

python -m llava.serve.gradio_web_server


  * CLI Inference

python -m llava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --video-file "path/to/your/video.mp4" --load-4bit

[videocli]

python -m llava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --image-file "path/to/your/image.jpg" --load-4bit

[imagecli]
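
The two CLI calls above differ only in the media flag. As a minimal sketch, the helper below picks `--video-file` or `--image-file` from the file extension and assembles the argument list. The function name `build_cli_command` and the extension sets are hypothetical; the module path, model path, and flags are the ones shown above.

```python
import shlex
from pathlib import Path

MODEL_PATH = "LanguageBind/Video-LLaVA-7B"
# Hypothetical extension sets, used only to choose the right flag.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def build_cli_command(media_path, load_4bit=True):
    """Return the argument list for llava.serve.cli for one image or video file."""
    suffix = Path(media_path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        media_flag = "--video-file"
    elif suffix in IMAGE_EXTENSIONS:
        media_flag = "--image-file"
    else:
        raise ValueError(f"Unrecognized media type: {media_path}")
    command = [
        "python", "-m", "llava.serve.cli",
        "--model-path", MODEL_PATH,
        media_flag, str(media_path),
    ]
    if load_4bit:
        command.append("--load-4bit")
    return command


# Print the command instead of running it; the model itself is not needed for this sketch.
print(shlex.join(build_cli_command("path/to/your/video.mp4")))
```

To actually launch the CLI, the returned list can be passed to `subprocess.run(command, check=True)` from inside the activated environment.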

  Main Results

 Image understanding

[res_img]

 Video understanding

[res_vi]

  Requirements and Installation

  * Python >= 3.10
  * PyTorch == 2.0.1
  * CUDA Version >= 11.7
  * Install required packages:

git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
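
After installation, a quick sanity check against the version requirements listed above can save debugging time. The sketch below is an assumption-laden helper, not part of the repository: `parse_version` and `check_environment` are hypothetical names, and the script only reads `torch.__version__` and `torch.version.cuda` if PyTorch can be imported.

```python
import re
import sys


def parse_version(text):
    """Turn '2.0.1+cu117' or '11.7' into a tuple of ints, ignoring any local suffix."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment(python_version, torch_version, cuda_version):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python >= 3.10 is required")
    if torch_version is None or parse_version(torch_version)[:3] != (2, 0, 1):
        problems.append("PyTorch == 2.0.1 is required")
    if cuda_version is None or parse_version(cuda_version) < (11, 7):
        problems.append("CUDA >= 11.7 is required")
    return problems


if __name__ == "__main__":
    try:
        import torch  # imported lazily so a missing install is reported, not a crash
        torch_version, cuda_version = torch.__version__, torch.version.cuda
    except ImportError:
        torch_version, cuda_version = None, None
    for problem in check_environment(sys.version_info, torch_version, cuda_version):
        print("WARNING:", problem)
```

Keeping the checks in a pure function makes them easy to test without a GPU; only the `__main__` block touches the real environment.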

  API

All of our code is open source. To load the model (e.g.
LanguageBind/Video-LLaVA-7B) locally, you can use the following code
snippets.

 Inference for image

import torch
from llava.constants import X_TOKEN_INDEX, DEFAULT_X_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_X_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'llava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the image and move the pixel values to the model's device in half precision.
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if isinstance(image_tensor, list):
        tensor = [img.to(model.device, dtype=torch.float16) for img in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)
    key = ['image']

    # Echo the question under the user role (roles[0]), then build the prompt
    # with the image placeholder token in front of the question.
    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_X_TOKEN['IMAGE'] + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)  # empty assistant turn to be generated
    prompt = conv.get_prompt()
    input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['IMAGE'], return_tensors='pt').unsqueeze(0).to(model.device)
    # Stop generating once the template's separator appears in the output.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=[tensor, key],
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()
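
To make the prompt-building step in the script above easier to follow, here is a toy re-creation of what a two-separator conversation template does when `get_prompt()` is called. It is a simplified sketch, not the repository's `llava.conversation` code: `ToyConversation`, the system text, the role names, the separators, and the `<image>` placeholder are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

IMAGE_PLACEHOLDER = "<image>"  # assumed value of DEFAULT_X_TOKEN['IMAGE']


@dataclass
class ToyConversation:
    """Simplified stand-in for a two-separator conversation template."""
    system: str
    roles: tuple = ("USER", "ASSISTANT")  # assumed role names
    sep: str = " "        # written after the system text and after user turns
    sep2: str = "</s>"    # written after assistant turns; doubles as the stop string
    messages: list = field(default_factory=list)

    def append_message(self, role, message):
        self.messages.append((role, message))

    def get_prompt(self):
        seps = [self.sep, self.sep2]
        prompt = self.system + seps[0]
        for index, (role, message) in enumerate(self.messages):
            if message:
                prompt += f"{role}: {message}{seps[index % 2]}"
            else:
                # An empty assistant turn leaves the prompt open for generation.
                prompt += f"{role}:"
        return prompt


conv = ToyConversation(system="A chat between a user and an assistant.")  # hypothetical system text
conv.append_message(conv.roles[0], IMAGE_PLACEHOLDER + "\n" + "What is unusual about this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(prompt)
```

The placeholder marks where the visual tokens are spliced in later, which is why the script tokenizes the prompt with a dedicated helper rather than the plain tokenizer call.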

 Inference for video

import torch
from llava.constants import X_TOKEN_INDEX, DEFAULT_X_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_X_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    video = 'llava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the video and move the pixel values to the model's device in half precision.
    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if isinstance(video_tensor, list):
        tensor = [vid.to(model.device, dtype=torch.float16) for vid in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)
    key = ['video']

    # Echo the question under the user role (roles[0]), then build the prompt
    # with the video placeholder token in front of the question.
    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)  # empty assistant turn to be generated
    prompt = conv.get_prompt()
    input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['VIDEO'], return_tensors='pt').unsqueeze(0).to(model.device)
    # Stop generating once the template's separator appears in the output.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=[tensor, key],
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()
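
Both scripts print the decoded text as-is. Because generation is halted when the stop string shows up, the decoded text may still end with that stop string. A small post-processing helper, sketched below, trims it before printing. The name `clean_output`, the `"</s>"` value, and the sample answer are assumptions for illustration; in the scripts the stop string comes from `conv.sep` or `conv.sep2`.

```python
def clean_output(decoded_text, stop_str):
    """Trim whitespace and drop a single trailing stop string, if present."""
    text = decoded_text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()


STOP_STR = "</s>"  # assumed stop string for this example
print(clean_output("  An example answer.</s>", STOP_STR))
```

In the scripts above this would be applied as `outputs = clean_output(outputs, stop_str)` right after `tokenizer.decode(...)`.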

  Training & Validating

Training and validation instructions are in TRAIN_AND_VALIDATE.md.

  Acknowledgement

  * LLaVA: the codebase we built upon, an efficient large language
    and vision assistant.
  * Video-ChatGPT: great work contributing the evaluation code and
    dataset.

  Related Projects

  * LanguageBind: an open-source, language-based retrieval framework
    covering five modalities.
  * Chat-UniVi: a framework that lets the model make efficient use of
    a limited number of visual tokens.

  License

  * The majority of this project is released under the Apache 2.0
    license as found in the LICENSE file.
  * The service is a research preview intended for non-commercial use
    only, subject to the model License of LLaMA, Terms of Use of the
    data generated by OpenAI, and Privacy Practices of ShareGPT.
    Please contact us if you find any potential violation.

  Citation

If you find our paper and code useful in your research, please
consider giving us a star and citing our work.

@misc{lin2023videollava,
      title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
      author={Bin Lin and Bin Zhu and Yang Ye and Munan Ning and Peng Jin and Li Yuan},
      year={2023},
      eprint={2311.10122},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and HongFa Wang and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Wancai Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contributors 4

  * @LinB203 LinB203 lb203
  * @JessyTsu1 JessyTsu1
  * @eltociear eltociear Ikko Eltociear Ashimine
  * @nateraw nateraw Nathan Raw
