# AAIELA: AI Assisted Image Editing with Language and Audio

This project lets users modify images using audio commands alone. By leveraging open-source AI models for computer vision, speech-to-text, large language models (LLMs), and text-to-image inpainting, it provides a seamless editing experience that bridges the gap between spoken language and visual transformation.

demo.mp4

## Project Structure

* `detectron2`: The Detectron2 library for object detection, keypoint detection, instance/panoptic segmentation, etc.
* `faster_whisper`: Contains faster_whisper, an optimized implementation of OpenAI Whisper for audio transcription/translation.
* `language_model`: Uses a small language model such as Phi-3, or any LLM API (Gemini, Claude, GPT-4, etc.), to extract the object, action, and prompt from a natural-language instruction.
* `sd_inpainting`: Includes the text-conditioned Stable Diffusion v1.5 inpainting model.

## Installation

See INSTALL.md for installation instructions.

* **API keys:** Create a `.env` file in the root directory of the project and fill in API keys if you intend to use API-based language models. Use the provided `.env.example` file as a template. Alternatively, to use a small local language model such as Phi-3, set `active_model: local` in the config file.
* **Tests:** To run individual test files: `$ python -m tests.`
* **Configuration:** Adjust settings in the `aaiela.yaml` config file (e.g., `device`, `active_model`). Toggle between an API-based model and a local LLM by modifying the `active_model` parameter.
* **Run:** Start the project's main script to load the models and launch the web interface: `python app.py`

## Project Workflow

1. **Upload:** The user uploads an image.
2. **Segmentation:** Detectron2 segments the image.
3. **Audio Input:** The user records an audio command (e.g., "Replace the sky with a starry night.").
4. **Transcription:** Faster Whisper transcribes the audio into text.
5. **Language Understanding:** The LLM (Gemini, GPT-4, Phi-3, etc.) extracts the object, action, and prompt from the text.
6. **Image Inpainting:**
   + Relevant masks are selected from the segmentation results.
   + Stable Diffusion inpainting applies the desired changes.
7. **Output:** The inpainted image is returned to the user.

A minimal code sketch of this pipeline follows.
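The sketch below wires these stages together using off-the-shelf libraries (faster-whisper, Detectron2, diffusers) as stand-ins for the project's own wrapper modules. Every function name, model ID, and file path here is an illustrative assumption rather than the repository's actual API, and the LLM parsing step is left as a placeholder.

```python
# Hypothetical end-to-end sketch of the AAIELA workflow (not the project's real API).
# Stand-ins: faster-whisper for transcription, a stock Detectron2 model for
# segmentation, and diffusers' SD v1.5 inpainting pipeline for editing.
import numpy as np
from PIL import Image
from faster_whisper import WhisperModel
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from diffusers import StableDiffusionInpaintPipeline


def transcribe(audio_path: str) -> str:
    """Step 4: speech-to-text with faster-whisper."""
    segments, _ = WhisperModel("base").transcribe(audio_path)
    return " ".join(seg.text for seg in segments).strip()


def parse_instruction(text: str) -> tuple[str, str, str]:
    """Step 5: the LLM (Phi-3 or an API model) extracts (object, action, prompt).
    Placeholder only -- prompting / API calls are omitted in this sketch."""
    raise NotImplementedError


def segment(image_bgr: np.ndarray) -> dict:
    """Step 2: instance segmentation with a stock Detectron2 Mask R-CNN."""
    cfg = get_cfg()
    cfg_name = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
    cfg.merge_from_file(model_zoo.get_config_file(cfg_name))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(cfg_name)
    return DefaultPredictor(cfg)(image_bgr)


def select_mask(outputs: dict, target: str) -> Image.Image:
    """Step 6a: pick the mask whose predicted COCO class matches the target object.
    (Stuff classes such as "sky" would need panoptic segmentation instead.)"""
    classes = MetadataCatalog.get("coco_2017_val").thing_classes
    inst = outputs["instances"].to("cpu")
    for cls_id, mask in zip(inst.pred_classes, inst.pred_masks):
        if classes[int(cls_id)] == target:
            return Image.fromarray(mask.numpy().astype("uint8") * 255)
    raise ValueError(f"no '{target}' found in the image")


def inpaint(image: Image.Image, mask: Image.Image, prompt: str) -> Image.Image:
    """Step 6b: text-conditioned Stable Diffusion v1.5 inpainting via diffusers."""
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting")
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]


if __name__ == "__main__":
    image = Image.open("input.jpg").convert("RGB")               # step 1: upload
    text = transcribe("command.wav")                             # steps 3-4
    obj, action, prompt = parse_instruction(text)                # step 5
    outputs = segment(np.array(image)[:, :, ::-1].copy())        # step 2 (BGR input)
    result = inpaint(image, select_mask(outputs, obj), prompt)   # step 6
    result.save("output.png")                                    # step 7
```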
## Research

1. The SDXL inpainting model requires retraining on a substantially larger dataset to achieve satisfactory results; the current model trained by HuggingFace shows limitations.
2. Context-aware automatic mask generation for prompts such as "Add a cat sitting on the wooden chair." Incorporate domain knowledge or external knowledge bases (e.g., object attributes, spatial relationships) to guide mask generation.
3. The 'Segment Anything' paper explored generating masks from text input; this remains an active area of research.
4. Contextual reasoning: understand relationships between objects and actions (e.g., "sitting" implies the cat should be on top of the chair).
5. Multi-object mask generation, e.g., "Put a cowboy hat on the person on the right and a red scarf around their neck."
6. Integrate a vision-language model (VLM) such as BLIP to provide another layer of interaction for users.
   + If a voice command is unclear or ambiguous, the VLM can analyze the image and offer suggestions or ask clarifying questions.
   + The VLM can suggest adjustments to numerical parameters based on the image content, etc.

## Todo

* [ ] The current TensorRT integration for Stable Diffusion models lacks a working example of the text-to-image inpainting pipeline.
* [ ] Integrate ControlNet conditioned on keypoints, depth, input scribbles, and other modalities.
* [ ] Integrate MediaPipe Face Mesh to enable facial landmark detection, face geometry estimation, eye tracking, and other features for modifying facial features in response to audio commands (e.g., "Make me smile," "Change my eye color").
* [ ] Integrate pose landmark detection capabilities.
* [ ] Incorporate a super-resolution model for image upscaling.
* [ ] Implement interactive mask editing using Segment Anything with simple click-based interactions, followed by inpainting using audio instructions (see the sketch after this list).
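The last item above is not implemented yet. As a rough illustration of the intended interaction, the snippet below shows how a single user click could be turned into a mask with the off-the-shelf `segment-anything` package; the checkpoint path and click coordinates are assumptions, and the resulting mask would feed the same inpainting step described in the workflow.

```python
# Hypothetical click-to-mask sketch with the segment-anything package; the
# checkpoint path and click point are placeholders, not project defaults.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

image = np.array(Image.open("input.jpg").convert("RGB"))

# Load a SAM backbone (ViT-B) from a locally downloaded checkpoint.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# One foreground click (x, y) from the user -- e.g. forwarded by the web UI.
click = np.array([[320, 240]])
masks, scores, _ = predictor.predict(
    point_coords=click,
    point_labels=np.array([1]),   # 1 = foreground point
    multimask_output=True,
)

# Keep the highest-scoring mask and save it for the inpainting step.
best = masks[int(np.argmax(scores))]
Image.fromarray(best.astype("uint8") * 255).save("mask.png")
```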