# AAIELA: AI Assisted Image Editing with Language and Audio

This project lets users modify images using audio commands alone. By leveraging open-source AI models for computer vision, speech-to-text, large language models (LLMs), and text-to-image inpainting, it provides a seamless editing experience that bridges the gap between spoken language and visual transformation.

demo.mp4

## Project Structure

* `detectron2`: The Detectron2 library for object detection, keypoint detection, instance/panoptic segmentation, etc.
* `faster_whisper`: Contains faster_whisper, an optimized implementation of OpenAI Whisper for audio transcription/translation.
* `language_model`: Uses a small language model such as Phi-3, or any LLM API (Gemini, Claude, GPT-4, etc.), to extract the object, action, and prompt from a natural-language instruction.
* `sd_inpainting`: Includes the text-conditioned Stable Diffusion v1.5 inpainting model.

## Installation

See INSTALL.md for installation instructions.

* **API keys:** Create a `.env` file in the root directory of the project and fill in API keys if you intend to use API-based language models. Use the provided `.env.example` file as a template. Alternatively, to use a small local language model such as Phi-3, set `active_model: local` in the config file.
* **Tests:** To run individual test files: `$ python -m tests.`
* **Configuration:** Adjust settings in the `aaiela.yaml` config file (e.g., `device`, `active_model`). Toggle between an API-based model and a local LLM by modifying the `active_model` parameter.
* **Run:** Start the project's main script to load the models and launch the web interface: `python app.py`

## Project Workflow

1. **Upload:** The user uploads an image.
2. **Segmentation:** Detectron2 segments the image.
3. **Audio Input:** The user records an audio command (e.g., "Replace the sky with a starry night.").
4. **Transcription:** Faster Whisper transcribes the audio into text.
5. **Language Understanding:** The LLM (Gemini, GPT-4, Phi-3, etc.) extracts the object, action, and prompt from the text.
6. **Image Inpainting:**
   + Relevant masks are selected from the segmentation results.
   + Stable Diffusion inpainting applies the desired changes.
7. **Output:** The inpainted image is returned to the user.

A minimal code sketch of this pipeline follows.
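The sketch below wires these stages together using off-the-shelf libraries (faster-whisper, Detectron2, diffusers) as stand-ins for the project's own wrapper modules. Every function name, model ID, and file path here is an illustrative assumption rather than the repository's actual API, and the LLM parsing step is left as a placeholder.

```python
# Hypothetical end-to-end sketch of the AAIELA workflow (not the project's real API).
# Stand-ins: faster-whisper for transcription, a stock Detectron2 model for
# segmentation, and diffusers' SD v1.5 inpainting pipeline for editing.
import numpy as np
from PIL import Image
from faster_whisper import WhisperModel
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from diffusers import StableDiffusionInpaintPipeline


def transcribe(audio_path: str) -> str:
    """Step 4: speech-to-text with faster-whisper."""
    segments, _ = WhisperModel("base").transcribe(audio_path)
    return " ".join(seg.text for seg in segments).strip()


def parse_instruction(text: str) -> tuple[str, str, str]:
    """Step 5: the LLM (Phi-3 or an API model) extracts (object, action, prompt).
    Placeholder only -- prompting / API calls are omitted in this sketch."""
    raise NotImplementedError


def segment(image_bgr: np.ndarray) -> dict:
    """Step 2: instance segmentation with a stock Detectron2 Mask R-CNN."""
    cfg = get_cfg()
    cfg_name = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
    cfg.merge_from_file(model_zoo.get_config_file(cfg_name))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(cfg_name)
    return DefaultPredictor(cfg)(image_bgr)


def select_mask(outputs: dict, target: str) -> Image.Image:
    """Step 6a: pick the mask whose predicted COCO class matches the target object.
    (Stuff classes such as "sky" would need panoptic segmentation instead.)"""
    classes = MetadataCatalog.get("coco_2017_val").thing_classes
    inst = outputs["instances"].to("cpu")
    for cls_id, mask in zip(inst.pred_classes, inst.pred_masks):
        if classes[int(cls_id)] == target:
            return Image.fromarray(mask.numpy().astype("uint8") * 255)
    raise ValueError(f"no '{target}' found in the image")


def inpaint(image: Image.Image, mask: Image.Image, prompt: str) -> Image.Image:
    """Step 6b: text-conditioned Stable Diffusion v1.5 inpainting via diffusers."""
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting")
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]


if __name__ == "__main__":
    image = Image.open("input.jpg").convert("RGB")               # step 1: upload
    text = transcribe("command.wav")                             # steps 3-4
    obj, action, prompt = parse_instruction(text)                # step 5
    outputs = segment(np.array(image)[:, :, ::-1].copy())        # step 2 (BGR input)
    result = inpaint(image, select_mask(outputs, obj), prompt)   # step 6
    result.save("output.png")                                    # step 7
```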
## Research

1. The SDXL inpainting model requires retraining on a substantially larger dataset to achieve satisfactory results; the current model trained by HuggingFace shows limitations.
2. Context-aware automatic mask generation for prompts such as "Add a cat sitting on the wooden chair." Incorporate domain knowledge or external knowledge bases (e.g., object attributes, spatial relationships) to guide mask generation.
3. The 'Segment Anything' paper explored generating masks from text input; this remains an active area of research.
4. Contextual reasoning: understand relationships between objects and actions (e.g., "sitting" implies the cat should be on top of the chair).
5. Multi-object mask generation, e.g., "Put a cowboy hat on the person on the right and a red scarf around their neck."
6. Integrate a vision-language model (VLM) such as BLIP to provide another layer of interaction for users.
   + If a voice command is unclear or ambiguous, the VLM can analyze the image and offer suggestions or ask clarifying questions.
   + The VLM can suggest adjustments to numerical parameters based on the image content, etc.

## Todo

* [ ] The current TensorRT integration for Stable Diffusion models lacks a working example of the text-to-image inpainting pipeline.
* [ ] Integrate ControlNet conditioned on keypoints, depth, input scribbles, and other modalities.
* [ ] Integrate MediaPipe Face Mesh to enable facial landmark detection, face geometry estimation, eye tracking, and other features for modifying facial features in response to audio commands (e.g., "Make me smile," "Change my eye color").
* [ ] Integrate pose landmark detection capabilities.
* [ ] Incorporate a super-resolution model for image upscaling.
* [ ] Implement interactive mask editing using Segment Anything with simple click-based interactions, followed by inpainting using audio instructions (see the sketch after this list).
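The last item above is not implemented yet. As a rough illustration of the intended interaction, the snippet below shows how a single user click could be turned into a mask with the off-the-shelf `segment-anything` package; the checkpoint path and click coordinates are assumptions, and the resulting mask would feed the same inpainting step described in the workflow.

```python
# Hypothetical click-to-mask sketch with the segment-anything package; the
# checkpoint path and click point are placeholders, not project defaults.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

image = np.array(Image.open("input.jpg").convert("RGB"))

# Load a SAM backbone (ViT-B) from a locally downloaded checkpoint.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# One foreground click (x, y) from the user -- e.g. forwarded by the web UI.
click = np.array([[320, 240]])
masks, scores, _ = predictor.predict(
    point_coords=click,
    point_labels=np.array([1]),   # 1 = foreground point
    multimask_output=True,
)

# Keep the highest-scoring mask and save it for the inpainting step.
best = masks[int(np.argmax(scores))]
Image.fromarray(best.astype("uint8") * 255).save("mask.png")
```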