[HN Gopher] Show HN: AI assisted image editing with audio instru...
       ___________________________________________________________________
        
       Show HN: AI assisted image editing with audio instructions
        
       Excited to launch AAIELA, an AI-powered tool that understands your
       spoken commands and edits images accordingly. By leveraging open-
       source AI models for computer vision, speech-to-text, large
       language models (LLMs), and text-to-image inpainting, we have
       created a seamless editing experience that bridges the gap between
       spoken language and visual transformation.  Imagine the
       possibilities if Google Photos integrated voice assisted editing
       like AAIELA! Alongside Magic Eraser and other AI tools, editing
       with audio instruction could revolutionize how we interact with our
       photos.
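The chain the post describes (speech-to-text, an LLM planner, then segmentation plus inpainting) can be sketched in a few lines. This is an illustrative skeleton only: the function names, the dict-based "edit op" format, and the stubbed return values are hypothetical, not AAIELA's actual API; a real build would plug open-source models into each stub.

```python
def transcribe(audio_path: str) -> str:
    """Stub for a speech-to-text model."""
    # A real implementation would run the audio file through an STT model.
    return "replace the sky with a deep blue sky"

def plan_edits(instruction: str) -> list[dict]:
    """Stub for an LLM that turns free text into structured edit ops."""
    # A real implementation would prompt an LLM to emit this structure.
    return [{"op": "replace", "target": "sky", "prompt": "a deep blue sky"}]

def inpaint(image_path: str, op: dict) -> str:
    """Stub for segmentation + text-to-image inpainting."""
    # A real implementation would mask op["target"] with a vision model
    # and fill the masked region with a diffusion inpainting model.
    return f"{image_path} with {op['target']} -> {op['prompt']}"

def edit_with_audio(image_path: str, audio_path: str) -> str:
    """Run the full pipeline: audio -> text -> edit ops -> edited image."""
    result = image_path
    for op in plan_edits(transcribe(audio_path)):
        result = inpaint(result, op)
    return result
```

The key design point is the middle step: the LLM's job is only to translate free-form speech into a small, ordered list of structured operations, so the vision models never see natural language directly.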
        
       Author : ShaShekhar
       Score  : 76 points
       Date   : 2024-07-01 09:32 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sgbeal wrote:
       | Wow! We're now just a hair's-width away from finally being able
       | to say, "Computer, enhance image!" without sounding like we're in
       | a bad sci-fi show.
        
         | sargstuff wrote:
          | Think the only thing the historical science fiction /
          | Blade Runner photo inspect scene[0] didn't foresee was
          | vocally having AI assist/analyze the photo to summarize a
          | list of items/objects available to zoom/view (vs. pan/zoom
          | around). Although altavista glasses / hand gestures[3]
          | would have been a future concept at the time, too.
         | 
         | ----
         | 
         | [0] : https://scifiinterfaces.com/2020/04/29/deckards-photo-
         | inspec...
         | 
         | [1] 'mirror reality' image / TERI[2] :
         | https://www.hackster.io/news/blade-runner-s-image-enhancemen...
         | 
          | [2] : TERI, an almost-IRL Blade Runner movie image
          | enhancement tool
         | : https://news.ycombinator.com/edit?id=40844595 /
         | https://github.com/iscilab2020/TERI-3DNLOS/tree/TERI
         | 
         | [3] : Gest : https://news.ycombinator.com/edit?id=40844704
        
         | jaggs wrote:
         | Make it so!
        
           | sargstuff wrote:
           | gMake it, you gAught it. (once there's enough bandwidth to go
           | around[0])
           | 
           | [0] : Intel CPU with OCI Chiplet Demoed with 4Tbps of
           | Bandwidth and 100M Reach :
           | https://news.ycombinator.com/item?id=40844616
        
         | throwaway4aday wrote:
         | Using Whisper as the voice interface, an LLM to understand the
         | prompt and issue function call commands and an image upscaler
         | you could build this in a weekend. Would it be useful? Not
         | especially by itself but I think there is a lot of promise in
         | voice interaction with LLM operated software.
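The "build it in a weekend" glue described above is mostly a dispatch problem: the LLM emits a function call, and local code routes it to a tool. A minimal sketch, assuming the LLM returns a JSON function call; the `upscale` tool, the JSON shape, and the tool registry are made up for illustration:

```python
import json

def upscale(image: str, factor: int = 2) -> str:
    # Stand-in for a real super-resolution model.
    return f"{image}@{factor}x"

# Registry mapping the function names the LLM may emit to local tools.
TOOLS = {"upscale": upscale}

def dispatch(llm_output: str, image: str) -> str:
    """Parse the LLM's JSON function call and run the matching tool."""
    call = json.loads(llm_output)
    fn = TOOLS[call["name"]]
    return fn(image, **call.get("arguments", {}))
```

For example, `dispatch('{"name": "upscale", "arguments": {"factor": 4}}', "photo.png")` returns `"photo.png@4x"`. Whisper (or any STT model) would sit in front of this, turning speech into the prompt that produces `llm_output`.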
        
       | leobg wrote:
       | I love how in the demo video, even the audio instructions
       | themselves are AI generated. No human in the loop, at all! :)
        
         | ShaShekhar wrote:
         | I did it intentionally. The video had my voice, but then I
         | decided to replace it with an AI voice.
        
       | ShaShekhar wrote:
        | Example instructions:
        | 
        | 1. Replace the sky with a deep blue sky, then replace the
        | mountain with a Himalayan mountain covered in snow.
        | 
        | 2. Stylize the car with a cyberpunk aesthetic, then change
        | the background to a neon-lit cityscape at night.
        | 
        | 3. Replace the person with a sculpture complementing the
        | architecture.
       | 
       | Check out the Research section for more complex instructions.
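The example instructions above are compound: several edits chained with "then". One naive way to sequence them is a plain split on that connective; this helper is a hypothetical illustration (in practice the LLM itself would decompose the instruction, which also handles phrasings this regex would miss):

```python
import re

def split_steps(instruction: str) -> list[str]:
    """Split a "then"-chained instruction into ordered single edits."""
    parts = re.split(r",?\s*\bthen\b\s*", instruction, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]
```

Applied to the first example, this yields two steps to run in order: the sky replacement, then the mountain replacement.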
        
       | throwaway4aday wrote:
       | Love it! Voice interaction is a great modality for UI. A lot of
       | people have a bad taste left over from early attempts but I
        | expect to see a lot of progress now that STT and natural
        | language understanding are so much better.
       | 
       | The biggest reason we should be adding conversational UI to
       | everything is the harm done by RSI and sedentary keyboard and
       | mouse interfaces. We're crippling entire generations of people by
       | sticking to outdated hardware. The good news is we can break free
       | of this now that we have huge improvements in LLMs and AR
       | hardware. We'll be back to healthy levels of activity in 5 to 10
       | years. Sorry Keeb builders, it's time to join the stamp
       | collectors and typewriter enthusiasts. We'll be working in the
       | park today.
        
         | prawn wrote:
         | I'd like to see a voice instruction layer that can work
          | independently of the mouse/keyboard, without stealing
         | focus. Things like moving files or preparing
         | windows/positioning prior to switching.
        
         | xyproto wrote:
         | Have you tried sitting in a park for hours, talking out loud
         | and seeing what happens?
        
           | throwaway4aday wrote:
           | Ignoring the snark. This will change as technology is
           | adopted, go back 40 years (or even less) and a person walking
           | around staring at a little black rectangle would have been
           | perceived as weird and anti-social. We used to make fun of
           | people talking on the phone via bluetooth headsets and now
           | everyone does it with AirPods or whatever.
           | 
           | If you've got the technology to enable you to seamlessly
           | transition from working in your home to working while sitting
           | outside at a cafe to working while sitting on a blanket under
           | a tree in the park to working wherever you feel like it then
           | there will be enough brave people that say "fuck what other
           | people think" and just do it so they can enjoy being active
           | and getting fresh air and eventually more and more people
           | will join them. Eventually we'll reach the point where
           | sitting inside at a desk for 8-12 hours will be the weird
           | thing.
        
           | N0b8ez wrote:
           | Isn't that just like taking a phone call? I'm not sure what
           | you're trying to imply.
        
         | mistermann wrote:
         | One big problem would be that in open office environments there
         | would be a lot of noise. I wonder if some sort of active noise
         | cancellation could be introduced so the voices of your co-
         | workers could be ~completely canceled out if you are wearing
         | special headphones?
        
           | throwaway4aday wrote:
           | When I consider my own LLM workflow the amount of time
           | reading/listening/thinking outweighs the amount of time spent
           | typing/speaking. If that's any indication of how a fully
           | fledged conversational workflow would work then I think open
           | plan offices wouldn't be a lot louder than they currently
           | are. Depending on how quickly agentic LLMs are developed I'm
           | not even sure we will be using offices the same way we are
           | now. We might only need to meet or checkin with our coworkers
           | and our LLM agents every few hours or once a day or maybe
           | even longer in order to realign and check on results. Maybe
           | we'll get occasional messages asking us to confirm something
           | or provide clarification, I could honestly see most knowledge
           | work evaporating and leaving behind only high level
           | coordination, research and ideation.
           | 
           | Before that, I'm certain we'll all be spending a lot more
           | time reviewing work, trying out prototypes and tweaking
           | prompts or specifications than we do typing or talking.
        
       | throwaway4aday wrote:
       | Forgot to share this link as well, not sure if you're aware of it
       | but it's a great write up on fine tuning small local models on
       | specific APIs and seems like it would be a perfect fit for your
       | project. https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/
        
         | ShaShekhar wrote:
          | I integrated and tested Microsoft's Phi-3-mini and it
          | works really well. Having the freedom to run locally
          | without sharing private photos is my utmost objective.
        
       | G1N wrote:
       | We're so close to being able to create our own Tayne
       | 
       | (https://www.youtube.com/watch?v=a8K6QUPmv8Q)
        
       | parentheses wrote:
       | soon the movie trope of saying "enhance" repeatedly could be a
       | real thing!
        
       | whatnotests2 wrote:
       | Zoom. Enhance!
        
       | beautifulfreak wrote:
       | It didn't just replace the sky and background, it replaced the
       | trees. That wasn't part of the instructions.
        
       | kveykva wrote:
       | This pitches a lot but only seems to support a specific
       | inpainting operation?
        
       | vunderba wrote:
       | Nice job. I actually experimented with a chat driven instruct2pix
       | sort of interface that connected via API to a stable diffusion
       | backend. The big problem is that it's difficult to know if the
       | inpainting job you've done is satisfactory to the user.
       | 
       | This is why usually when you're doing this sort of traditional
       | inpainting in automatic1111 you generate several iterations with
       | various mask blurs, whole picture vs only masked section, padding
       | and of course the optimal inpainting checkpoint model to use
        | depends on whether the original image is photorealistic or
        | illustrated, etc.
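The iteration vunderba describes amounts to a parameter sweep: one candidate image per combination of inpainting settings, from which the user picks the best. A minimal sketch, where `run_inpaint` is a stand-in for a real backend call (e.g. to an automatic1111 or diffusers pipeline) and the parameter values are illustrative:

```python
from itertools import product

def run_inpaint(image: str, mask_blur: int, full_picture: bool, padding: int) -> str:
    # Stand-in: a real call would hit the diffusion backend and
    # return a generated image rather than a label.
    scope = "full" if full_picture else "masked"
    return f"{image}|blur={mask_blur}|{scope}|pad={padding}"

def candidates(image: str) -> list[str]:
    """Generate one candidate per combination of inpainting settings."""
    mask_blurs = [4, 8, 16]   # softer vs harder mask edges
    scopes = [True, False]    # whole picture vs only masked section
    paddings = [32, 64]       # context around the mask, in pixels
    return [run_inpaint(image, b, s, p)
            for b, s, p in product(mask_blurs, scopes, paddings)]
```

Even this small grid yields twelve candidates per instruction, which is exactly why judging "satisfactory to the user" automatically is the hard part: the system cannot rank them without feedback.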
        
       | benzguo wrote:
       | Super cool! We're building an API that makes it easy to build
       | chained multi-model workflows like this that run with zero
       | latency between tasks - https://www.substrate.run/
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:01 UTC)