[HN Gopher] Show HN: Open-source macOS AI copilot (using vision ...
___________________________________________________________________
Show HN: Open-source macOS AI copilot (using vision and voice)
Heeey! I built a macOS copilot that has been useful to me, so I
open-sourced it in case others would find it useful too. It's
pretty simple:

- Use a keyboard shortcut to take a screenshot of your active
  macOS window and start recording the microphone.
- Speak your question, then press the keyboard shortcut again to
  send your question + screenshot off to OpenAI Vision.
- The Vision response is presented in context, overlaid on the
  active window, and spoken to you as audio.
- The app keeps running in the background, only taking a
  screenshot/listening when activated by the keyboard shortcut.

It's built with NodeJS/Electron, and uses the OpenAI Whisper,
Vision and TTS APIs under the hood (BYO API key). There's a simple
demo and a longer walk-through in the GH readme
https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also
posted a different demo on Twitter:
https://twitter.com/ralfelfving/status/1732044723630805212
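For anyone curious about the plumbing, the Vision step boils down to one JSON payload. A hedged Node sketch of what that request could look like, following OpenAI's chat completions API with image inputs (function names and field shapes here are illustrative, not the repo's actual code):

```javascript
// Illustrative sketch: the payload sent to OpenAI Vision once the app has
// a screenshot (base64 PNG) and a Whisper transcript of the question.
function buildVisionRequest(base64Png, transcript) {
  return {
    model: "gpt-4-vision-preview",
    max_tokens: 500,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: transcript },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64Png}` },
          },
        ],
      },
    ],
  };
}

// Sending it is a single POST with your own API key (BYO key):
async function askVision(apiKey, base64Png, transcript) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildVisionRequest(base64Png, transcript)),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}
```

The TTS step then just feeds that returned text into the speech endpoint.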
Author : ralfelfving
Score : 333 points
Date : 2023-12-12 13:17 UTC (9 hours ago)
| pyryt wrote:
| Do you have use case demo videos somewhere? Would be great to see
| this in action
| ralfelfving wrote:
| There's one at 00:30 in this YouTube video (timestamped the
| link): https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s
| faceless3 wrote:
| Wrote some similar scripts for my Linux setup, that I bind with
| XFCE keyboard shortcuts:
|
| https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...
|
| F1 - ask ChatGPT API about current clipboard content
| F5 - same, but opens editor before asking
| num+ - starts/stops recording microphone, then passes to Whisper
| (locally installed), copies to clipboard
|
| I find myself rarely using them however.
| ralfelfving wrote:
| Nice!
| ProfessorZoom wrote:
| e-e-e-electron... for this..
| ralfelfving wrote:
| I don't know man. I'm new to development, it's what I chose,
| probably don't know any better. Tell me what you would have
| chosen instead?
| xNeil wrote:
| electron's a really nice option, especially for people that
| aren't interested in porting their apps or spending too much
| time on development
|
| this is a macOS specific app it seems - if you want better
| performance and more integration with the OS, i'd recommend
| using swift
| ralfelfving wrote:
| Time to learn Swift in the next project then! Thank
| you for the deets.
| Filligree wrote:
| The good news is you already have a tool to help you with the
| inevitable Xcode issues. _grin_
| lolinder wrote:
| Don't mind them--there's a certain subset of HN that is upset
| that web tech has taken over the world. There are some
| legitimate gripes about the performance of some electron
| apps, but with some people those have turned into compulsive
| shallow dismissals of any web app that they believe could
| have been native.
|
| There's nothing wrong with using web tech to build things!
| It's often easier, the documentation is more comprehensive,
| and if you ever wanted to make it cross-platform Electron
| makes it trivial.
|
| If you were working for a company it might be worth
| considering the trade-offs--do you need to support Macs with
| less RAM?--but for a side project that's for yourself and
| maybe some friends, just do what works for you!
| ralfelfving wrote:
| Thank you for the explanation! At the end of the day, I'm a
| newbie and I'm in it to learn something new with each
| project. Next time I'll probably try my hand at a different
| framework.
| millzlane wrote:
| I just watched a video about building a startup. One of
| the key points was to use what you know to get an MVP.
| Don't fret over which language or library to use (unless
| the goal is to learn a new framework). Just get building.
| I may not be a pro dev, but there is one thing I have
| learned over the years from hanging out amongst all of
| you. And that is, it doesnt matter if you are using emacs
| or vim, tabs vs spaces, or Java vs Python, the end
| product after all is what matters at the end of the day.
| Code can always be refactored.
|
| Good luck in your development journey.
| jdamon96 wrote:
| ignore the naysayers; nice job building out your idea
| ralfelfving wrote:
| Thank you! I've got pretty thick skin, but there's always a
| bit of insecurity involved in doing something for the first
| time -- first public GH repo and Show HN :D
| airstrike wrote:
| I think the parent comment is a shallow dismissal, but since
| you're asking, I would have built in SwiftUI
| guytv wrote:
| What's important is to get a product out there. Nobody cares
| what stack you use. just us geeks. don't get discouraged. you
| did well :)
| programmarchy wrote:
| My two cents: I think you made a good, practical choice. If
| you're happy with Electron, I'd say stick with it, especially
| if you have cross-platform plans in the future.
|
| If you want to niche down into a more macOS specific app, you
| could learn AppKit and SwiftUI and build a fully native macOS
| app.
|
| If you want to stay cross-platform, but you're not happy with
| Electron, then it might be worth checking out Tauri. It
| provides a JavaScript-based API to display native UI
| components, but without packaging a V8 runtime with your app
| bundle. Instead, it uses a native JavaScript host e.g. on
| macOS it uses WebKit, so it significantly reduces the
| download size of your app.
|
| In terms of developing this into a product, on one hand it
| seems like deep integration with the host OS is the best way
| to build a "moat", but then again, Apple could release their
| own version and quickly blow a product like that out of the
| water.
| atraac wrote:
| Ah yes, cause what's better than building a real, working MVP?
| Learning Rust for half a year just so you can 'optimize' the f
| out of an app that does two REST calls.
| wtallis wrote:
| To be fair, this _does_ sound like the kind of app that would
| benefit from being able to launch instantly, and potentially
| registering with the OS as a service in a way that cross-
| platform frameworks like Electron cannot easily accommodate.
| But Rust would not be the easiest choice to avoid those
| limitations.
| havkom wrote:
| A lot of negative comments here. However, I liked it!
|
| Perfect Show HN and a great start for a product if the author
| wants to take it there.
| ralfelfving wrote:
| Thank you, it's my first GH project & Show HN.. and.. yeah..
| learning here :D
| jonplackett wrote:
| Also think this is fun.
|
| In general I'm pretty excited about LLM as interface and what
| that is going to mean going forward.
|
| I think our kids are going to think mice and keyboards are
| hilariously primitive.
| ralfelfving wrote:
| Before we know it, even voice might be obsolete when we can
| just think :) But maybe at that point, even thinking
| becomes obsolete because the AIs are doing all the
| thinking for us?!
| swiftcoder wrote:
| Worth mentioning that if you are in a corporate environment,
| running a service that sends arbitrary desktop screenshots to a
| 3rd party cloud service is going to run afoul of pretty much
| every security and regulatory control in existence
| thelittleone wrote:
| The control for that is endpoints should be locked down to
| prevent install of non approved apps. Any org under regulatory
| controls would have some variation of that. Safe to assume an
| orgs users are stupid or nefarious and build defences
| accordingly.
| ralfelfving wrote:
| I assume that anyone capable of cloning the app, starting it
| on their machine and obtaining + adding an OpenAI API key
| understands that some data is being sent offsite -- and will be
| aware of their corporate policies. I think that's a fair
| assumption.
| greenie_beans wrote:
| that's a fair assumption. feels like swiftcoder is just
| trying to gotcha
| brookst wrote:
| True, but also true of other screen capture utilities that send
| data to the cloud. Your PSA is true, but hardly unique to this
| little utility. And probably not surprising to the intended
| audience.
| isoprophlex wrote:
| You're telling me... the cloud... is other people's computers?!
| abrichr wrote:
| This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt
| we have implemented three separate PII scrubbing providers.
|
| Congrats to the op on shipping!
| jondwillis wrote:
| You should add an option for streaming text as the response
| instead of TTS. And also maybe text in place of the voice command
| as well. I have been tire-kicking a similar kind of copilot for
| awhile, hit me up on discord @jonwilldoit
| ralfelfving wrote:
| There are definitely some improvements to be made to shuttling
| the data between interface<->API; all that was done in a few
| hours on day 1 and there are a few things I decided to fix
| later.
|
| I prefer speaking over typing, and I sit alone, so probably
| won't add a text input anytime soon. But I'll hit you up on
| Discord in a bit and share notes.
| jondwillis wrote:
| Yeah, just some features I could see adding value and not
| being too hard to implement :)
| tomComb wrote:
| > text in place of the voice command as well
|
| That would be great for people with Mac mini who don't have a
| mic.
| ralfelfving wrote:
| Hmmm... what if I added functionality that uses the webcam to
| read your lips?
|
| Just kidding. Text seems to be the most requested addition,
| and it wasn't on my own list :) Will see if I add it, should
| be fairly easy to make it configurable and render a text
| input window with a button instead of triggering the
| microphone.
|
| Won't make any promises, but might do it.
| amelius wrote:
| Please include "OpenAI-based" in the title. (Now many people here
| are disappointed).
| ralfelfving wrote:
| Fair point, didn't think it would matter so much. Can't edit it
| any more, otherwise I'd change it to add OpenAI to the title!
| ukuina wrote:
| This is very cool! Thank you for working on it and sharing it
| with us.
| ralfelfving wrote:
| Thank you for checking it out! <3
| netika wrote:
| Such a shame it uses Vision API, i.e. it can not be replaced by
| some random self-hosted LLM.
| ralfelfving wrote:
| It can be replaced with a self-hosted LLM, simply change the
| code where the Vision API is being called. That's true for all
| of the API calls in the app.
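| To make that concrete: all three external calls could be routed through
| a small table so each one can be repointed independently. A hedged
| sketch (the env var names are hypothetical, not from the repo):

```javascript
// Illustrative: centralise the three external calls (speech-to-text,
// vision, text-to-speech) so each can be pointed at OpenAI or at a
// self-hosted, OpenAI-compatible server. Env var names are made up.
const endpoints = {
  transcribe:
    process.env.WHISPER_URL ||
    "https://api.openai.com/v1/audio/transcriptions",
  vision:
    process.env.VISION_URL || "https://api.openai.com/v1/chat/completions",
  tts: process.env.TTS_URL || "https://api.openai.com/v1/audio/speech",
};

function endpointFor(step) {
  const url = endpoints[step];
  if (!url) throw new Error(`unknown step: ${step}`);
  return url;
}
```

| With that in place, swapping Vision for a local model is just setting
| one env var, assuming the local server speaks the same request shape.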
| freedomben wrote:
| Actually it's open source, so it _can_ be replaced by some
| random self-hosted LLM
| iandanforth wrote:
| For example, one of these:
|
| https://opencompass.org.cn/leaderboard-multimodal
| jackculpan wrote:
| This is awesome
| ralfelfving wrote:
| Thanks, glad you liked it!
| knowsuchagency wrote:
| This is brilliant!
| ralfelfving wrote:
| Glad you liked it!
| satchlj wrote:
| It's not working for me, I get a "Too many requests" http error
| ralfelfving wrote:
| Hmm.. OpenAI lumps a few things into that error. IIRC this
| could be because you're out of credits / don't have a valid
| payment method on file, but it could also be that you're
| hitting rate limits. The Vision API could be the culprit, while
| in beta you can only call it X amount of times per day (X
| varies by account).
|
| Make the console.logs for the three API calls a bit more
| verbose to find out which call is causing this, and if there's
| more info in the error body.
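| The kind of logging suggested here can be as small as one helper that
| prints the HTTP status plus the error body, so a 429 rate limit can be
| told apart from a billing problem. A hedged sketch (not the repo's code):

```javascript
// Format an OpenAI-style error response for the console. OpenAI error
// bodies typically look like { error: { type, message, ... } }; anything
// else is dumped as JSON so no information is lost.
function describeApiError(callName, status, body) {
  const detail =
    body && body.error
      ? `${body.error.type || "unknown_type"}: ${body.error.message || ""}`
      : JSON.stringify(body);
  return `[${callName}] HTTP ${status} - ${detail}`;
}
```

| Calling this for each of the three API calls makes it obvious which one
| is failing and why.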
| I_am_tiberius wrote:
| I would love to have something like this but using an open source
| model and without any network requests.
| trenchgun wrote:
| Probably in three months, approximately.
| dave1010uk wrote:
| LLaVA, Whisper and a few bash scripts should be able to do it.
| I don't know how helpful the model is with screenshots though.
|
| 1. Download LLaVA from https://github.com/Mozilla-
| Ocho/llamafile
|
| 2. Run Whisper locally for speech to text
|
| 3. Save screenshots and send to the model, with a script like
| https://til.dave.engineer/openai/gpt-4-vision/
| behat wrote:
| Nice! Built something similar earlier to get fixes from chatgpt
| for error messages on screen. No voice input because I don't like
| speaking. My approach then was Apple Computer Vision Kit for OCR
| + chatgpt. This reminds me to test out OpenAI's Vision API as a
| replacement.
|
| Thanks for sharing!
| ralfelfving wrote:
| Thanks! You could probably grab what I have, and tweak it a
| bit. Try checking if you can screenshot just the error message
| and check what the value of the window.owner is. It should be
| the name of the application, so you could just append `Can you
| help me with this error I get in ${window.owner}?` to the
| Vision API call.
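| That append is a one-liner. Sketch, where `windowOwner` stands in for
| whatever value the window-metadata library actually returns:

```javascript
// Prefix the user's question with the active application's name so the
// model has context about which app the error comes from.
function buildErrorPrompt(windowOwner, question) {
  return `Can you help me with this error I get in ${windowOwner}? ${question}`;
}
```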
| thomashop wrote:
| Just used it with the digital audio workstation Ableton Live. It
| is amazing! Its tips were spot-on.
|
| I can see how much time it will save me when I'm working with a
| software or domain I don't know very well.
|
| Here is the video of my interaction:
| https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be
|
| Weird these negative comments. Did people actually try it?
| pelorat wrote:
| I mean it does send a screenshot of your screen off to a 3rd
| party, and that screenshot will most likely be used in future
| AI training sets.
|
| So... beware when you use it.
| thomashop wrote:
| Beware of it seeing a screenshot of my music set? OpenAI will
| start copying my song structure?
|
| You can turn it on and off. Not necessary to turn it on when
| editing confidential documents.
|
| You never enable screen-sharing in videoconferencing
| software?
| aaronscott wrote:
| I completely agree. A huge business with a singular focus
| isn't going to pivot into the music business (or any of the
| myriad use cases the general public throws at it). And if
| they did use someone's info, it's more likely an unethical
| employee than a genuine business tactic.
|
| Besides, the parent program uses the API, which allows
| opting out of training or retaining that data.
| mecsred wrote:
| Yes this makes perfect sense. As we know, businesses
| definitely do not treat data as a commodity and engage in
| selling/buying data sets on the open market as a "genuine
| business tactic". Therefore, since the company in
| question doesn't have a clear business case for data
| collection _currently_, we can be sure this data will
| never be used against our interests by any company.
| zwily wrote:
| OpenAI claims that data sent via the API (as opposed to
| chatGPT) will not be used in training. Whether or not you
| believe them is a separate question, but that's the claim.
| ralfelfving wrote:
| So glad when I saw this, thanks for sharing! Music production
| in Ableton was exactly the spark that lit this idea in my head
| the other week. I tried to explain to a friend who doesn't use
| GPT much that with Vision, you can speed up your music
| production and learn how to use advanced tools like Ableton
| more quickly. He didn't believe me. So I grabbed an Ableton
| screenshot off Google and used ChatGPT -- then I felt there had
| to be a better way, realized that I have my own use-cases, and
| it all evolved into this.
|
| I sent him your video, hopefully he'll believe me now :)
| thomashop wrote:
| You may be interested in two proof of concepts I've been
| working on. I work with generative AI and music at a company.
|
| MidiJourney: ChatGPT integrated into Ableton Live to create
| MIDI clips from prompts. https://github.com/korus-
| labs/MIDIjourney
|
| I have some work on a branch that makes ChatGPT a lot better
| at generating symbolic music (a better prompt and music
| notation).
|
| LayerMosaic lets you combine MusicGen text-to-music loops
| with our company's music library.
| https://layermosaic.pixelynx-ai.com/
| ralfelfving wrote:
| Oooh. Yes, very interested in MusicGen. I played with
| MusicGen for the first time the other week and created a
| little script that uses GPT to create the prompt and params
| which is stored to a text file along with the output. Let
| it loop for a few hours to get a few hundred output files that
| allowed me to learn a bit more about what kind of prompts
| gave reasonable output (it was all bad, lol!)
| ralfelfving wrote:
| My brain read midjourney until I clicked on the GH link.
| What a great name, MIDIjourney!
| ralfelfving wrote:
| Oh LayerMosaic is dope. I'm not entirely sure how it works,
| but the sounds coming out of it are good -- so you have me
| intrigued! Can I read more about it somewhere? I might have
| a crazy idea I'd like to use this for.
| mikey_p wrote:
| Is it just me or is it incredibly useless?
|
| "Here's a list of effects. Here's a list of things that make a
| song. Is it good? Yes. What about my drum effects? Yes here's
| the name of the two effects you are using on your drum channel"
|
| None of this is really helpful and I can't get over how much it
| sounds like Eliza.
| thomashop wrote:
| I made that video right at the start but since then I've
| asked it for example what kind of compression parameters
| would fit with a certain track and it could explain to me how
| to find an expert function which I would have had to consult
| a manual for otherwise.
| e28eta wrote:
| Did you find that calling it "OSX" in the prompt worked better
| than macOS? Or was that just an early choice that you didn't
| spend much time on?
|
| I was skimming through the video you posted, and was curious.
|
| https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s
|
| code link: https://github.com/elfvingralf/macOSpilot-ai-
| assistant/blob/...
| ralfelfving wrote:
| No, this is an oversight by me. To be completely honest, up
| until the other day I thought it was still called OSX. So the
| project was literally called cOSXpilot, but at some point I
| double checked and realize it's been called macOS for many
| years. Updated the project, but apparently not the code :)
|
| I suspect OSX vs macOS has marginal impact on the outcome :)
| e28eta wrote:
| Haha, makes perfect sense, thanks for the reply!
| hot_gril wrote:
| Heh. I remember calling it Mac OS back in the day and getting
| corrected that it's actually OS X, as in "OS ten," and hasn't
| been called Mac OS since Mac OS 9. Glad Apple finally saw it my
| way (except it's cased macOS).
| qainsights wrote:
| Great. I created `kel` for terminal users. Please check it out at
| https://github.com/qainsights/kel
| causal wrote:
| Chatblade is another good one:
| https://github.com/npiv/chatblade
| dave1010uk wrote:
| Very cool! Have you had much luck with Llama models?
|
| I made Clipea, which is similar but has special integration
| with zsh.
|
| https://github.com/dave1010/clipea
| Jayakumark wrote:
| Was following these two projects by a user on GitHub which
| make similar things possible with local models. Sending
| screenshots to OpenAI is expensive if done every few seconds or
| minutes.
|
| https://github.com/KoljaB/LocalAIVoiceChat
|
| While the one below uses OpenAI - don't see why it can't be
| replaced with the above project and local models.
|
| https://github.com/KoljaB/Linguflex
| ralfelfving wrote:
| Nice! Although the productivity increase from being able to
| resolve blockers more quickly adds up to a lot (at least for
| me), local models would be more cost effective -- and probably
| feel less iffy for many people.
|
| I went for OpenAI because I wanted to build something quickly,
| but you should be able to replace the external API calls with
| calls to your internal models.
| stephenblum wrote:
| You made real-life Clippy for the Mac! This would be great
| for other Mac apps too. Add context of currently running apps.
| ralfelfving wrote:
| It should work for any macOS app. It just takes a screenshot of
| the currently active window, you can even append the
| application name if you'd like.
| lordswork wrote:
| This looks very cool. Does anyone know of something similar for
| Windows? (or does OP intend to extend support to Windows?)
| ralfelfving wrote:
| Hey, OP here. I don't have a Windows machine so have not been
| able to confirm if it works, and probably won't be able to
| develop/test for it either -- sorry! :/
|
| I suspect that you should be able to take my code and only
| require a few tweaks to make it work tho, shouldn't be much
| about it that is macOS only.
| coolspot wrote:
| For testing/development, you can download a free Windows VM
| here: https://developer.microsoft.com/en-
| us/windows/downloads/virt...
| poorman wrote:
| Currently imagining my productivity while waiting 10 seconds for
| the results of the `ls` command.
| ralfelfving wrote:
| It's a basic demo to show people how it works. I think you can
| imagine many other examples where it'll save you a lot of time.
| hot_gril wrote:
| The demo on Twitter is a lot cooler, partially because you
| scroll to show the AI what the page has. Maybe there's a more
| impressive demo to put on the GH too?
| jamesmurdza wrote:
| Have you thought about integrating the macOS accessibility API
| for either reading text or performing actions?
| ralfelfving wrote:
| No, my thought process never really stretched outside of what I
| built. I had this particular idea, then sat down to build it. I
| had some idea of getting OpenAI to respond with keyboard
| shortcuts that the application could execute.
|
| E.g. in Photoshop: "How do I merge all layers" --> "To merge
| all layers you can use the keyboard shortcut Shift + command +
| E"
|
| If you can get that response in JSON, you could prompt the user
| if they want to take the suggested action. I don't see myself
| using it very often, so didn't think much further about it.
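| A hedged sketch of what that "respond with a shortcut as JSON" idea
| could look like on the receiving end. The field names here are made up
| for illustration; the app would confirm with the user before executing
| anything:

```javascript
// Parse a model response that is supposed to be JSON like
// {"answer": "...", "shortcut": ["Shift", "Command", "E"]}.
// Returns null (fall back to plain text) if the shape doesn't match.
function parseShortcutSuggestion(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return null; // model didn't return valid JSON
  }
  if (!Array.isArray(parsed.shortcut) || parsed.shortcut.length === 0) {
    return null;
  }
  return { answer: parsed.answer || "", keys: parsed.shortcut };
}
```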
| quinncom wrote:
| I'd love to see a version of this that uses text input/output
| instead of voice. I often have someone sleeping in the room with
| me and don't want to speak.
| ralfelfving wrote:
| You're not the first to request it. Might add it, can't promise
| tho.
| hackncheese wrote:
| Love it! Will definitely use this when a quick screenshot will
| help specify what I am confused about. Is there a way to hide the
| window when I am not using it? i.e. I hit cmd+shift+' and it
| shows the window, then when the response finishes reading, it
| hides again?
| ralfelfving wrote:
| There's a way for sure, it's just not implemented. Allowing for
| more configurability of the window(s) is on my list, because it
| annoys me too! :)
| hackncheese wrote:
| Annoyance Driven Development(tm)
| qup wrote:
| I have a tangential question: my dad is old. I would love to be
| able to have this feature, or any voice access to an LLM,
| available to him via an easy-to-press external button. Kind of
| like the big "easy button" from staples. Is there anything like
| that, that can be made to trigger a keypress perhaps?
| ralfelfving wrote:
| I personally have no experience with configuring or triggering
| keyboard shortcuts beyond what I learned and implemented in
| this project. But with that said, I'm very confident that what
| you're describing is not only possible but fairly easy.
| Art9681 wrote:
| Make sure to set OpenAI API spend limits when using this or
| you'll quickly find yourself learning the difference between the
| cost of the text models and vision models.
|
| EDIT: I checked again and it seems the pricing is comparable.
| Good stuff.
| ralfelfving wrote:
| I think a prompt cost estimator might be a nifty thing to add
| to the UI.
|
| Right now there's also a daily API limit on the Vision API
| that kicks in before it gets really bad, 100+ requests
| depending on what your max spend limit is.
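| A cost estimator of the kind mentioned here is mostly arithmetic over
| token counts. A sketch, with placeholder rates - these are NOT current
| OpenAI prices, look them up before relying on the numbers:

```javascript
// Back-of-envelope per-request cost estimate. Rates are per 1k tokens
// and are placeholders for illustration only.
const RATES = {
  inputPer1kTokens: 0.01,  // placeholder USD, not real pricing
  outputPer1kTokens: 0.03, // placeholder USD, not real pricing
};

function estimateRequestCost(inputTokens, outputTokens, rates = RATES) {
  return (
    (inputTokens / 1000) * rates.inputPer1kTokens +
    (outputTokens / 1000) * rates.outputPer1kTokens
  );
}
```

| Image inputs also add tokens based on resolution, so a real estimator
| would need to account for the screenshot size too.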
| qirpi wrote:
| Awesome! I love it! I was just about to sign up for ChatGPT Plus,
| but maybe I will pay for the API instead. So much good stuff
| coming out daily.
|
| How does the pricing per message + reply end up in practice? (If
| my calculations are right, it shouldn't be too bad, but sounds a
| bit too good to be true)
| ralfelfving wrote:
| I have a hard time saying how much this particular application
| costs to run, because I use the Voice+Vision APIs for so many
| different projects on a near daily basis and haven't
| implemented a prompt cost estimator.
|
| But I also pay for ChatGPT Plus, and it's sooo worth it to me.
|
| If you'd like to skip Plus and use something else, I don't
| think my project is the right one. I'd STRONGLY suggest you
| check out TypingMind, the best wrapper I've found:
| https://www.typingmind.com/
| qirpi wrote:
| Wow, thanks for sharing that link, I've been looking for
| something like this :)
| spullara wrote:
| Did you not find the built-in voice-to-text and text-to-speech
| APIs to be sufficient?
| ralfelfving wrote:
| Didn't even think of them to be honest.
| zmmmmm wrote:
| I've been wanting to build something like this by integrating
| into the terminal itself. Seems very straightforward and avoids
| the screen shotting. So you would just type a comment in the
| right format and it would recognise it:
|
|     $ ls
|     a.txt b.txt c.txt
|     $ # AI: concatenate these files and sort the result on the third column
|     $ # ....
|     $ # cat a.txt b.txt c.txt | sort -k 3
|
| This already works brilliantly by just pasting into CodeLLaMa so
| it's purely terminal integration to make it work. All I need is
| the rest of life to stop being so annoyingly busy.
| paulmedwards wrote:
| I wrote a simple command line app to let me quickly ask a
| question in the terminal - https://github.com/edwardsp/qq. It
| outputs the command I need and puts it in the paste buffer. I
| use it all the time now, e.g.
|
|     $ qq concatenate all files in the current directory and sort the result on the third column
|     cat * | sort -k3
| zmmmmm wrote:
| yep absolutely - have seen a few of those. And how well they
| work is what inspires me to want the next parts, which are
| (a) send the surrounding lines and output as context - notice
| above I can ask it about "these files" (b) automatically add
| the result to terminal history so I can avoid copy/paste if I
| want to run it. I think this could make these things
| absolutely fluid, almost like autocomplete (another crazy
| idea is to _actually_ tie it into bash-completion so when you
| press tab it does the above).
|
| CodeLlama with GPU acceleration on an M1 Mac is almost instant
| in response, it's really compelling.
| smcleod wrote:
| Nice project, any plans to make it work with local LLMs rather
| than "open"AI?
| ralfelfving wrote:
| Thanks. Had no plans, but might give it a try at some point.
| For me, personally, using OpenAI for this isn't an issue.
| hmottestad wrote:
| I think that LM Studio has an OpenAI "compliant" API, so if
| there is something similar that supports vision+text then it
| would be easy enough to make the base URL configurable and then
| point it to localhost.
|
| Do you know of a simple setup that I can run locally with
| support for both images and text?
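| Making the base URL configurable is a small change. A sketch of what
| that could look like; LM Studio's local server defaults to
| http://localhost:1234 with OpenAI-style routes, but whether a given
| local model accepts image parts is model-dependent, so treat this as an
| assumption to verify:

```javascript
// Build a config with OpenAI defaults that can be overridden to target a
// local OpenAI-compatible server instead.
function makeConfig(overrides = {}) {
  return {
    baseUrl: "https://api.openai.com",
    model: "gpt-4-vision-preview",
    ...overrides,
  };
}

function completionsUrl(config) {
  return `${config.baseUrl}/v1/chat/completions`;
}

// Example: point the same code at a local server.
const localConfig = makeConfig({
  baseUrl: "http://localhost:1234",
  model: "local-model", // placeholder name for whatever is loaded locally
});
```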
___________________________________________________________________
(page generated 2023-12-12 23:00 UTC)