[HN Gopher] Launch HN: Aqua Voice (YC W24) - Voice-driven text e...
___________________________________________________________________
Launch HN: Aqua Voice (YC W24) - Voice-driven text editor
Hey HN! We're Jack and Finn from Aqua Voice
(https://withaqua.com/). Aqua is a voice-native document editor
that combines reliable dictation and natural language commands,
letting you say things like: "make this a list" or "it's Erin with
an E" or "add an inline citation here for page 86 of this book".
Here is a demo: https://youtu.be/qwSAKg1YafM.

Finn, who is big-time dyslexic, has been using dictation software
since the sixth grade when his dad set him up on Dragon Dictation.
He used it through school to write papers, and has been keeping his
own transcription benchmarks since college. All that time, writing
with your voice has remained a cumbersome and brittle experience
that is riddled with pain points.

Dictation software is still terrible. All the solutions basically
compete on accuracy (i.e. speech recognition), but none of them
deal with the fundamentally brittle nature of the text that they
generate. They don't try to format text correctly and require you
to learn a bunch of specialized commands, which often are not worth
it. They're not even close to a voice replacement for a keyboard.

Even post-LLM, you are limited to a set of specific commands and
the most accurate models don't have any commands. Outside of these
rules, the models have no sense for what is an instruction and what
is content. You can't say "and format this like an email" or "make
the last bullet point shorter". Aqua solves this.

This problem is important to Finn and millions of other people who
would write with their voice if they could. Initially, we didn't
think of it as a startup project. It was just something we wanted
for ourselves. We thought maybe we'd write a novel with it - or
something. After friends started asking to use the early versions
of Aqua, it occurred to us that, if we didn't build it, maybe
nobody would.

Aqua Voice is a text editor that you talk to like a person.
Depending on the way that you say it and the context in which
you're operating, Aqua decides whether to transcribe what you said
verbatim, execute a command, or subtly modify what you said into
what you meant to write.

For example, if you were to dictate: "Gryphons have classic forms
resembling shield volcanoes," Aqua would output your text verbatim.
But if you stumble over your words or start a sentence over a few
times, Aqua is smart enough to figure that out and to only take the
last version of the sentence.

The vision is not only to provide a more natural dictation
experience, but to enable for the first time an AI-writing
experience that feels natural and collaborative. This requires
moving away from using LLMs for one-off chat requests and towards
something that is more like streaming where you are in constant
contact with the model. Voice is the natural medium for this.

Aqua is actually 6 models working together to transcribe,
interpret, and rewrite the document according to your intent.
Technically, executing a real-time voice application with a
language model at its core requires complex coordination between
multiple pieces. We use MoE transcription to outperform what was
previously thought possible in terms of real-time accuracy. Then we
sync up with a language model to determine what should be on the
screen as quickly as possible.

The model isn't perfect, but it is ready for early adopters and
we've already been getting feedback from grateful users. For
example, a historian with carpal tunnel sent us an email he wrote
using Aqua and said that he is now able to be five times as
productive as he was previously. We've heard from other people with
disabilities that prevent them from typing. We've also seen good
adoption from people who are dyslexic or simply prefer talking to
typing. It's being used for everything from emails to brainstorming
to papers to legal briefings.

While there is much left to do in terms of latency and robustness,
the best experiences with Aqua are beginning to feel magical. We
would love for you to try it out and give us feedback, which you
can do with no account on https://withaqua.com. If you find it
useful, it's $10/month after a 1000-token free trial. (We want to
bump the free trial in the future, but we're a small team, and
running this thing isn't cheap.)

We'd love to hear your ideas and comments with voice-to-text!
Author : the_king
Score : 658 points
Date : 2024-03-26 14:53 UTC (1 day ago)
| jasonjmcghee wrote:
| What are your opinions on
| https://github.com/cursorless-dev/cursorless?
|
| Are you targeting developers?
|
| My understanding was people who are serious about developing via
| voice use it pretty exclusively.
|
| Like, yeah you need to learn commands, but "are often not worth
| it" feels like brushing a pretty massive offering under the rug.
|
| Is learning vi / emacs commands not worth it (or shortcuts in
| another IDE?)
|
| Is there a middle ground?
| the_king wrote:
| Cursorless is really cool, but we see the ideal computer-voice
| interaction a little differently.
|
| Our approach is based around understanding intent from speech
| alone. We think this will be the ideal division of labor
| between man and machine going forward - let the person think
| and the machine fit it into the document/file/text. Over time
| we think this will reduce the number of commands you have to
| learn to use it to zero.
|
| But our "command-less" approach isn't reliable for every use
| case yet - and as a fan of voice interfaces I am rooting for
| Cursorless - it's super sci-fi.
| tomberin wrote:
| I was impressed with the Demo, ready to pay 10 and no option to
| sign up with email :(
| jmcintire1 wrote:
| Made some tradeoffs for the sake of speed -- email signup will
| come. We want it too!
| freedomben wrote:
| Good to hear! I very much dislike using my Google account and
| other third parties to sign up for accounts.
|
| Do you have any idea of how soon? Not looking for a public
| commitment to hang you with, just wondering if this is one of
| those "we're working on it now" (so days) or one of those
| "it's in the backlog" (months or maybe never depending on
| priorities).
| jmcintire1 wrote:
| Hah! Excellent question. Somewhere in between (i.e. it is
| not a p1 right now but "months" is way too long). Give us
| til after demo day then one of us will sit down and knock
| it out
| iknownthing wrote:
| Are these your models or a wrapper around model APIs?
| the_king wrote:
| We use our own fusion model in the transcription pipeline for
| intent understanding from encoded audio, but most of the
| rewriting tasks like "Turn this into a list" call out to fine-
| tunes of GPT-4. It's a combination.
|
| The fusion model is similar to the architecture described here:
| https://arxiv.org/abs/2310.13289
| elektor wrote:
| Trying out the app on Firefox gets me this error:
|
| NotSupportedError:
|
| AudioContext.createMediaStreamSource:
|
| Connecting AudioNodes from AudioContexts with different sample-
| rate is currently not supported.
|
| I would add that this really needs to be a native app with
| ability to use it within Microsoft Word, which itself has a
| decent voice to text tool built in.
| jmcintire1 wrote:
| Sorry about the Firefox error! Agreed on the sentiment behind
| native app -- we plan to get Aqua in as many places as possible
| asap. For product iteration, you can't beat the speed the
| browser affords.
| btown wrote:
| Make an Electron app that simply wraps your website! Just
| build in best-practices updating of the wrapper as well from
| day one, in case you want to ship improvements to the wrapper
| or start to move more things to client side processing.
|
| As a side benefit, you get real estate in people's docks and
| desktops :)
| freedomben wrote:
| Would that help with the problem of integration though?
| What would be absolutely killer would be to emulate a USB
| HID keyboard or something, which would make it usable with
| pretty much everything, though there are definitely some
| security considerations there. Or if there are higher-level
| APIs to hook into that could work, but I would guess those
| would also require native function calls.
|
| The way Google's keyboard works on Android, but on my Linux
| computer (and my Android phone) would be my dream here. I'd
| pay $10 a month for that for sure.
| 35mm wrote:
| I tried the demo, it worked well, allowing me to add a line and
| then delete the first line - a test that Dragon or Apple would
| have failed.
|
| What does the actual app look like though? Is it only in a
| browser or can I use this anywhere on my Mac?
| apinstein wrote:
| This is really great. I imagined such a thing should be created,
| amazing to see it in reality. It would be great for those of us
| not limited to exclusively voice to be able to use commands as
| well, as I still think in some cases doing explicitly what I want
| for simple things is easier than figuring out how to explain it
| :)
| the_king wrote:
| We agree totally; voice only can be ridiculous, for example, if
| you're spelling out a username or something.
|
| The sandbox doesn't have typing, but the full app does - you
| can switch between typing and talking seamlessly there.
|
| (written with Aqua)
| nylonstrung wrote:
| This is very cool. I would immediately buy it if someone ends up
| making an Obsidian plugin
| tremarley wrote:
| This would be very effective
| justanotheratom wrote:
| This is awesome.
|
| Video talks about a Mac App. Where can I get that?
|
| Voice input did not work on Edge browser on Windows, btw.
| the_king wrote:
| Thanks!
|
| We had to make a bunch of breaking API changes over the last
| week and the Mac app isn't ready to go on it quite yet, but
| we'll bring it back as soon as we can, max two weeks, hopefully
| sooner.
| parentheses wrote:
| This is very well done!!
| oliviabenson wrote:
| You're early and this is effectively a demo but just in case this
| is a blind spot: "token" is an in-the-weeds LLMism that means
| nothing in the context of transcription. Your costs may be
| measured in tokens but that's not relevant to customers. Just "A
| free trial" with no quantifier would be better than 1k tokens.
| the_king wrote:
| Appreciate the feedback, we'll take a look at that.
| agotterer wrote:
| This is a great point and a topic I've been thinking about
| myself. As more LLM services pop up that are subject to
| token/consumption pricing, what is the right pricing model for
| consumer based consumption products like this?
| oliviabenson wrote:
| Price based on value. Pricing is hard, something as simple as
| per-token is alluring because it doesn't require any thought
| but it's leaving a lot of money on the table. There's nothing
| unique about LLMs when it comes to pricing, all common
| pricing wisdom applies.
| agotterer wrote:
| That seems challenging to do with a writing/note taking app
| like this. First, what would the pricing tiers be based on?
| Word count? That would just be another way of saying token.
| Number of documents created? That puts you at risk of long
| unprofitable documents. Google Sheets doesn't really have
| this problem because the incremental cost of storage is
| relatively cheap. Tokens on the other hand are not cheap.
|
| How do you price based on value without a corollary to
| tokens? If you charged $40 for this service then maybe you
| don't provide enough value for the casual user who does the
| occasional school report. On the other hand you may be
| unprofitable for the doctor that decides to dictate all of
| her interactions every day or the author who dictates an
| entire book.
| samatman wrote:
| > _First, what would the pricing tiers be based on? Word
| count? That would just be another way of saying token._
|
| A customer sees "word count", they understand what's
| going on perfectly, right away. Tokens? More than half of
| them will think "what, like, game tokens? do I have to
| buy them in advance?"
|
| Generously, 10% of potential customers are going to have
| even an approximate idea of what a token means in this
| context, maybe 1% could tell you that words and tokens
| aren't _quite_ the same thing.
| nprateem wrote:
| Words. Just estimate how many tokens that'd be and talk in
| words, paragraphs, etc instead.
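|
| A minimal sketch of that estimate, assuming a rule-of-thumb
| ratio of roughly 0.75 English words per token. The ratio and
| both function names are illustrative assumptions, not anything
| Aqua has published:

```python
# Assumed rule of thumb for English text, not a measured value.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(token_budget: int,
                    words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Convert a token allowance into an approximate word count."""
    if token_budget < 0:
        raise ValueError("token_budget must be non-negative")
    return int(token_budget * words_per_token)


def describe_trial(token_budget: int) -> str:
    """Phrase a token-based trial limit in words for customers."""
    return f"Free trial: about {tokens_to_words(token_budget)} words"


print(describe_trial(1000))
```

| The real ratio depends on the tokenizer and the language, so
| the number shown to customers would need to be rounded down.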
| eggdaft wrote:
| Great point. Someone needs to replace this measure.
|
| Words? Minutes? Number of edits? Eg
|
| Free - try 10 minutes active editing a month, great for trying
| it out
|
| Light use - 120 minutes a month, perfect for jotting down a few
| things daily
|
| Pro - 600 minutes a month, write an entire essay by voice
|
| Ultra - unlimited. Make voice editing your main workflow and
| work 10 times faster
| jppope wrote:
| the signup just failed for me. the console was logging out the
| token... you might want to fix that
| jmcintire1 wrote:
| patching now!! good catch.
| jmcintire1 wrote:
| should be fixed.
| GordonS wrote:
| I have neuropathy in my arms, so this is something I'm very
| interested in!
|
| Do I have to use a specific Aqua Voice text editor, or can I use
| it in apps like JetBrains Rider and Visual Studio Code? If so,
| are there some kind of plugins that would allow using IDE-
| specific features? (e.g. "build and run the API project")
| jmcintire1 wrote:
| Hey! Right now our focus is getting the core tech solid and we
| can do that much faster if we aren't juggling multiple
| platforms and plugins (we learned this the hard way), but after
| that we are going to blitz into as many places as possible.
| hidelooktropic wrote:
| This was such a well executed demo. A few seconds in and I'm
| seeing the value. The core of the product is fully explained in
| just 36 seconds.
|
| It's less about how quickly all that transpires and more about
| presenting the product in a way that doesn't require a lot of
| talking around it. Well done.
| matsemann wrote:
| I agree, very well spent seconds. Straight to the point and
| immediately obvious what the product is doing and how useful it
| could be.
|
| My first thought, when reading the headline, was that this
| could be useful for my coworker that got RSI in both hands and
| codes using special commands to a mic. But after having watched
| it I think it can be much more than such a niche product.
| benpacker wrote:
| This is really great. I was hoping someone would build this:
| https://bprp.xyz/__site/Looking+for+Collaborators/Better+Loc...
|
| I would really happily pay $10 / month for this, but what I
| really want is either:
|
| - A Raycast plugin or Desktop app that lets this interact with
| any editable text area in my environment
|
| - An API that I can pass existing text / context + audio stream
| to and get back a heartbeat of full document updates. Then, the
| community can build Obsidian/VSCode/browser plugins for the huge
| surface area of text entry
|
| Going to give you $10 later this afternoon regardless, and
| congrats!
| samstave wrote:
| Take this [TEXT] read it and then let me tell you how to edit
| it:
|
| > _Certainly - let me grok your text!!... OK - I am ready!_
|
| BLAH BLAH BLAH...
|
| etc
| benpacker wrote:
| I would also love to integrate this into text areas in my app,
| or as an editor of a JSON object.
|
| That would let me quickly build an interface for editing
| basically any application state, which would be awesome!
| oulipo wrote:
| There should probably be a community effort to build an open-
| source version of this around Obsidian?
| GordonS wrote:
| The demo seemed to struggle a bit with my accent (Scottish),
| getting quite a few words wrong - for example, every time I said
| "test" it would write "taste". Is this something you can improve
| going forward?
| umanwizard wrote:
| https://m.youtube.com/watch?v=NMS2VnDveP8
| GordonS wrote:
| In the past when I've been in the USA, I've legit had to put
| on an American accent when calling for taxis and the like!
|
| I don't even have _that_ strong an accent, and I always try
| my best to enunciate correctly when talking to others _shrug_
| jmcintire1 wrote:
| I'm getting married in Scotland in December and will
| presumably want to be able to demo so you can bank on
| priority support and a hard deadline :)
| GordonS wrote:
| Lol, excellent :)
| the_king wrote:
| Sorry about that. We know we need to be better about that and
| of course add more languages.
|
| A few things to try to maximize your accuracy right now are:
|
| - Don't use AirPods, especially not AirPods Pro. Most built-in
| laptop mics or EarPods or a gaming headset are perfect. It
| doesn't need to be podcast quality.
|
| - Correct transcription mistakes as you would a person, then
| "plow through" and often the error will be corrected as you
| complete the sentence.
| soco wrote:
| All "normal" voice programs struggle with us non-native
| speakers and our funny accents (sample size: 2). The first
| try on your site was satisfactory but I'll have to lurk
| around more just to feel safer... And yes, I am really
| looking forward to more languages. And switching between
| them!
| pbronez wrote:
| What's the problem with the AirPods? Too much pre-processing?
| user_7832 wrote:
| Congrats on the launch!
|
| I absolutely love the idea, as a fellow neurodivergent who works
| much better over voice than text. My only feedback is... I'd love
| to run this with more control. I already run LLMs locally (LM
| Studio), and I can run something like whisper too. I understand
| that open-sourcing (or even making the source code available)
| might go against any commercialization attempt. However, there
| are some options (Red Hat-esque) where it may be possible to
| charge for business use and allow local running for free for
| personal use.
|
| On one hand you've got a solid first-mover advantage in a field
| where lots can benefit and use this, however if someone can bork
| together several layers of LLM output they might be able to offer
| competition (and such projects are often opensource, albeit
| sometimes less "polished".) If you offer a good deal you might
| have a good chance of major success. Best of luck!
| rrr_oh_man wrote:
| So... what do you want?
| user_7832 wrote:
| Ideally, some way to run this locally on my own machine. That
| would offer more power (and also allow the product at a
| lower/no cost without any demands on their servers). Are you
| from the Aqua Voice team by the way?
| amirhirsch wrote:
| Congratulations! This is really cool. Maybe your website could
| just load into the demo? Have a talking avatar that looks like a
| paperclip with googly eyes to explain how to use it...
|
| edit: I refreshed and then it did load with the blue mic button
| moconnor wrote:
| You don't say so explicitly, but it'd be good to know what data
| goes to the cloud - I presume all of it including speech
| recordings? Or is STT on device? Also what your privacy /
| retention policies are around this data.
|
| Excellent demo and great-looking product btw!
| geor9e wrote:
| I just spent 10 seconds trying it. It was able to interpret my
| intentions and parse out commands from the literal
| transcription. "bazinga but in all caps and with a j" became
| "BAZINJA". So at the minimum, it's going through an LLM in the
| league of llama, which if run locally in browser is slow as
| molasses on my ancient MacBook. So it's definitely going to the
| cloud. As a rule of thumb, you should just assume any website
| you didn't completely code yourself is sending every mouse
| movement and every text that you type and then backspace,
| including passwords, to a cloud big data analytics repo via a
| few javascript listeners.
| blueberrychpstx wrote:
| That's a hilarious over assumption but point taken
|
| Also I really enjoyed your analysis
| FlamingMoe wrote:
| First impression: Wow, this is awesome.
|
| So let's say I work in a quiet home office by myself. Could I
| just have Aqua open throughout the day and give it notes / to-dos
| without having to click the microphone on/off each time?
| jmcintire1 wrote:
| Thank you! And yes, the app has a Background mode which is
| designed for this use case exactly
| rickydroll wrote:
| I developed an RSI-related injury back in 94/95 and have been
| using speech recognition ever since. I would love a solution that
| would let me move off of Windows. I would love a solution
| allowing me to easily dictate text areas in Firefox, Thunderbird,
| or VS Code. Most important, however, would be the ability to
| edit/manipulate the text using what Nuance used to call Select-
| and-Say. The ability to do minor edits, replace sentences with
| new dictation, etc., is so powerful and makes speech much easier
| to use than straight captured dictation like most whisper apps.
| If you can do that, I will be a lifelong customer.
|
| The next most important thing would be the ability to write
| action routines for grammar. My preference is for Python because
| it's the easiest target when using ChatGPT to write code.
| However, I could probably learn to live with other languages
| (except JavaScript, which I hate). I refer you to Joel Gould's
| "natPython" package he wrote for NaturallySpeaking. Here's the
| original presentation that people built on.
| https://slideplayer.com/slide/5924729/
|
| Here's a lesson from the past. In the early days of
| DragonDictate/NaturallySpeaking, when the Bakers ran Dragon
| Systems, they regularly had employees drop into the local speech
| recognition user group meetings and talk to us about what worked
| for us and what failed. They knew that watching us Crips would
| give them more information about how to build a good speech
| recognition environment than almost any other user community. We
| found the corner cases before anybody else. They did some nice
| things, such as supporting a couple of speech recognition user
| group conferences with space and employee time.
|
| It seems like Nuance has forgotten those lessons.
|
| Anyway, I was planning on getting work done today, but your
| announcement shoots that in the head. :-)
|
| [edit] Freaking impressive. It is clear that I should spend more
| time on this. I can see how my experience of Naturally Speaking
| limited my view, and you have a much wider view of what the user
| interface could be.
| zellyn wrote:
| You should check out cursorless... it may be more directly
| targeting your use case
| rickydroll wrote:
| I saw it was based on Talon, but unfortunately, Talon makes
| things overly complex and focuses the user on the wrong part
| of the process. The learning curve to get started, especially
| when writing your action routines, is much higher than it
| needs to be. See: https://vocola.net/. It's not perfect; it's
| clumsy, but you can start creating action routines within 5
| to 10 minutes of reading the documentation. Once you exceed
| the capabilities of Vocola, you can develop extensions in
| Python based on what you've learned in Vocola. One could say
| that Talon is the second system implementation according to
| Mythical Man Month.
|
| My use case is dictating text into various applications and
| correcting that text within the text area. If I have to, I
| can use the dictation box and then paste it into the target
| application.
|
| When you talk about using speech recognition for creating
| code, I've been through enough brute-force solutions like
| Talon to know they are the wrong way because they always
| focus the user on the wrong thing. When creating code, you
| should be thinking about the data structure and the
| environment in which it operates. When you use speech-driven
| programming systems, you focus on what you have to say to get
| the syntax you need to make it compile correctly. As a
| result, you lose your connection to the problem you're trying
| to solve.
|
| Whether you like it or not, ChatGPT is currently the best
| solution as long as you never edit the code directly.
| stcredzero wrote:
| I remember being in a conversation back in 2002 or so, where
| some Smalltalkers were brainstorming over the idea of
| controlling the IDE and debugger with voice.
|
| It just so happens, that many of the interfaces one has to deal
| with are somewhat low bandwidth. (For example, many spend most
| of their time stepping over, stepping into, or setting
| breakpoints in a debugger.) Code completion greatly cuts down
| the number of options to be navigated second to second. It
| seems like the time has arrived for an interactive voice
| operated AI pair programmer agent, where the human is taking
| the "strategic" role.
| phillco wrote:
| > when the Bakers ran Dragon Systems
|
| For those who don't know what happened next, and why Dragon
| seemed to stagnate so much in the aughts, the story about how
| Goldman Sachs helped them sell to essentially Belgian Enron,
| months before they collapsed, was quite illuminating to me, and
| sad.
|
| https://archive.ph/Zck6i
| nerpderp82 wrote:
| Goldman Sachs is such a wonderful model of what is possible
| via Capitalism. I think they are holding back on what they
| really could achieve with a little will.
| jonplackett wrote:
| Down voters: sarcasm alert!
| Aeolun wrote:
| It's crazy to me they were helped by what were essentially
| boys right out of college, and they had any faith it would
| work...
| gcanyon wrote:
| That's only the intro. Here's the conclusion:
| https://www.cornerstone.com/insights/cases/janet-baker-v-
| gol...
|
| > Professor Gompers opined that at the time the acquisition
| closed, Dragon was a troubled company that was losing money
| and had regularly missed its own financial projections. It
| was highly uncertain whether Dragon could survive as a stand-
| alone entity. Professor Gompers also showed that technology
| stocks were on a downward trend, and L&H was the only buyer
| willing to pay the steep price Dragon demanded. Thus, he
| concluded that if the company had not accepted the L&H deal,
| Dragon likely would have declared bankruptcy. The jury found
| in favor of the defendants and awarded no damages to the
| plaintiffs.
| jmcintire1 wrote:
| Thank you! We love hearing stories like this.
|
| We want to get Aqua into as many places as possible -- and will
| go full tilt into that as soon as the core is extremely
| extremely solid (this is our focus right now).
|
| Great lessons from Dragon Dictation. Would love to learn more
| about the speech recognition user group meetings! Are those
| still running? Are you a part of any?
| rickydroll wrote:
| Unfortunately no. I think they faded out almost 20 years ago.
| The main problem was that without having someone able to
| create solutions, the speech recognition user group devolved
| into a bunch of crips complaining about how fewer and fewer
| applications work with speech recognition. We knew what was
| wrong; we knew how to iterate to where NaturallySpeaking
| should be, but nobody was there to do it.
|
| FWIW, I am fleeing Fusebase, formerly known as Nimbus,
| because they "pivoted" and messed up my notetaking
| environment. In the beginning, I went with Nimbus because it
| was the only notetaking environment that worked with Dragon.
| After the pivot, not so much. I'm giving Joplin a try. Aqua
| might work well as an extension to Joplin, especially if
| there was a WYSIMWYG (what you see is mostly what you get)
| front-end like Rich Markdown. I'd also look at heynote.
| stevenkkim wrote:
| On a somewhat unrelated note, I remember Nuance used to be
| quite litigious, using its deep patent collection to sue
| startups and competitors. I'm not sure if this is still the
| case now that they're owned by Microsoft, but you may want to
| look into that.
| rkagerer wrote:
| I always felt coding could be such a great fit for voice
| recognition, as you have a limited number of tokens in scope
| and know all the syntax in advance (so recognition accuracy
| should be pretty good). Never saw a solution that really
| capitalized on that, though.
| ikliuger wrote:
| This is super awesome. Do you develop your own models, or is this
| a wrapper around existing APIs? It would be great to have a way
| to introduce environment variables like my name, my preferences,
| and the topics I usually write about. I've actually written this
| comment using your service. Thank you. Looking forward to seeing
| what it becomes.
| paviva wrote:
| Great work, really hope you'll be able to pierce into the medical
| market eventually. Dragon is still useless to anyone who can
| touch type.
| justinlloyd wrote:
| From one dyslexic to another, who never got the option to even
| use a computer in school or college and instead was forced to
| write out everything long-hand, thank you so much for this.
|
| I use voice-to-text in the workshop and when taking notes and
| reviewing a PR. And all the current options are pretty much what
| you would expect. More focused on accuracy, which is usually
| quite poor, which, to paraphrase, "It's Erin with an E. Oh for
| **s sake, Erin. ERIN! E. R. I. N. <pause> N. I said N. Eh-rin.
| Fine. Whatever." so anything that can improve on that experience
| will be immensely helpful.
|
| Looking forward to seeing where you go with this, and I hope at
| some point you make a native desktop application.
| the_king wrote:
| I think I developed "bad handwriting" partially to hide
| misspellings--this is necessary in school though.
|
| On pure WER we are state-of-the-art in our testing, but more
| importantly, mistakes in Aqua are _correctable_.
|
| So you can speak your mind instead of having to wait until you
| have the perfect sentence and then dictate it.
|
| That said, we know it's not perfect, but we know a few more
| months of work will have it really solid.
| samstave wrote:
| _I have goosebumps!_
|
| Jiminy Crickets...
|
| I have SOOO many use cases for your thing.
|
| [edit: what does this mean: https://i.imgur.com/rHQt6ul.png when
| attempting to demo?]
|
| ---
|
| * I want an agent that I can speak to on Mobile headset as I love
| to think out loud - and air my thoughts and thought process
| through talking through my internal dialogue - if this could just
| capture what I am saying and log it and I can refine thoughts as
| I go.
|
| For example - I ride a lot. I try to cycle 1000 miles a month if
| I am doing a solid month - but else - I ride daily and it's a
| movement meditation. As I ride - I think through things and I
| speak through thought processes with differing opposing 'experts'
| in my internal monologue to self-argue through to a solution....
|
| If I could have this record all that, then random epiphanies I
| think through while on a ride will be captured in a _meaningful_
| way.
|
| ---
|
| * A meeting-notes-transcriber for whiteboard sessions.
|
| * record everything you say in an interview and be able to review
| after for self-coaching
|
| * talking through a dish as you wing the ingredients so that you
| speak out loud what you did (my grandmother was friends with
| Julia Child - my grandmother taught me to cook and when it came
| to measurements of things - they always wing it per feel/taste
| "salt to taste" for example means "eh... whatever")
|
| so to be able to talk through what you're 'winging it with' and it
| captures it into a salient reproducible recipe (I make a mean
| chimichurri, sometimes, if I can recall)
|
| * a voice "body cam" for things I may say in situations where I
| may be too flustered to recall.
|
| * Speak authoring - start telling a story outline so it captures
| a synopsis that you can further develop
|
| * Speech (like giving a speech) refinement as you can talk
| through the speech and capture and rework and reiterate etc
|
| and that's just off the top of my head through your demo....
|
| LOVE this.
| jmcintire1 wrote:
| Thank you, awesome to see how many ideas this inspired! We've
| thought about a lot of similar things ourselves and will
| certainly build some of them :)
|
| Sorry about the error you are getting! It's a Firefox thing. We
| will patch. In the meantime, Chrome/Safari will work
| samstave wrote:
| < _how many ideas this inspired!_
|
| I just want to qualify - you did not _inspire_ these ideas.
|
| These are desires sought which have been there for eons...
|
| You are not _inspiring_ them
|
| You have a tool that _ENABLES_ them.
|
| Seek that which already is a flustered pop of ideas waiting
| for a release valve for such thought.
|
| You are not inspiring - you are enabling that which is
| already there, think of it as which valve to open - the
| pressure is mounting upon your dyke.
| jmcintire1 wrote:
| that's a better way to put it!
| aneeqdhk wrote:
| Hands down one of the best AI demos I have seen. Last time I got
| a wow feeling like this was when ChatGPT was released.
| ryanisnan wrote:
| Wow, great demo! Excited to see this grow.
| rhyme-boss wrote:
| I use Apple dictation heavily for transcribing interviews. I've
| tried all the voice-to-text services out there and none have been
| reliable enough at transcribing an audio file. I've settled on
| playing audio in my headphones and pausing while I carefully
| dictate text into a document. If I could upload the audio file,
| get a first-pass transcription, and then go through and edit /
| make corrections with voice, that would be awesome.
|
| A difference in error rate from 20-something percent down to less
| than 5 percent sounds incredible.
| codeptualize wrote:
| Have you tried OpenAI Whisper? Last time I compared, it was
| quite a bit better than all the other options.
| mathisd wrote:
| Have you tried using Whisper from OpenAI? Aiko [0] has
| Whisper-v2-large built in and allows transcription of audio
| files.
|
| [0] https://apps.apple.com/fr/app/aiko/id1672085276
| jonplackett wrote:
| This is amazing. Just tried really mumbling along for a
| while and it got every word.
| LeoPanthera wrote:
| Is there anything like this for watching foreign television
| (or radio)? I don't want to create a document, I just want
| real-time translated subtitles, but I can't do it in advance
| for live shows.
| hantusk wrote:
| Check out Descript. It's been awesome when I used it in the
| past
| c0brac0bra wrote:
| Deepgram has been incredibly accurate for me.
| hamzakc wrote:
| Things like this always remind me of this excellent talk:
| https://youtu.be/8SkdfdXWYaI?si=MFxs7wFdqws0OeCi
|
| Worth a watch.
| michaelbuckbee wrote:
| Friendly FYI - not sure if this is a skill issue on my part or
| something that's not possible yet, but I couldn't figure out how
| to change the audio input. I think when it asked for microphone
| access (chrome latest, Mac) that it chose the Macbook microphone
| which won't work as it's docked.
| jdalgetty wrote:
| Not working for me on firefox/macos
| the_king wrote:
| Sorry about that, will fix asap. I love firefox.
| tremarley wrote:
| How much would 1000 tokens give us?
| the_king wrote:
| It's 1000-1500 words. I know that seems cheap of us but the
| cost to run the Aqua stack is eye-watering right now. We will
| increase this amount as we optimize.
| aaroninsf wrote:
| Infinite details to remark on, but,
|
| NO NOTES.
|
| This is the sort of thing that I forward to people who are
| skeptical about the disruptive capacity AI has, to take long-
| standing seemingly intractable problems, and "solve" them.
|
| Hats off. Truly inspiring in many senses!
| the_king wrote:
| Thank you very much. I can't say how much these comments
| motivate us.
|
| When I was a kid I saw tech as a really aspirational thing
| (jobs, iPod, touch screens, Skyrim). What can be done with
| modern GPUs and transformers is making me feel like that again.
| heaths1 wrote:
| Awesome demo. A challenge where I work is an extremely acronym-rich
| lingo. Is your model open to extension or learning in some
| fashion to accommodate picking up thousands of acronyms? We also
| can shift into rapid, specialized speaking patterns that I think
| are quite learnable but that are not really 'out of the box' for
| normal software products. I would think many industries could
| feature their own lingos like this.
| aronhegedus wrote:
| I liked how easy the demo was to play around with! I don't have
| much use for this product myself, but kudos for making
| something that clearly works very well!
| hubraumhugo wrote:
| Dictation software is huge in the healthcare industry. Every
| doctor uses it, and a solution like yours could likely make their
| work much more efficient.
|
| Have you explored this market segment?
| gardenhedge wrote:
| Why do doctors use it?
| lmiller1990 wrote:
| Not OP but a big part of a doctor's job is clinical notes.
| Typing is slow, talking is fast. Less time spent taking notes
| == more time with patient.
| voiceblue wrote:
| From exploring this segment for a while, I believe that
| dictation software is the "brick" in the hair-on-fire
| analogy (hence, it provides some relief, but is far from an
| actual solution). There _is_ a form of water (scribes on
| retainer) but it is too expensive for all but the most
| profitable of specialties. The problem to be solved is not
| "dictation but better," but "take this cognitive load away
| from doctors, and keep notes accurate." (which is what
| scribes with experience do.) In a broader sense, the
| problem to be solved is the American healthcare/insurance
| system (the reason these notes have to be taken in this way
| in the first place)...
|
| > Less time spent taking notes == more time with patient.
|
| This can be true in some cases, but from what I understand,
| industry wide it would end up more like:
|
| Less time spent taking notes == more patients scheduled.
|
| Which is still of value, but fails to solve the original
| point of a physician's frustration, and possibly makes it
| worse (assuming the physician is still the one generating,
| handling and verifying the notes, but with better
| efficiency).
| picohen wrote:
| check https://vetrec.io :)
| voiceblue wrote:
| Thanks! As I mentioned, I've been looking into this for a
| while. I'll add it to my list.
|
| Now I have:
| - vetrec.io
| - Abridge
| - Scribeberry
| - Scribematic
| - Notezap
| - Lytte
| - Deepscribe
| - FreedAI
| - s10 AI
| - Nable
| - DeepCura
| - DAX copilot
| - Suki
| - M*Modal
| - Amazon Healthscribe
|
| Slightly different (?) - maybe more human:
| - Overnight Scribe
| - Rev.ai
| hackable_sand wrote:
| Plus the output is legible
| quadragenarian wrote:
| My wife is a Radiologist and uses voice transcription
| literally ALL DAY LONG as she reads imaging and transcribes
| her findings. Powerscribe from Nuance, in case you're curious.
| eggdaft wrote:
| I'd consider dentistry first. It's still an open market in
| terms of SaaS, and they tend to have the same computer sat
| there all day constantly switching between the patient and the
| machine.
| rafram wrote:
| This is cool! Some feedback:
|
| - As others have said, "1000 tokens" doesn't mean anything to
| non-technical users and barely means anything to me. Just tell me
| how many words I can dictate!
|
| - That serif-font LaTeX error rate table is also way too boring.
| People want something flashy: "Up to 7x fewer errors than macOS
| dictation" is cool, a comparison table is not.
|
| - Similarly, ".05 Word Error Rate" has to go. Spell out what that
| means and use percentages.
|
| - "Forgot a name, word, fact, or number? Just ask Aqua to fill it
| in for you." It would be nice to be able to turn this off, or at
| least have a clear indication when content that I did not say is
| inserted into my document. If I'm dictating, I don't usually want
| anything but the words I say on the page.
| jmcintire1 wrote:
| Thanks for the feedback! On the last point, you can't see it in
| the sandbox, but the app has a Strict mode that does what
| you're looking for
| seabass-labrax wrote:
| > People want something flashy: "Up to 7x fewer errors than
| macOS dictation" is cool, a comparison table is not.
|
| Respectfully disagree on this one: as a startup, you can't
| effectively compete with the likes of Apple on flashiness.
| However, the very target market of those dictating large
| amounts of text will include a significant number of people in
| academia themselves. For those people, Aqua Voice will feel
| relevant. Those who aren't interested in comparison tables will
| simply skip over them :)
| musiciangames wrote:
| One of my favourite nitpicks, but IMO 7x fewer errors means
| -6 times the error rate. Maybe error rate reduced by 86%.
| zigzag312 wrote:
| Isn't it just errors_count/7? (errors_count * 1/7)
|
| For example, if you got 70 errors before you now get only
| 10 errors.
| musiciangames wrote:
| 7 times 70 = 490. 490 fewer than 70 is -420. But words
| mean what you want them to mean, so 7 times fewer to mean
| 1/7th is becoming commonplace.
|
| (edited because formatting swallowed asterisk for times)
| zigzag312 wrote:
| Is there any case where your interpretation makes sense?
| Would you ever say 0.86 times fewer instead of 0.86 of
| the size?
|
| If _new_ is 1/7 the size of the _old_, then 7 * new =
| old. It takes 7 times new_count to get the value of
| old_count, so new_count is 1/7th of old_count. "7x fewer" seems
| like a shorthand for that, but I'm not a native English speaker.
|
| x times != x times fewer
|
| 7 times => multiply
|
| 7 times fewer => multiply by fraction
| musiciangames wrote:
| I would never say times fewer, because it is ambiguous.
| zigzag312 wrote:
| When the value is above 1, it's not really ambiguous anymore,
| as only one interpretation makes sense. But I understand
| that it could still be incorrect if it's not a well-defined
| term in English.
| musiciangames wrote:
| You choose an interesting case at 1. What does 1 times
| less errors mean? To me it means no errors.
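The two readings argued over in this sub-thread can be put side by side with the numbers the commenters used (70 errors, a factor of 7). This is only an illustration of the arithmetic; the variable names are made up for the example.

```python
old_errors = 70
factor = 7

# Reading A (zigzag312): "7x fewer" means the old count divided by 7.
reading_a = old_errors / factor

# Reading B (musiciangames): take 7 times the old count away from it.
reading_b = old_errors - factor * old_errors

# An unambiguous way to state reading A: the percentage reduction.
reduction_pct = (1 - 1 / factor) * 100

print(reading_a)             # 10.0
print(reading_b)             # -420
print(round(reduction_pct))  # 86
```

Reading B lands on -420, which is -6 times the original count, and the percentage form matches the "reduced by 86%" wording suggested above.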
| djbusby wrote:
| Gotta put a back-whack in like \\* for *
| bukacdan wrote:
| I was wondering whether the table actually comes from some
| paper, or it's just a marketing trick for techy folks.
| passion__desire wrote:
| Just wanted to inform you that your demo video is actually
| unlisted and invisible to public. I hope that is not intentional.
| the_king wrote:
| Fixed. Thanks for the heads up.
| freedomben wrote:
| Some of my thoughts:
|
| 1. This is an amazing idea!
|
| 2. I love that it is browser-based so can work everywhere. Native
| app would let you integrate more tightly (such as becoming a
| "keyboard" on the system), but that probably means "a mac app"
| which doesn't do me any good on Linux. If you could keep the bulk
| of it in cross-platform tech and just do the small integration
| part with native code, I think supporting at least "the big
| three" is doable. I bet if you provided a good API, somebody in
| the open source world would even do the work for you, on Linux at
| least.
|
| 3. Would really prefer being able to sign up with my email, and
| not having to log in with a third party account.
|
| 4. Online-only access is definitely fine for now, but to stay
| competitive in the future I would keep an eye toward being able
| to run inference locally so you don't have to be online to use
| it. This would also be a way for you to reduce costs and offer a
| cheaper version. If I were you, my long-term goal would be for
| this to be used by everybody (though that's years down the road).
| Local inference does complicate monetization, but that can be
| figured out.
|
| 5. For me to really use this enough to pay out every month, it
| needs to be relatively easy for me to get the output into
| whatever app I'm using, whether that is Chrome, Slack, Gmail,
| Google Docs, Vim, Gedit, or anything else. This is undoubtedly
| related to item 2 above, but I figured it warranted its own
| mention as there may be solutions besides browser-based vs.
| native.
|
| 6. You're gonna have competitors hot on your heels, if they
| aren't already. Google in particular with GBoard on Android could
| be absolutely killer. Since it is Android-only, I don't think
| it's a major competitor now, but if they broadened it absolutely
| could be.
|
| 7. Do you have an exit strategy in mind already? Would you be
| willing to share anything on that? (I ask because it's relevant
| because your product could easily become part of my standard
| workflow, and I'm very conservative about becoming dependent on
| proprietary products, especially from startups). Please do not go
| native-only and only release a Mac app. At a minimum, please
| maintain the web-based version. And please for the love of all
| that is holy, don't sell to/get acquired by Apple! I want and
| need your product, and I don't and won't switch platforms (Fedora
| Linux currently) to get it.
|
| Really amazing idea and great work! It is rare that I see
| products that I think could actually "change the world" but this
| one has some potential by changing the way we interact with our
| computers!
| Centigonal wrote:
| Great product idea, excellent demo. Fantastic use case for LLMs.
| Keep it up!
| lxe wrote:
| Fascinating. Are you still using Whisper in any of these
| MoExperts to transcribe or do you have something custom? Would
| love to learn more about the tech.
| youssefabdelm wrote:
| I feel like I'd much prefer this as an API I can request and get
| realtime updates from so that I can hook it into any application.
| Is that on the roadmap?
|
| Also, latency seems a bit high; I wish it were faster. Maybe
| that's due to traffic right now.
| ajolly wrote:
| I'll certainly go give this a spin later as I use voice to text
| daily. My first few questions:
|
| How does the dictation accuracy compare to Talon's latest model, or
| Microsoft's new Voice Access? Or Dragon? You've got a few
| comparisons already but nothing that I actually use.
|
| What's the latency like?
|
| At least for me a general voice editor isn't useful; give me
| something that can send text to wherever my mouse is pointing and
| that's useful. Then make sure it works with Microsoft's Mouse
| Without Borders, Synergy, Barrier, Input Director etc.
|
| Oh and does it support a user dictionary?
| the_king wrote:
| We'll be releasing a custom dictionary and templates soon. We
| are testing them internally now, and they aren't quite reliable
| enough to release, but we understand how important this is for
| many workflows.
|
| On accuracy, we benchmark very well against even large async
| models, with a WER of .05-.06 and when Aqua does make a mistake
| you can often correct it by just telling it "no it's our side
| not outside" and it won't mangle the text.
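For readers unsure what a "WER of .05" means: word error rate is usually computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below follows that common definition. The thread does not say how Aqua measures it, so the normalization here (lowercasing, splitting on whitespace) is an assumption, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# "our side" heard as "outside": one substitution plus one deletion
# over a five-word reference.
print(word_error_rate("it is on our side", "it is on outside"))  # 0.4
```

Under this definition, a WER of .05 corresponds to roughly one word-level error per twenty reference words.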
| tumidpandora wrote:
| This site can't be reached(?)
| theonething wrote:
| > Aqua is smart enough to figure that out and to only take the
| last version of the sentence
|
| I wish Siri, Alexa, et al would do this as well. They seem to
| expect you to speak perfectly the first time.
| gleb wrote:
| Tried it. Seemed quite impressive. Two issues:
|
| - it consistently uses the word "two" instead of "to"
|
| - forcing Google OAuth as the only way to sign up is not a good
| idea. That prevented me from signing up.
| the_king wrote:
| Did you wait for the text to turn blue and then black? And were
| the twos still wrong then? The real-time text is non-final
| tokens and has many more errors than what is ultimately
| committed to the document (but committing is slower than we'd
| like at the moment).
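The split described above, between fast non-final text and slower committed text, can be sketched as a small buffer that keeps the two apart. This is a guess at the general pattern used by streaming transcription UIs, not Aqua's implementation; the `Transcript` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcript:
    committed: List[str] = field(default_factory=list)  # final segments
    tentative: str = ""                                  # latest non-final guess

    def on_partial(self, text: str) -> None:
        # Each partial result replaces the previous guess outright.
        self.tentative = text

    def on_final(self, text: str) -> None:
        # A final result is appended and the tentative guess is cleared.
        self.committed.append(text)
        self.tentative = ""

    def render(self) -> str:
        parts = self.committed + ([self.tentative] if self.tentative else [])
        return " ".join(parts)


t = Transcript()
t.on_partial("I want two go")   # quick but possibly wrong
t.on_final("I want to go")      # slower, corrected version replaces it
t.on_partial("home")
print(t.render())               # I want to go home
```

A UI built this way can style `tentative` differently from `committed`, which is one way to signal which text may still change.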
| gleb wrote:
| Yes I did. I even later tried to tell it to fix this, and was
| not successful.
| jzellis wrote:
| I tried it in Firefox on my Android and got this error when I
| tried to use the demo:
|
| "Error: NotSupportedError: AudioContext.createMediaStreamSource:
| Connecting AudioNodes from AudioContexts with different sample-
| rate is currently not supported."
| saint11 wrote:
| Same here
| mrandish wrote:
| FYI to the devs... I got the same error on Firefox Win11 x64.
| jmcintire1 wrote:
| Patching this now!
| jmcintire1 wrote:
| This is fixed!
| feverishaaron wrote:
| My child is profoundly dyslexic. This kind of tool is a game-
| changer for him.
| the_king wrote:
| Hope this can be helpful. We know there are still many kinks to
| iron out.
|
| On another note, I think once you leave school dyslexia can
| become a wash or even a net positive in the right setting. I
| think whatever the brain config is can be a huge unlock for
| creative thinking - it's not always super helpful in the school
| context, but can be really asymmetric in tech and probably
| other industries.
| WheelsAtLarge wrote:
| WOW!!! Just wow...
|
| When will we PC peeps get to use it?
| the_king wrote:
| You can use it in the browser right now! But we get it... native
| is better for voice stuff and we'll be in more places soon.
| samstave wrote:
| Train this model on this:
|
| >> _Dearest creature in creation, Study English pronunciation. I
| will teach you in my verse Sounds like corpse, corps, horse, and
| worse. I will keep you, Suzy, busy, Make your head with heat grow
| dizzy. Tear in eye, your dress will tear. So shall I! Oh hear my
| prayer._
|
| >> _Just compare heart, beard, and heard, Dies and diet, lord and
| word, Sword and sward, retain and Britain. (Mind the latter, how
| it's written.) Now I surely will not plague you With such words
| as plaque and ague. But be careful how you speak: Say break and
| steak, but bleak and streak; Cloven, oven, how and low, Script,
| receipt, show, poem, and toe._
|
| >> _Hear me say, devoid of trickery, Daughter, laughter, and
| Terpsichore, Typhoid, measles, topsails, aisles, Exiles, similes,
| and reviles; Scholar, vicar, and cigar, Solar, mica, war and far;
| One, anemone, Balmoral, Kitchen, lichen, laundry, laurel;
| Gertrude, German, wind and mind, Scene, Melpomene, mankind._
|
| >> _Billet does not rhyme with ballet, Bouquet, wallet, mallet,
| chalet. Blood and flood are not like food, Nor is mould like
| should and would. Viscous, viscount, load and broad, Toward, to
| forward, to reward. And your pronunciation's OK When you
| correctly say croquet, Rounded, wounded, grieve and sieve, Friend
| and fiend, alive and live._
|
| >> _Ivy, privy, famous; clamour And enamour rhyme with hammer.
| River, rival, tomb, bomb, comb, Doll and roll and some and home.
| Stranger does not rhyme with anger, Neither does devour with
| clangour. Souls but foul, haunt but aunt, Font, front, wont,
| want, grand, and grant, Shoes, goes, does. Now first say finger,
| And then singer, ginger, linger, Real, zeal, mauve, gauze, gouge
| and gauge, Marriage, foliage, mirage, and age._
|
| >> _Query does not rhyme with very, Nor does fury sound like
| bury. Dost, lost, post and doth, cloth, loth. Job, nob, bosom,
| transom, oath. Though the differences seem little, We say actual
| but victual. Refer does not rhyme with deafer. Feoffer does, and
| zephyr, heifer. Mint, pint, senate and sedate; Dull, bull, and
| George ate late. Scenic, Arabic, Pacific, Science, conscience,
| scientific._
|
| >> _Liberty, library, heave and heaven, Rachel, ache, moustache,
| eleven. We say hallowed, but allowed, People, leopard, towed, but
| vowed. Mark the differences, moreover, Between mover, cover,
| clover; Leeches, breeches, wise, precise, Chalice, but police and
| lice; Camel, constable, unstable, Principle, disciple, label._
|
| >> _Petal, panel, and canal, Wait, surprise, plait, promise, pal.
| Worm and storm, chaise, chaos, chair, Senator, spectator, mayor.
| Tour, but our and succour, four. Gas, alas, and Arkansas. Sea,
| idea, Korea, area, Psalm, Maria, but malaria. Youth, south,
| southern, cleanse and clean. Doctrine, turpentine, marine._
|
| >> _Compare alien with Italian, Dandelion and battalion. Sally
| with ally, yea, ye, Eye, I, ay, aye, whey, and key. Say aver, but
| ever, fever, Neither, leisure, skein, deceiver. Heron, granary,
| canary. Crevice and device and aerie._
|
| >> _Face, but preface, not efface. Phlegm, phlegmatic, ass,
| glass, bass. Large, but target, gin, give, verging, Ought, out,
| joust and scour, scourging. Ear, but earn and wear and tear Do
| not rhyme with here but ere. Seven is right, but so is even,
| Hyphen, roughen, nephew Stephen, Monkey, donkey, Turk and jerk,
| Ask, grasp, wasp, and cork and work._
|
| >> _Pronunciation -- think of Psyche! Is a paling stout and
| spikey? Won't it make you lose your wits, Writing groats and
| saying grits? It's a dark abyss or tunnel: Strewn with stones,
| stowed, solace, gunwale, Islington and Isle of Wight, Housewife,
| verdict and indict._
|
| >> _Finally, which rhymes with enough -- Though, through, plough,
| or dough, or cough? Hiccough has the sound of cup. My advice is
| to give up!!!_
|
| =====
|
| --
|
| I don't have the energy to defend a F up -- but there is a LOT of
| really cool development happening on HN... from AI to all sorts
| of SHOW and ASK and just a F-TON to keep track of.
|
| I am not an OCD content-influencer-focused type...
|
| But know --
|
| the VELOCITY of thought that is flowing through HN and human
| consciousness, as accelerated by our tipping-the-cup on AI, is
| having IRL consequences on both mentality and reality....
|
| If there is a community for a higher velocity firehose of where
| we are going, share it.
|
| So - we are spewing a firehose of ideas into the quantum future,
| as unknown-boomerangs
|
| The truth is to understand the boomerangs...
|
| ( to de-vague-lize this: Tesla:
|
| Pre compute an AI token. 3:6:9
|
| This token is a prime reflection of that.
| pablopeniche wrote:
| lol based, we'll do
| Arctic_fly wrote:
| I remember hearing about Dragon when I was in elementary school.
| It's cool to reflect on how far things have progressed in the
| last decade and a half.
| FloatArtifact wrote:
| Congratulations on an interesting project. There is a lost
| opportunity with your natural language only approach. The issue
| is natural language will never be efficient as an interface.
| Natural language helps with low domain knowledge. That's the plus
| side as it allows the end user to say a variety of phrases to get
| the desired result. Commands allow for surgical precision and
| efficiency/less voice strain for the end user. So there needs to
| be an approach that allows for both elements: natural language and
| commands. As users develop their own process and workflow they
| will create actions as commands. (high domain knowledge)
|
| Since these commands are self-created by the end user, they
| remember them for their specific purposes. These are often
| high-frequency commands, while low-frequency actions would still
| leverage the large language model. You have an opportunity here to
| leverage this workflow. Being able to create commands with a large
| language model is not something many projects have explored.
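One way to picture the hybrid approach suggested here is a dispatcher that checks a table of user-defined commands first and only hands everything else to the language model. The sketch below is purely illustrative: `make_dispatcher`, the `shout` command, and `fake_model` (a stand-in for a real model call) are all made up for the example.

```python
from typing import Callable, Dict

Command = Callable[[str], str]        # document -> new document
Fallback = Callable[[str, str], str]  # (document, utterance) -> new document


def make_dispatcher(commands: Dict[str, Command], fallback: Fallback):
    """Try exact-match user commands first; otherwise defer to the fallback."""
    def dispatch(document: str, utterance: str) -> str:
        key = utterance.strip().lower()
        if key in commands:
            return commands[key](document)   # cheap, precise, predictable
        return fallback(document, utterance)  # flexible natural language
    return dispatch


def fake_model(document: str, utterance: str) -> str:
    # Stand-in for a model call: simply treat the utterance as dictation.
    return (document + " " + utterance).strip()


dispatch = make_dispatcher({"shout": str.upper}, fake_model)

print(dispatch("hello", "Shout"))  # HELLO
print(dispatch("hello", "world"))  # hello world
```

The command table is where the user's own short, memorized phrases would live; anything not in it falls through to the slower natural-language path.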
| lukko wrote:
| This is amazing! It's very satisfying to use and the combination
| of transcription + intent seems like it has huge potential.
|
| I would love to use this in healthcare for dictating patient
| letters etc. I guess a local model / HIPAA compliance is some way
| off?
| dharma1 wrote:
| This is super cool! Should ideally happen at the OS level (some
| future version of Siri) across whatever apps you're using
| lolpanda wrote:
| I love this idea. Wish there were a browser extension so I could
| dictate in my emails.
| whiplash451 wrote:
| Congrats on the launch. The demo is truly impressive. On my Apple
| cell phone with a Chrome browser, the latency feels a little
| sluggish (I am sure you are working on it). Congrats again and
| all the best!
| the_king wrote:
| Thanks, appreciate it. We can do a lot better than most people
| experienced today in terms of latency.
| _qua wrote:
| This is awesome, will likely subscribe--just need to pare down
| some of my other subscriptions--there are too many tempting AI
| products lately.
| jmcintire1 wrote:
| I understand the feeling :) great to hear!
| cryogenicplanet wrote:
| This looks very sick!
|
| Any chance of getting this as native apps on mobile? Or, better
| yet, a global macOS utility like Dictation, so you can "type" in
| any app?
| ortusdux wrote:
| I've wanted something like this for data entry for a while now. I
| often find my hands full measuring things and need to take notes.
| Can this output/format tabular data?
| ambersahdev wrote:
| I don't think it would perform too well in tabular contexts, at
| least with just natural language, for reasons I've explored in
| one of my own projects [https://github.com/AmberSahdev/Open-
| Interface/?tab=readme-ov...].
|
| That said, I would still think it's very doable to set reserved
| keywords to navigate it ourselves while keeping it
| conversational.
| daotoad wrote:
| When I was playing around with the demo, I gave it a list of
| things to do and then asked it to convert the list to a
| markdown table and label the second column estimated duration.
| It worked like a charm. It set the first column heading to
| "Description" even.
|
| I was then able to go through my list very quickly and add
| times to each item.
|
| The one failure I had was when I asked it to add whitespace to
| visually align the table columns.
|
| The table was, however, converted back into a list when I asked
| it to turn the text into an email.
|
| It's not exactly punching numbers into a spreadsheet, but it
| worked pretty well for the simple use case I tried.
| ortusdux wrote:
| Thanks, I'll have to try it out. Ideally I'd like to ask for
| a table with lettered/numbered columns and rows, and then
| just call out "B6 is 2.024" as I go.
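The "B6 is 2.024" style of entry could be handled with a small parser sitting between the transcript and the table. The sketch below assumes single-letter columns, positive integer rows, and plain decimal values; `CELL_PATTERN` and `apply_utterance` are hypothetical names, and a real transcript would likely need more forgiving matching (for example, spelled-out numbers).

```python
import re

# Matches utterances such as "B6 is 2.024" or "b 6 is -3".
CELL_PATTERN = re.compile(
    r"^\s*([A-Za-z])\s*(\d+)\s+is\s+(-?\d+(?:\.\d+)?)\s*$"
)


def apply_utterance(table: dict, utterance: str) -> bool:
    """Store the value in table[(column, row)]; return False if no match."""
    match = CELL_PATTERN.match(utterance)
    if match is None:
        return False
    column = match.group(1).upper()
    row = int(match.group(2))
    table[(column, row)] = float(match.group(3))
    return True


table = {}
apply_utterance(table, "B6 is 2.024")
apply_utterance(table, "c 2 is -3")
print(table)  # {('B', 6): 2.024, ('C', 2): -3.0}
```

Utterances that do not match are left alone, so they could still be passed on as ordinary dictation.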
| daotoad wrote:
| As others have said, good job.
|
| This seems like it would be particularly good on a telephone or
| my watch. In those places it seems like a real game changer in
| terms of ability to take notes when the keyboard experience is
| less than awesome.
|
| Have you tried using it to write code? This could be amazing as an
| IDE/text editor plugin.
|
| It's nice to see someone do something that's not regrettable with
| AI. So many of the applications we see are horrible. What you've
| made is brilliant and very far from being just another cursed
| chocolate factory experience.
| mavsman wrote:
| Since voice-to-text has gotten so good I've used it a lot more
| and also noticed how distracting and confusing it can be. Using
| Apple's dictation has a similar feel to this where you're
| constantly seeing something that's changing on the screen. It's
| kind of irritating and I don't really know what the solution is.
|
| One suggestion I have here is to have at least two different
| sections of the UI. One part would be the actual document and the
| other would be the scratchpad. It seems like much of what you say
| would not actually make it into the document (edits, corrections,
| etc) so those would only be shown in the scratchpad. Once the
| editor has processed the text from the scratchpad then it can go
| into the document how it's supposed to. Having text immediately
| show up in the document as it's dictated is weird.
|
| Your big challenge right now is just that STT is still relatively
| slow for this use case. Time will be on your side in that regard
| as I'm sure you know.
|
| Good luck! Voice is the future of a lot of the interactions we
| have with computers.
| codercowmoo wrote:
| Distil-whisper is incredibly fast. Realtime on a 3060 Ti, and I
| used it to transcribe an 11 hour audiobook in 9 minutes.
| peddling-brink wrote:
| You know, those audiobooks already have transcriptions. Often
| written by the original author!
|
| I kid. Your comment made me think of a shower thought I had
| recently where I wished my audiobook had subtitles.
| jazzyjackson wrote:
| It really is a little absurd IMO that the text of the book
| is sold separately from the audio.
| chrisaiv wrote:
| Book publishing industry is different from audio
| recording industry.
| robbomacrae wrote:
| Not trying to hijack this. Great demo! But STT can be very much
| real-time now. Try SoundHound's transcription service available
| through the Houndify platform [0] (we really don't market this
| well enough). It's lightning fast and it's half of what powers
| the Dynamic Interaction demos that we've been putting out.
|
| I actually made a demo just like Aqua Voice internally
| (unfortunately didn't get prioritized) but there is really no
| lag. However it will always be the case where the model will
| want to "revisit" transcribed words based on what comes next.
| So if you want the best accuracy you do want to wait a sec or
| two for the transcription to settle down a bit.
|
| [0]: https://www.houndify.com
| jbmsf wrote:
| I love it when someone shares an idea that I wouldn't have
| considered (which is on me) and so clearly solves a real problem.
| raylad wrote:
| I tried using it for a screenplay and it more or less knew the
| format, but didn't remember what to do with the blocks of text or
| how to separate them properly.
|
| It would be handy to be able to select various formats and have
| it know to keep to that format.
|
| Also I really liked it but found it quite slow. Assuming that
| will improve over time.
| jmcintire1 wrote:
| 100% agree on speed -- there's lots we can and will do to
| improve that
|
| also good point on memory. this turns out to be relevant in a
| lot of cases!
| tkgally wrote:
| After watching the video demo and logging in, I was able to
| compose and edit text easily. Nice job.
|
| My own use case is a bit different from many others who have
| commented here. I'm a reasonably fast typist and don't currently
| have any physical or neurological issues that might make typing
| difficult. I have tried voice input methods a number of times
| over the years, as I thought speaking would be faster than
| typing, but I always went back to typing due to accuracy problems
| and difficulty editing.
|
| Aqua Voice does seem to be a significant advance. I'm going to
| try it out from time to time to see if I can get comfortable with
| voice input. If I can, I will subscribe.
|
| I drafted this comment using Aqua Voice, but I ended up editing
| it quite a bit with a keyboard before posting.
| harryp_peng wrote:
| I don't think there are genuine use cases for this except for
| accessibility features etc. I always doubted voice; however,
| I'm beginning to see that OpenAI's Voice Engine will potentially be
| huge and incredibly addictive to chat to. I.e. it's more
| interesting to confide in a human-sounding 'friend' than
| vanilla ChatGPT with a keyboard.
| the_king wrote:
| Appreciate it.
|
| I think the preference for voice versus typing is something
| that hits everyone differently, and I think as the reliability
| and speed improve, more and more people will find themselves
| using voice as a "tool in the toolbox," which aside from the
| occasional "Hey Siri, set a timer," isn't the case today.
| moonshotideas wrote:
| Love this app, love the story and everyone being so supportive is
| making my day! Hope you guys go all the way!
| reissbaker wrote:
| This is excellent and very impressive. Have you thought about
| offering this as an API? I'd bet there are lots of startups that
| want to easily integrate better speech-to-text for conversational
| AI (e.g. word correction, adding punctuation, etc), and would pay
| you for the service.
|
| (Personally, I would! Email me at matt@syntheticdreamlabs.com if
| you're interested in offering it as an API, I'd be pretty curious
| about pricing.)
| tebbers wrote:
| Yes, upvote for this please! We would pay to integrate this
| into our SaaS for our users.
| jmcintire1 wrote:
| We have thought about this: there are some boring reasons that
| it doesn't make sense to do _right now_ but I would definitely
| be interested in learning about your use case and keeping you
| posted if and when we decide to do it!
| reissbaker wrote:
| For sure! Email me if you want to chat more :)
| ceroxylon wrote:
| Impressive to see the total lack of response for "umm" and "ahh"
| filler sounds (in the text editing area), it does seem to
| recognize them for what they are. Also the rest of it seems
| valuable, especially the formatting abilities.
| habosa wrote:
| This is super impressive!
|
| I am really looking forward to the day when dictation like this
| can be done locally on our phones. I'd really like to do a lot of
| my basic messaging with my voice but the need to do corrections
| with that tiny keyboard means it's not much of a time saver.
| jmcintire1 wrote:
| 100%. This day will come!
| atlintots wrote:
| No way, I was literally mulling over the exact idea of voice-
| driven text editing (focused towards programming), using a mix of
| voice commands and usual speech to text. This is really exciting
| to see!
| the_king wrote:
| The programming use case is pretty interesting.
| Merik wrote:
| Nice work and congratulations on the launch! It's awesome when a
| project works as expected/assumed. I immediately started using it
| to build my packing list for my upcoming weekend away.
|
| Three suggestions:
|
| 1. If I try the demo on the landing page, and then sign up, I
| wanted it to automatically have the contents from my demo usage
| on the landing page copied into my newly created account.
|
| 2. I would like another option than an auto renewing
| subscription. If I could have loaded 5$ or 10$ that my usage
| would deduct from, instead of having to subscribe, I would have
| paid money right then and there, but I didn't because I have too
| many damn subscriptions (alternative is to offer a non auto renew
| $10 one month trial).
|
| 3. iOS app please :)
| jmcintire1 wrote:
| Thanks Merik! "It does what it says" is honestly my favorite
| feedback...
|
| 1. Good call on demo page, would be slick to sync that up with
| account. 2. Subscription exhaustion is a real thing --
| credits/usage based billing is an interesting idea. 3. asap!
| selvan wrote:
| Nice, (code like) Refactoring meets speech-to-text
| dools wrote:
| I used this to type an email on my phone, the editing process in
| particular was very smooth once I realised I had to wait a little
| bit. You can have ten of my dollars per month.
| the_king wrote:
| Thanks! Sorry about the lag, today was slower than normal, but
| we need to improve latency overall.
| dools wrote:
| I've got a pinned tab and the site saved to my home screen, I
| look forward to using it as you build it out, very promising
| product well done.
| ForrestN wrote:
| This is so cool! Great work. I'm writing this comment using Aqua
| Voice, and it's very impressive. I've been waiting for something
| like this. As a neurodivergent person, certain tasks (cough,
| email, cough) are about 10 times harder sitting down at my
| computer than they are handling them aloud with my assistant.
|
| I'm sure you get this feedback 100 times a day, but I'd gladly
| pay a substantial amount to use this in place of the system
| dictation on my Mac and iPhone. Right now, the main limitation to
| me using it constantly would be the endless need to copy and
| paste from this separate new document editor into my email app or
| into Notion or Google Docs, etc.
| the_king wrote:
| Awesome. Agree on the copy-paste annoyance, we're working on
| more clients.
|
| But I do think that the reliability needs to take a few more
| steps before it becomes a true keyboard replacer.
| ForrestN wrote:
| Thanks for all your hard work! Even, as a start, I found
| myself asking the app to copy the text to the clipboard for
| me without even thinking. Might be nice to be able to do that
| more seamlessly, just as a start?
|
| You've moved us all a lot closer to my dream: taking a long
| walk outside, AirPods in, and handling the day's email
| without even looking at a screen once.
| the_king wrote:
| That's a great idea, we should do that.
|
| I have a similar dream, we'll make it happen!
| ForrestN wrote:
| Two more small pieces of feedback, in case they're useful:
|
| - Consider a time-based free trial. As others have said, tokens
| are confusing, but also your model is unlimited so the chunk of
| tokens doesn't allow me to see what it might be like to
| actually use your product. I'm more than halfway through my
| tokens after writing an HN comment and a brief todo list for
| work, so I've been able to see what it'd be like to pay the $10
| for about 5 minutes worth of work, which feels like a very
| short trial. A week, say, seems fair? And then you have some
| kind of cap on tokens that only comes up if someone uses an
| abusively huge amount (an issue, I'm sure, you'd face with
| paying customers too, right?)
|
| - I had a bit of trouble with making a todo list--I kept
| wanting the system to do a "new line" or "next item" and show
| me a new line with a dash so I know I'm dictating to the right
| place, but I couldn't coax it into doing that for me. I had to
| sort of just start on the next item and then use my keyboard to
| push return. When making lists, it's good to be able to do so
| fluidly and intentionally as much as possible. Sometimes it did
| figure out, impressively, that long pauses meant I wanted a new
| line. But not always.
| ashton314 wrote:
| Holy crap I'm blown away by the demo. That was so easy and
| natural to use.
|
| I had a brush with RSI some years ago. I'm good now with an
| ergonomic keyboard and better habits (water, exercise) but it
| brings me great comfort knowing something like this exists. Thank
| you!
| jmcintire1 wrote:
| Thank you for trying it! Our goal is to make it not just a
| suitable alternative but way better than your keyboard. Still a
| lot of work to do, but that is the fun part :)
| eiiot wrote:
| One of the first Launch HN products I've been excited about in a
| long time. I'm a student, and really looking forward to using
| this to write papers, assignments, emails, etc.
|
| Congrats!
| jmcintire1 wrote:
| Thanks Eliot! Let us know how we can make it better for you!
| saretup wrote:
| This is really nice. I tried it out on the app and wanna start
| using it more. The only issue stopping me is the privacy policy.
|
| I understand why you have to retain the voice data, however, is
| there any way you can implement an opt-out feature?
|
| I'm just not comfortable with my voice data lingering in some
| servers, ready to be used for training ml models.
| moomoo11 wrote:
| How else can it get better?
| rrr_oh_man wrote:
| Money
| methodicallymad wrote:
| Are you saving the voice recordings and/or using that data for
| training?
| deepGem wrote:
| Impressive demo.
|
| I noticed a correction that was done retroactively in the demo
|
| 'make that H100 GPUs'
|
| and noticed that there was only one instance of the token GPU.
| Hence the correction was seamless. Had there been a couple more
| instances I guess all GPU tokens would have been replaced by H100
| GPUs. I guess you could say make that NVIDIA H100 GPUs that would
| be more accurate but if there were multiple instances and you
| needed the change only in one instance, not sure how that'd fly.
| I am nitpicking but this could be a common theme.
|
| The fact this can retroactively change the text and also
| understand a command is quite brilliant. I don't see any trigger
| word for a command, so wondering if I needed the command as part
| of actual text, how would that work ?
| the_king wrote:
| Great question. I just tested it, and Aqua was smart enough to
| figure out which "GPU" I was talking about using context.
|
| In the example below I asked it to "make it H100 GPUs" and it
| only modified the GPU in the list.
|
| Aqua isn't perfect though, and while I think we are mostly
| solving this case there are plenty where we need to do better.
|
| ---
|
| Hey Team,
|
| I just had a chat with the marketing people, and they confirmed
| we should buy enough GPUs to meet the demand.
|
| Our Equipment List:
|
| - 1,000 H100 GPUs
|
| - 1,000 processors
|
| - 500 NVME SSDs
|
| - 200 standard racks
|
| edit: formatting
| eggdaft wrote:
| Suggestion for go-to-market: if you haven't already done so, try
| to sell this direct to universities. They are an absolute pain to
| deal with, but they have various obligations to show how they are
| supporting dyslexic students etc and this fits the bill
| perfectly.
|
| Their lead times are long so I'd start establishing trials now,
| partner with a university on agreement of a discount and use them
| to develop the software.
|
| Once you've got a few uni's onboard, you can rapidly expand and
| they are very unlikely to churn as you're serving such an
| important niche.
|
| I know a very similar product that has had huge success doing
| this.
| the_king wrote:
| Great idea.
|
| This is definitely worth looking into. We don't want to slow
| down the pace of development, but this might be a case where
| the partnership makes too much sense.
|
| A huge portion of what goes on in universities is writing.
|
| edit: Reading this back, I thought I sounded too eager to
| partner with universities. We're a tech company, and quality
| and performance will always come first.
| eggdaft wrote:
| Working with universities won't slow your pace of
| development, because you can work directly with users on
| that.
|
| But go-to-market, after the initial delay, uni's will drive
| sales fast at scale.
| apex_sloth wrote:
| I'm surprised that universities have to consider dyslexic
| students. When I went to university, I was basically told to
| figure it out. "It's your problem after elementary school.
| nobody cares."
| ragebol wrote:
| This is great! I'd love this as a plugin for Obsidian :-)
| jellyfish24 wrote:
| Super cool and I'm excited to try it out more as I also prefer
| voice to typing.
|
| I think I'm having trouble getting it to edit the existing text
| when writing though? https://i.imgur.com/IfoWvMG.png
| the_king wrote:
| Yikes! I think we may have made normal mode a bit too conservative
| at the moment. We've got a lot more tuning to do.
|
| One thing you can try if Aqua doesn't seem to be "getting" what
| you're after is to say something like "make this the email," or
| "transform this into the email I want to write." Sometimes it
| needs an extra push.
| kaspian wrote:
| Well done! Without context I was a bit confused regarding the
| usability given the speed of the models (which of course can and
| will be improved), but given your story I'm sold and sure this
| will help a lot of people.
|
| More input modalities are IMO always better, and being able to
| switch to this in the future when your fingers get tired would be
| awesome. My key pain point from the demo is the speed, but I'm
| sure that will go up as models and inference speed get better.
|
| One awesome thing would be to integrate the contextual
| understanding with a programming copilot, so you can pair program
| with only your voice as input.
|
| Rooting for you guys!
| the_king wrote:
| Thank you. 100% I'm with you on speed.
|
| I think we will be in a much better place speed-wise in a few
| months; some of that will be our stack, and some of that is
| what is happening lower down in the stack, but it will be
| meaningfully faster and more responsive soon.
| authorfly wrote:
| Nice demo, I like this.
|
| Will this work for programming languages? I often would like to
| type code by talking and especially, to talk into AI coding tools
| that help you complete code.
|
| Will this allow me to add links to text based on context? Like
| "Go to <mysite>/blog and link to the article on Voice AI" - I do
| this quite a lot in my article writing.
|
| Will this work over the sound of me cooking a loud fry up?
|
| Take all 3 and I may never leave the kitchen again.
| idk1 wrote:
| This is incredible, I said go back and swap one word with
| another, and it did it, this has blown my mind, I've not been able
| to do that before.
|
| I'm a heavy voice dictation user, and I would switch to this in a
| heartbeat. I'll tell you why this is so impressive, it means you
| can make mistakes and correct them with voice, it takes away the
| overhead of preparing a sentence in your mind before saying it,
| one of the hardest things about voice dictation.
|
| I often have my shoulder in pain, and I have to reach for my
| mouse to change a word, I would not if I used this. This software
| would literally prevent me pain.
|
| However, I cannot use it without a privacy policy. I have to know
| where the recording of my voice is being saved, if it's being
| saved, and what is it going to be used for.
|
| I would pay extra for my voice to be entirely deleted and not
| used afterwards, that could even be an upsell part of your
| packages. Extra $5 to never save your voice or data.
|
| I love it, but I can't use it for most things without a privacy
| policy.
| SuperHeavy256 wrote:
| How soon can I use this to write code?
| geniium wrote:
| The video demo is sooo slow. Seems like I would type it faster
| evrenesat wrote:
| I've just subscribed! Congratulations for this well crafted,
| immediately useful tool. It's something I was already looking
| for.
|
| Some Feature Requests:
|
| - Please implement proper undo/redo history and allow us to use
| it with voice, GUI and keyboard.
|
| - When I take over with keyboard, do not mess while I'm writing.
|
| - When I start speaking, if the text cursor is in between words,
| I'd expect it to insert, not append. (if you want to avoid this
| from happening accidentally, after some idle time, you may move
| cursor to end of the text)
|
| - A little advanced feature; It would be very nice to have some
| form of tagging for sentences and/or paragraphs with numbers,
| text or colors. So it would be easier to delete, reorganize, move
| around them by saying like "move the green section before purple
| one" or "part 12 should go before part 4", then we can let the
| language model do its magic and rephrase/reformat it a bit.
| Your AI is probably smart enough to understand which sentence we are
| referring to by just hearing part of it, but as a user we may
| feel too lazy for that.
|
| - Add a cheat-sheet for what's possible next to editor. Something
| more succinct than the notion doc.
|
| - Allow pausing/resuming dictation with voice command.
|
| Note: I would really like to be able to benefit from its smart
| features while I'm not sitting in front of my computer or holding
| my phone. But I'm not sure there will be a day when Apple lets us
| interface with smarter AI agents like ChatGPT or Aqua Voice on
| iOS while the screen is locked. IMO, this is gatekeeping of an
| inferior feature (Siri) in the name of protecting my privacy. I
| hope some day the EU will also intervene with that.
| appel wrote:
| That looks amazing, congrats!
|
| Minor friendly heads up, the withaqua.com link in the description
| of your youtube video is currently not a link.
| mike31fr wrote:
| Amazing! I'm French and I'd like to know if there is a chance I
| can dictate in French to Aqua Voice someday? If yes, any idea
| when that would be implemented? Great work!
| benpacker wrote:
| One more piece of feedback: I'd like the new note page to have a
| unique route so that I can make a shortcut for creating a new
| note to easily put somewhere else in a document
| SimonDuerr wrote:
| Really cool product! The demo worked well for me. This must be
| incredibly useful to so many people.
| ayeager wrote:
| would subscribe in a heartbeat if i could upload audio files for
| transcription.
| joshstrange wrote:
| I think I would pay for this in a heartbeat if it was more
| "available", as in not just a web app. I'd love a native app
| where I could use this in any textbox (web, app, etc) on my Mac.
| Ideally I'd be able to remap the key I use for mac dictation. I
| think I'd be fine with a popup where I write all the text and
| then it just inserts it after it finishes (so you can just render
| your UI and paste in the result instead of needing to interact
| with existing text fields).
| Tistron wrote:
| Cool product. I signed up.
|
| I wish there was a clear way of sending you feedback though,
| there are some details that annoy me a lot.
|
| I'm transcribing some recorded sound, and after I have had it
| transcribed and I am editing it, every time I tab away from the
| browser the cursor position gets lost. The focus on the textarea
| is also lost, and when I click it, it doesn't insert the cursor
| where I click but at the start of the document so I even lose my
| scroll position.
|
| As a paying customer I'd hope to have a way to give you this kind
| of feedback. It should be fairly easy to make this a much better
| experience.
|
| Really cool product all in all though! I don't often subscribe to
| stuff.
| jmcintire1 wrote:
| Absolutely, and sorry for the UX annoyances. We'll add a way to
| quickly share this kind of thing from within the app.
| throwaway4aday wrote:
| As many others have noted, once you've got everything stable (and
| hopefully profitable) you should seriously explore a way to use
| this as input into any text field in any program. Microsoft is
| actively experimenting with something similar in Copilot Voice
| although theirs is very integrated with the editor and
| specialized for code. It would be great to have these types of
| voice interfaces in all software. Maybe you could look at
| providing a way to integrate with your system through an API so
| others could do the heavy lifting of creating a native experience
| for each app?
|
| Absolutely amazing product by the way! The 1000 free tokens is
| enough, the fact that people are complaining about running out
| too soon is good, it shows that they like the product and want to
| use it more. They do have a point about adding a rough word
| count, maybe just a subheading that says "on average, X spoken
| words".
| aminick wrote:
| Beautiful. We are exploring new ways of human-machine interaction.
| C-Loftus wrote:
| Nice work. I am very involved in the Talon community and it is
| cool to see other projects tackling voice interaction from
| different perspectives.
|
| I develop a very similar natural language voice interaction tool
| using the OpenAI API and Talon as the engine[0]. (i.e. you apply
| any voice command transformation with AI on any text, or use it
| alongside Cursorless for semantically targeting scopes in the
| AST) You can use my solution with offline LLM models too.
|
| If you are interested in chatting, please reach out[1], as I am
| very interested and experienced in this space.
|
| [0] https://github.com/c-loftus/talon-ai-tools
|
| [1] https://colton.bio/contact/
| mauroszu wrote:
| This is useful, I hope this is enabled for other languages soon
| Onawa wrote:
| Just thought I'd let you know about this event that popped up in
| my inbox that I think you should definitely attend.
|
| 2024 Bridge2AI Voice Symposium | Voice as a Biomarker of Health
|
| https://www.eventsquid.com/event.cfm?id=22807
|
| The 2024 Voice AI Symposium will be a groundbreaking 2-day event
| and unique opportunity to connect with stakeholders invested in
| artificial intelligence and voice biomarkers. This year's
| symposium will serve as a nexus for dialogue, collaboration,
| awareness, and engagement across diverse sectors and members of
| the community about the use of voice artificial intelligence in
| healthcare. Attendees will experience dynamic speakers, panels,
| and networking opportunities. Innovative interactive events
| include a Call for Science with 3 submission categories, a Voice
| AI Tech Fair, and a patient challenge competition.
| jmcintire1 wrote:
| Great recommendation, thanks for sharing!
| m0d0nne11 wrote:
| Paywalled behind Google Oauth? Why?
| genewitch wrote:
| it's susceptible to subtle prompt injection. I got it to output
| LLM like responses to just snippets of speech.
___________________________________________________________________
(page generated 2024-03-27 23:02 UTC)