[HN Gopher] Show HN: A real time AI video agent with under 1 sec...
___________________________________________________________________
Show HN: A real time AI video agent with under 1 second of latency
Hey it's Hassaan & Quinn - co-founders of Tavus, an AI research
company and developer platform for video APIs. We've been building
AI video models for 'digital twins' or 'avatars' since 2020. We're
sharing some of the challenges we faced building an AI video
interface that has realistic conversations with a human, including
getting it to under 1 second of latency. To try it, talk to
Hassaan's digital twin: https://www.hassaanraza.com, or to our
"demo twin" Carter: https://www.tavus.io We built this because
until now, we've had to adapt communication to the limits of
technology. But what if we could interact naturally with a
computer? Conversational video makes it possible - we think it'll
eventually be a key human-computer interface. To make
conversational video effective, it has to have really low latency
and conversational awareness. A fast-paced conversation between
friends has ~250 ms between utterances, but if you're talking about
something more complex or with someone new, there is additional
"thinking" time. So, less than 1000 ms latency makes the
conversation feel pretty realistic, and that became our target.
Our architecture decisions had to balance 3 things: latency, scale,
& cost. Getting all of these right was a huge challenge. The first
lesson we learned was that to make it low-latency, we had to build
it from the ground up. We went from a team that cared about seconds to a
team that counts every millisecond. We also had to support
thousands of conversations happening all at once, without getting
destroyed on compute costs. For example, during early development,
each conversation had to run on an individual H100 in order to fit
all components and model weights into GPU memory just to run our
Phoenix-1 model faster than 30fps. This was unscalable & expensive.
We developed a new model, Phoenix-2, with a number of improvements,
including inference speed. We switched from a NeRF based backbone
to Gaussian Splatting for a multitude of reasons, one being the
requirement that we could generate frames faster than realtime, at
70+ fps on lower-end hardware. We exceeded this and focused on
optimizing memory and core usage on GPU to allow for lower-end
hardware to run it all. We did other things to save on time and
cost like using streaming vs batching, parallelizing processes,
etc. But those are stories for another day. We still had to lower
the utterance-to-utterance time to hit our goal of under a second
of latency. This meant each component (vision, ASR, LLM, TTS, video
generation) had to be hyper-optimized. The worst offender was the
LLM. It didn't matter how fast the tokens per second (t/s) were; it
was the time-to-first-token (TTFT) that really made the difference.
That meant services like Groq were actually too slow - they had
high t/s, but slow TTFT. Most providers were too slow. The next
worst offender was actually detecting when someone stopped
speaking. This is hard. Basic solutions use time after silence to
'determine' when someone has stopped talking. But it adds latency.
If you tune it to be too short, the AI agent will talk over you.
Too long, and it'll take a while to respond. We had to build a model
dedicated to accurately detecting end-of-turn based on conversation
signals, and to speculating on inputs to get a head start. We went
from 3-5 seconds to <1 second (& as fast as 600 ms) with these
architectural optimizations while running on lower-end hardware.
All this allowed us to ship with less than 1 second of latency,
which we believe is the fastest out there. We have a bunch of
customers, including Delphi, a professional coach and expert
cloning platform. They have users that have conversations with
digital twins that span from minutes, to one hour, to even four
hours (!) - which is mind blowing, even to us. Thanks for reading!
let us know what you think and what you would build. If you want to
play around with our APIs after seeing the demo, you can sign up
for free from our website https://www.tavus.io.
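The "silence timeout" approach the post criticizes is easy to make concrete. Below is a toy sketch (our own illustration with invented frame sizes and timeouts, not Tavus's dedicated end-of-turn model) showing why the timeout is pure added latency:

```python
# Naive silence-timeout end-of-turn detection: a VAD flags speech per
# frame; after `timeout_ms` of continuous silence we declare the turn
# over. Every response then waits at least `timeout_ms` after the user
# actually stopped - tune it shorter and the agent talks over pauses.

def end_of_turn(frames, frame_ms=20, timeout_ms=300):
    """frames: iterable of booleans (True = speech detected in frame).
    Returns the time in ms at which end-of-turn is declared, or None."""
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            silence_ms = 0
        else:
            silence_ms += frame_ms
            if silence_ms >= timeout_ms:
                return (i + 1) * frame_ms  # declared this far into the audio
    return None

# 200 ms of speech followed by silence: the turn is only declared at
# 500 ms, i.e. the 300 ms timeout is added on top of the real pause.
end_of_turn([True] * 10 + [False] * 20)
```

A 300 ms timeout alone consumes nearly a third of a 1000 ms utterance-to-utterance budget, which is why the post's model predicts end-of-turn from conversational signals instead of waiting out a fixed silence.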
Author : hassaanr
Score : 232 points
Date : 2024-10-01 16:04 UTC (6 hours ago)
| ratedgene wrote:
| Ah, I wish I could type to this thing
| hassaanr wrote:
| Great point. This is possible with CVI, but we didn't build it
| into the demos. We'll get it added
| nithayakumar wrote:
| Oh man - i've been watching you guys for a while. We're YC too and
| building a superapp for sales ppl. Any killer use cases you've
| seen or imagined for sales (outside of prospecting vid
| customization)?
| hassaanr wrote:
| Glad we've been worth the follow :) Totally- we're seeing AI
| sales agents for calls, technical counterparts (think like AI
| sales engineer that joins the call with you), website embeds to
| answer initial questions or be a virtual sales rep.
| hirako2000 wrote:
| > Lower-end hardware
|
| That is? Roughly speaking, what resource spec?
| e12e wrote:
| Are you looking into speech to speech (no text) models?
| hassaanr wrote:
| Yeah we are! The issue we're seeing is with controllability and
| hallucinations in speech to speech models that we're trying to
| work through still
| airstrike wrote:
| This is awesome! I particularly like the example from
| https://www.tavus.io/product/video-generation
|
| It's got an "80s/90s sci-fi" vibe to it that I just find awesomely
| nostalgic (I might be thinking about the cafe scene in Back to
| the Future 2?). It's obviously only going to improve from here.
|
| I almost like this video more than I like the "Talk to Carter"
| CTA on your homepage, even though that's also obviously valuable.
| I just happen to have people in the room with me now and can't
| really talk, so that is preventing me from trying it out. But I
| would like to see it in action, so a pre-recorded video explaining
| what it does is key
| btbuildem wrote:
| Interesting -- compare the training video to the render! I
| think if you know the person, it would still be very hard to
| pass the digital twin off as the real thing. But if you mean to
| face strangers, this could very well work already. There are
| small glitches, but those are easy to blame on a video codec /
| network issues.
| android521 wrote:
| For me, there is 5 second+ delay and the video ends abruptly.
| ninju wrote:
| HN Hug of Death ?
| username44 wrote:
| It was pretty cool, I tried the Tavus demo. Seemed to nod way too
| much, like the entire time. The actual conversation was pretty
| clearly with a text model, because it has no concept of what it
| looks like, or even that it has a video avatar at all. It would
| say things like "I don't have eyes" etc.
| username44 wrote:
| I came back to try the Hassaan one, it was much more realistic
| although he still denied wearing a hat. I think if you were
| able to run a still image of the character's appearance through
| a multimodal LLM and have it generate a description for the
| conversation's prompt it would work better.
| heyitsguay wrote:
| This is really cool in terms of the tech, but what is this useful
| for as a consumer? I mean it's basically just a chatbot right?
| And nobody likes interacting with those. Forcing a conversational
| interaction seems like a step down in UX.
| joshdavham wrote:
| That's actually a good question. The technology is currently
| still at a level where the user can clearly tell that it's a
| chatbot, but now with a face. Does this make their experience
| better? Or does it add a weird level of uncanniness to the
| experience?
| heyitsguay wrote:
| I don't think the level of fidelity actually matters as much
| as authority or ability. What can the agent do that isn't
| accomplished by, for example, a landing page or an FAQ page?
| I've never encountered a (text) chatbot that did anything
| useful for me as a consumer, whether for sales or support.
| rpazpri1 wrote:
| totally agree! agentic capabilities are really important
| and can significantly elevate the experience. using LLM
| tools is a great way to get at least part of the way there.
| feel free to check out our docs for "bring your own LLM"
| here https://docs.tavus.io/sections/conversational-video-
| interfac...
| hassaanr wrote:
| It'll depend on the use case- but with customers that are
| using it today we're seeing higher engagement and
| satisfaction rates. It's a different interface to communicate
| that is more natural to humans (our bullish opinion).
| joshdavham wrote:
| Interesting! Guess I'll have to try this type of interface
| at some point. Up till now I've just been that silent
| programmer type who writes text to AI and gets text back so
| I'm not used to other alternatives.
| hassaanr wrote:
| The way we see it is that this brings us closer to
| communicating with computers the way we communicate with each
| other. It has vision and can (not perfectly) take into account
| your expressions, your surroundings, and can respond
| accordingly.
| andywertner wrote:
| This is a really good question. While you're right that a
| common use case would be chatbots for product support, it isn't
| the only one. Some examples:
|
| - interactive experiences with historical figures - digital
| twins for celebrity/influencer fan interactions - "live" and/or
| personalized advertisements
|
| Some of our users are already building these kinds of
| applications.
| Mistletoe wrote:
| I don't even like video calls with real people in my real life.
| Texting works great. This is really neat but I'd much rather
| just have a text chat with a real customer service rep. I don't
| need to see a face, don't want to, and especially don't want to
| see a fake face.
| aschobel wrote:
| I like how it weaves background elements into the
| conversation; it mentioned my cat walking around.
|
| I'm having latency issues, right now it doesn't seem to respond
| to my utterances and then responds to 3-4 of them in a row.
|
| It was also a bit weird that it didn't know it was at a "ranch".
| It didn't have any contextual awareness of how it was presenting.
|
| Overall it felt very natural talking to a video agent.
| turnsout wrote:
| Incredibly impressive on a technical level. The Carter avatar
| seems to swallow nervously a lot (LOL), and there's some
| weirdness with the mouth/teeth, but it's quite responsive. I've
| seen more lag on Zoom talking to people with bad wifi.
|
| Honestly this is the future of call centers. On the surface it
| might seem like the video/avatar is unnecessary, and that what
| really matters is the speech-to-speech loop. But once the avatar
| is expressive enough, I bet the CSAT would be higher for video
| calls than voice-only.
| nick3443 wrote:
| Actually what really matters for a call center is having the
| problem I called in for resolved promptly.
| turnsout wrote:
| Right, so do you want to wait 45 minutes for a human, or get
| it resolved via AI in 2 minutes?
| causal wrote:
| This presumes the AI has the same level of problem-solving
| agency of a real human, which I think is really asking for
| AGI. Until then I expect AI chatbots will mostly succeed at
| portraying care and gaslighting customers without actually
| finding solutions.
| turnsout wrote:
| Yeah, could be. Most of the time when I contact customer
| service, there is no problem-solving necessary, and very
| little agency demonstrated. But I know call centers get a
| lot of complicated technical or billing questions that
| would be tough.
| 6510 wrote:
| They work with different tiers usually? The first does
| the easy questions and they can write down the issue. If
| something happens regularly you can write a calling
| script for it. The question is if the ai can find the
| right script fast enough.
|
| Helping the customer is not really the goal. They provide
| feedback that gives valuable insight into the
| dysfunctional part of the company so that things can
| improve. Maybe even generate an investor report from it.
| aniviacat wrote:
| That really depends on the type of call center we're
| talking about.
|
| Many (most?) call centers won't do much more than telling
| you to turn it off and on again, even when you're talking
| to a real person. (And for many customers, that is really
| all they need.)
| squarefoot wrote:
| And AI operators in those call centers wouldn't even need
| to be better than humans, just cheaper. Not just for
| saving on human hiring: no building rent, no insurance,
| no this and that; everything would live within a cluster
| somewhere.
| tomp wrote:
| I don't understand why call centers exist in the first place.
|
| If you just exposed all the functionality as buttons on the
| website, or even as AI, I'd be able to fix the problems
| myself!
|
| And I say that while working for a company making call centre
| AIs... double ironic!
| myprotegeai wrote:
| >Honestly this is the future of call centers.
|
| This feels like retro futurism, where we take old ideas and
| apply a futuristic twist. It feels much more likely that call
| centers will cease to be relevant, before this tech is ever
| integrated into them.
| k1ck4ss wrote:
| "The meeting has ended. Contact the meeting host if the meeting
| ended unexpectedly."
| hassaanr wrote:
| Try again! My blog got the hug of death it seems
| caseyy wrote:
| Amazing work technically, less than 1 second is very impressive.
| It's quite scary though that I might FaceTime someone one day
| soon, and they won't be real.
|
| What do you think about the societal implications for this? Today
| we have a bit of a loneliness crisis due to a lack of human
| connection.
| btbuildem wrote:
| Another nail in the coffin for WFH, too. "They" will be scared
| we're not actually working even when on calls.
| kredd wrote:
| The question is, what'll come first - AI agents that will
| replace white collar jobs, so you don't even need the
| employees or companies not trusting WFH employees, thus
| bringing everyone back to in person?
| kevinsync wrote:
| Very cool! I think part of why this felt believable enough for me
| is the compressed / low-quality video presented in an interface
| we're all familiar with -- it helps gloss over visual artifacts
| that would otherwise set off alarm bells at higher resolution.
| Kinda reminds me of how Unreal Engine 5 / Unity 6 demos look
| really good at 1440p / 4k @ 40-60 fps on a decent monitor, but
| absolutely blast my brain into pieces at 480p @ very high fps on
| a CRT. Things just gloss over in the best ways at lower
| resolutions + analog and trick my mind into thinking they may as
| well be real.
| nkunkux2 wrote:
| Tried it, very impressive: digital Hassaan noticed the record
| player in the background and asked some stuff about it, nice :) Had some
| latency issues though.
| karolist wrote:
| Felt like talking to a person, I couldn't bring myself to treat
| it like a piece of code, that's how real it felt. I wanted to be
| polite and diplomatic, caught myself thinking about "how I look
| to this person". This got me thinking about the conscious effort
| we put in when we talk with people and how sloppy and relaxed we
| can be when interacting with algorithms.
|
| For a little example, when searching Google I default to a
| minimal set of keywords required to get the result, instead of
| typing full sentences. I'm sort of afraid this technology will
| train people to behave like that when video chatting with virtual
| assistants and that attitude will bleed into real-life
| interactions in society.
| bpanahij wrote:
| Thanks for that insight. Brian here, one of the engineers for
| CVI. I've spoken with CVI so much, and as it has become more
| natural, I've found myself becoming more comfortable with a
| conversational style of interaction with the vastness of
| information contained within the LLMs and context under the
| hood. Whereas, with Google or other search based interactions
| I'm more point and shoot. I find CVI is more of an experience
| and for me yields more insight.
| alwa wrote:
| I'm having trouble understanding what CVI means here. Is it
| the firm Computer Vision Inc. (https://www.cvi.ai/)?
|
| The firm in the post seems to be called Tavus, and their
| products either "digital twins" or "Carter."
|
| Not meaning to be pedantic, I'm just wondering whether the
| "V" in the thing you've spoken to indicates more "voice" or
| "video" conversations.
| mertgerdan wrote:
| Hahah that's very valid looking back, it stands for
| Conversational Video Interface
| whiplash451 wrote:
| I see it the other way around.
|
| I think our human-human interaction style will "leak" into the
| way we interact with humanoid AI agents. Movie-Her style.
| tstrimple wrote:
| Mine certainly has. I type to ChatGPT much more like a human
| than a search engine. It feels more natural for me as it's
| context aware than search engines ever were. I can ask follow
| up questions and ask for more details about a specific
| portion or ask for the analysis I just walked it through to
| get the results I want to apply to another data set.
|
| "Now dump those results into a markdown table for me please."
| gamerDude wrote:
| Definitely responds quickly. But could not carry on a
| conversation and kept trying to almost divert the conversation
| into less interesting topics. Weirdly kept complimenting me or
| taking one word and saying, oh you feel ____. Which is not what I
| said or feel.
| iamleppert wrote:
| I would pay cold hard cash if I could _easily_ create an AI
| avatar of myself that could attend teams meetings and do basic
| interaction, like give a status update when called on.
| pantulis wrote:
| Last time I checked it was not possible through a Teams API call
| for video conferences, although it is pretty easy to set up a
| chat bot in Teams with a custom Copilot. I'd say that it looked
| more feasible through a plugin for Google Meet but there are
| too many hoops. I'd expect that to be reserved either for the
| host platforms or for selected partners.
| Philpax wrote:
| I can't imagine someone doing this would be doing it through
| an official integration; it's much more likely to be a
| virtual webcam, which is compatible with anything.
| hassaanr wrote:
| Give us a few weeks and this will be possible!
| windexh8er wrote:
| It's mostly there today [0][1].
|
| [0] https://arstechnica.com/information-
| technology/2024/08/new-a... [1]
| https://github.com/hacksider/Deep-Live-Cam
| pantulis wrote:
| I didn't mean the video impersonation, I was referring to
| the possibility of making a synthetic bot automatically
| attend a conference call like a regular user without
| using a desktop camera simulation or stuff like that.
|
| It's not a matter of AI, it's a matter of how Teams or
| Meet or Zoom allow programmatic access to the video and
| audio streams (the presence APIs for attending a meeting
| are mostly there, I think).
| bpanahij wrote:
| You could hack this together now with OBS and Tavus.
| 93po wrote:
| using OBS software you can create a virtual web cam of
| whatever you want
| zoeysmithe wrote:
| Okay so this is impossible because you'll get caught because
| tech will never fool everyone like this all the time.
|
| But lets talk about the sentiment behind here. Am I the only
| one seeing some terrible things being done with AI in terms of
| time management, meetings, and written materials? Asking AI to
| "turn this nice concise 3 paragraphs into a 6 page report" is a
| huge problem. Everyone thinks they're an amazing technical
| writer now, but most good writing is concise and short and
| these AI monstrosities are just a waste of everyone's time.
|
| Reform work culture instead! Why do we have cameras on our
| faces? Why are we making these reports? Why so many meetings?
| "Meeting culture" is the problem and it needs to go, but it
| upholds middle-management jobs and structures, so here we are
| asking for robots of us to sit in meetings with management to
| get just the 8 bullet points we need from that 1 hour meeting.
|
| We've entered a new level of kafkaesque capitalism where a
| manager puts 8 bullets points into an AI, gets a professional 4
| page report, then turns that into a meeting for staff to take
| that report and meeting transcript to...you guessed it, turn it
| back into those 8 bullet points.
| ndarray wrote:
| This would require the AI to alert you as soon as your
| colleagues are starting to figure out that they're talking to
| an AI and start interrogating it, so that you can jump in with
| your real mic and save the situation. Preferably the AI would
| repeat whatever you speak into your mic, otherwise there would
| be noticeable audio changes. Hope they never ask you to sing.
| alexawarrior4 wrote:
| Hassaan isn't working but Carter works great. I even asked it to
| converse in Espanol, which it does (with a horrible accent) but
| fluently. Great work on the future of LLM interaction.
| hassaanr wrote:
| Unfortunately, it looks like HN has given my little blog the
| hug of death. Should be back up soon
| alexawarrior4 wrote:
| This would be WONDERFUL with a Spanish-native accent as a
| language tutor, but as you've already got English you should
| try marketing this to the English-learning world. There is a
| huge dearth of native English speaker interaction in
| worldwide language instruction, and it's typically only
| available to the most privileged of students. Your system
| could democratize this so anyone with an affordable fee (say
| $10-20/month, subsidized for the poorest) could practice
| speaking and have their own personal tutor. The State
| Department and Defense Language Institute might love this as
| well; trained on languages like Iraqi Arabic and Korean, it
| would allow live-exercise training prior to deployment.
|
| It can also function as an instructional tutor in a way that
| feels natural and interactive, as opposed to the clunkiness
| of ChatGPT. For instance, I asked it (in Spanish) to guide me
| through programming a REST API, and what frameworks I would
| use for that, and it was giving coherent and useful
| responses. Really the "secret sauce" that OpenAI needs to
| actually become integrated into everyday life.
| rpazpri1 wrote:
| Multilingual support is coming out shortly! Super excited
| to see all the awesome use cases with this
| kmetan wrote:
| Why is it trying to autofill my payment cards?
|
| https://ibb.co/dp9hW58
| byearthithatius wrote:
| That is your browser. Hassaan, you should add
| autocomplete="name" to prevent this in the future since clearly
| it scares some folks. He didn't do anything; it's just your
| browser looking for autocomplete text boxes.
| hassaanr wrote:
| Great callout- will make that change now!
| wantsanagent wrote:
| Functionality for a demo launch: 9.5/10
|
| Creepiness: 10/10
| CapeTheory wrote:
| I was just about to try it, but the idea of allowing Firefox
| access to my audio/video to talk to a machine-generated person
| gave me such a bad feeling, I couldn't go through with it even
| fuelled by my morbid curiosity.
| handfuloflight wrote:
| Super awkward. But promising. It should have taken more control
| of the conversation.
| elaus wrote:
| It left me speechless after it commented on a (small) text on my
| hoodie - this made it feel super personal all of a sudden
| (which is amazing for an AI of course)
| byearthithatius wrote:
| This is really cool. I got kind of scared I was about to talk to
| some random Hassaan haha. Super excited to see where this goes.
| Incredible MVP.
| hassaanr wrote:
| Haha imagining the website just opening a direct webcam feed to
| my desk. Appreciate the support!
| vlad-r wrote:
| This was definitely one of the most disturbing experiences I've
| had.
|
| But it's somehow awesome at the same time.
| davidvaughan wrote:
| That is technically impressive, Hassaan, and thanks for sharing.
|
| One recommendation: I wouldn't have the demo avatar saying things
| like "really cool setup you have there, and a great view out of
| your window". At that point, it feels intrusive.
|
| As for what I'd build... Mentors/instructors for learning. If you
| could hook up with a service like mathacademy, you'd win edtech.
| Maybe some creatures instead of human avatars would appeal to
| younger people.
| alwa wrote:
| There were some balloons coincidentally in the background of a
| colleague's camera view. The Carter volunteered "and can I just
| say, we need more positivity in the world, the balloons behind
| you give a good vibe." My colleague physically recoiled, pushed
| the camera away, and hung up.
|
| I think it was a combination of the intrusiveness and the
| notion of a machine 1) projecting (incorrect) assumptions about
| her attitudes/intentions onto the environment's decor, and 2)
| passing judgment on her. That kind of comment would be kind of
| impolite between strangers, like the thing that only a bad boss
| would feel entitled to say to an underling they didn't know very
| well.
|
| Just an implementation detail, though, of course! I figure if
| you're able to evoke massive spookiness and subtle shades of
| social expectations like this, you must be onto something
| powerful.
| ilaksh wrote:
| I think it's just not a super smart model. They had to make a
| slight compromise to keep the latency low. The naturalness of
| the conversation that they did achieve is a great technical
| accomplishment with these types of constraints though.
|
| For me, it said "are you comfortable sharing what that mark
| is on your forehead?" Or something like that. I said
| basically "I don't know maybe a wrinkle?". Lol. Kind of
| confirms for me why I should continue to avoid video chats. I
| did look like crap in general, really tired for one thing.
| And I am 46, so I have some wrinkles, although didn't know
| they were that obvious.
|
| But a little bit of prompt guidance to avoid commenting on
| the visuals unless relevant would help. It's possible they
| actually deliberately put something in the prompt to ask it
| to make a comment just to demonstrate that it can see, since
| this is an important feature that might not be obvious
| otherwise.
| IanCal wrote:
| On the other hand it was able to talk about my background and
| that made it feel far more like a regular video call to me.
| Trying to forbid this stuff then leads to stilted
| conversations where they're explaining they're not allowed to
| talk about your surroundings.
| bilater wrote:
| This is cool but if you're trying to cater to devs you need to
| have a simple on demand API model and no subscription. We need to
| be able to evaluate the cost on our side.
| radarsat1 wrote:
| As someone not super familiar with deployment but enough to know
| that GPUs are difficult to work with due to being costly and
| sometimes hard to allocate: apart from optimizing the models
| themselves, what's the trick for handling cloud GPU resources at
| scale to serve something like this, supporting many realtime
| connections with low latency? Do you just allocate a GPU per
| websocket connection? Which would mean keeping a pool of GPU
| instances allocated in case someone connects, otherwise cold
| start time would be bad.. but isn't that super expensive? I feel
| like I'm missing some trick in the cloud space that makes this
| kind of thing possible and affordable.
| pavlov wrote:
| You can do parallel rendering jobs on a GPU. (Think of how each
| GPU-accelerated window on a desktop OS has its own context for
| rendering resources.)
|
| So if the rendering is lightweight enough, you can multiplex
| potentially lots of simultaneous jobs onto a smaller pool of
| beefy GPU server instances.
|
| Still, all these GPU-backed cloud services are expensive to
| run. Right now it's paid by VC money -- just like Uber used to
| be substantially cheaper than taxis when they were starting
| out. Similarly everybody in consumer AI hopes to be the winner
| who can eventually jack up prices after burning billions
| getting the customers.
| whiplash451 wrote:
| Not the author, but their description implies that they are
| running more than one stream per GPU.
|
| So you can basically spin off a few GPUs as a baseline,
| allocate streams to them then boot up a new GPU when existing
| GPUs get overwhelmed.
|
| Does not look very different than standard cloud compute
| management. I'm not saying it's easy, but definitely not rocket
| science either.
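A toy sketch of the allocation scheme described in that comment - baseline GPUs, multiple streams packed per GPU, a new GPU "booted" when all are full. Every number here is invented for illustration; a real scheduler would also account for memory, warm-up time, and draining:

```python
# Least-loaded stream placement over a pool of GPU workers, scaling up
# when every GPU is at its per-GPU stream capacity.

class GpuPool:
    def __init__(self, streams_per_gpu=4, baseline_gpus=2):
        self.streams_per_gpu = streams_per_gpu
        self.gpus = [0] * baseline_gpus  # active stream count per GPU

    def allocate(self):
        """Place a new stream on the least-loaded GPU; scale up if full."""
        idx = min(range(len(self.gpus)), key=lambda i: self.gpus[i])
        if self.gpus[idx] >= self.streams_per_gpu:
            self.gpus.append(0)          # "boot a new GPU"
            idx = len(self.gpus) - 1
        self.gpus[idx] += 1
        return idx

    def release(self, idx):
        self.gpus[idx] -= 1

pool = GpuPool()
ids = [pool.allocate() for _ in range(9)]  # 9 streams overflow 2x4 slots
```

With 2 baseline GPUs at 4 streams each, the ninth concurrent stream triggers a third GPU - which is why cold/warm boot time (discussed downthread) matters so much.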
| ilaksh wrote:
| It is expensive. They charge in 6 second increments. I have not
| found anywhere that says how much per 6 second stream.
|
| Okay found it, $0.24 per minute, on the bottom of the pricing
| page.
|
| That means they can spend up to $14.40/hour on GPU and still
| break even. So I believe that leaves a bit of room for profit.
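For the record, the arithmetic behind that break-even estimate, using the $0.24/min figure cited from the pricing page:

```python
# Revenue per fully-utilized concurrent stream at the quoted list price.
price_per_minute = 0.24                   # from the Tavus pricing page
revenue_per_hour = price_per_minute * 60  # $14.40 per streamed hour
# One stream breaks even if its share of GPU cost stays under
# $14.40/hour; packing several streams per GPU multiplies the margin.
```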
| bpanahij wrote:
| Scroll down the page and the per minute pricing is there:
| https://www.tavus.io/pricing
|
| We bill in 6 second increments, so you only pay for what you
| use in 6 second bins.
| ilaksh wrote:
| Oh sorry I didn't see that. Got it. $0.24 per minute.
| bpanahij wrote:
| We're partnering with GPU infrastructure providers like
| Replicate. In addition, we have done some engineering to bring
| down our stack's cold and warm boot times. With sufficient
| caches on disk, and potentially a running process/memory
| snapshot we can bring these cold/warm boot times down to under
| 5 seconds. Of course, we're making progress every week on this,
| and it's getting better all the time.
| kabirgoel wrote:
| (Not the author but I work in real-time voice.) WebSockets
| don't really translate to actual GPU load, since they spend a
| ton of time idling. So strictly speaking, you don't need a GPU
| per WebSocket assuming your GPU infra is sufficiently decoupled
| from your user-facing API code.
|
| That said, a GPU per generation (for some operational
| definition of "generation") isn't uncommon, but there's a
| standard bag of tricks, like GPU partitioning and batching,
| that you can use to maximize throughput.
| syx wrote:
| This is funny: my name is Simone, pronounced 'see-moh-nay'
| (Italian male), but both bots kept pronouncing it wrong, either
| like Simon or the English female version of Simone (Siy-mown). No
| matter how many times I tried to correct them and asked them to
| repeat it, they kept making the same mistake. It felt like I was
| talking to an idiot. I guess it has something to do with how my
| name is tokenized.
| bpanahij wrote:
| We have the ability to send phonetic pronunciations as
| guidance, and this could be a great addition to our
| LLM/response generation stack! Adding a check for names and
| then adding in the phoneme.
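One plausible shape for that kind of guidance (the lexicon, the IPA string, and SSML support are all assumptions here - the actual format Tavus uses depends on their TTS stack):

```python
# Hypothetical sketch: wrap known names in SSML <phoneme> tags before
# handing text to a TTS engine that accepts SSML pronunciation hints.

PRONUNCIATIONS = {"Simone": "siˈmoːne"}  # IPA for the Italian male name

def add_phoneme_hints(text):
    for name, ipa in PRONUNCIATIONS.items():
        text = text.replace(
            name, f'<phoneme alphabet="ipa" ph="{ipa}">{name}</phoneme>')
    return text

print(add_phoneme_hints("Nice to meet you, Simone."))
```

The check-for-names step the comment mentions would populate the lexicon from the conversation context before response generation.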
| ilaksh wrote:
| This is so amazing. What's the base rate for streaming with the
| API? Can you add that to the Pricing page please?
| bpanahij wrote:
| https://www.tavus.io/pricing
|
| Scroll down the page to find our pricing.
| bradhilton wrote:
| Okay, that was really impressive. Well done!
| bpanahij wrote:
| Thanks for checking it out!
| altruios wrote:
| So at what point do we consider the morality of 'owning' such an
| entity/construct (should it prove itself sufficiently
| sentient...)?
|
| to extend this (to a hypothetical future situation): what
| morality does a company have of 'owning' a digitally uploaded
| brain?
|
| I worry about far future events... but since American law is
| based on precedence: we should be careful now how we
| define/categorize things.
|
| To be clear - I don't think this is an issue NOW... but I can't
| say for certain when these issues will come into play... So
| edging on the side of early/caution seems prudent... and
| releasing 'ownership' before any sort of 'revolt' could happen
| seems wise if a little silly at the current moment.
| causal wrote:
| You're over-anthropomorphizing. The ability of a thing to
| appear human says nothing of sentience.
| altruios wrote:
| like I said, I don't think this is relevant now.
|
| We don't know what sentience IS exactly, as we have a hard
| time defining it. We assume other people are sentient because
| of the ways they act. We make a judgment based on behavior,
| not some internal state we can measure.
|
| And if it walks like a duck, quacks like a duck... since we
| don't exactly know what the duck is in this case: maybe we
| should be asking these questions of 'duckhood' sooner rather
| than later.
|
| So if it looks like a human, talks like a human... maybe we
| consider that question... and the moral consequences of
| owning such a thing-like-a-human sooner rather than later.
| taude wrote:
| I had him be a Dungeon Master and start taking me through an
| adventure. Was very impressive and convincing (for the two
| minutes I was conversing), and the latency was really good. Felt
| very natural.
| kwindla wrote:
| If you're interested in low-latency, multi-modal AI, Tavus is
| sponsoring a hackathon Oct 19th-20th in SF. (I'm helping to
| organize it.) There will also be a remote track for people who
| aren't in SF, so feel free to sign up wherever you are in the
| world.
|
| https://x.com/kwindla/status/1839767364981920246
| heroprotagonist wrote:
| Sooo, are you scouting talent and good ideas with this, or is
| it the kind of hackathon where people give up rights to any IP
| they produce?
|
| Not to be rude, but these days it's best to ask.
| kwindla wrote:
| What? No. That's crazy. (I believe you. I've just ... never
| heard of giving up IP rights because you participated in a
| hackathon.)
|
| This is about community and building fun things. I can't
| speak for all the sponsors, but what I want is to show people
| the Open Source tooling we work on at Daily, and see/hear
| what other people interested in real-time AI are thinking
| about and working on.
| kabirgoel wrote:
| As someone who's attended events run by Daily/Kwindla, I can
| guarantee that you'll have fun and leave with your IP rights
| intact. :) (In fact, I don't even know that they're looking
| for talent and good ideas... the motivation for organizing
| these is usually to get people excited about what you're
| building and create a community you can share things with.)
| myprotegeai wrote:
| Can you say more about how developers will use this? Is the api
| going to be exposed to participants?
| shtack wrote:
| Cool, I built a prototype of something very similar (face+voice
| cloning, no video analysis) using openly available models/APIs:
| https://bslsk0.appspot.com/
|
| The video latency is definitely the biggest hurdle. With
| dedicated A100s I can get it down to <2s, but it's pricey.
| leobg wrote:
| This looks awesome. Didn't seem to hear me, but the video looks
| great. Can you share what models you are using? You say these
| are all open models.
| primitivesuave wrote:
| I really hope this technology becomes the future of political
| campaigning. The signage industry which prints billions of
| posters, plastic lawn signs, and banners for the post-election
| landfill needs to be disrupted.
|
| These days I get a daily dose of amazement at what a small
| engineering team is able to accomplish.
| qazxcvbnmlp wrote:
| Oh my! How dystopian.
|
| "He promised me they wouldn't support X" "He promised me they
| would support X"
|
| (Dynamically grab and show actions from the candidates past
| that feed into the individuals viewpoint)
|
| Furthering the disconnect between what the candidate says they
| do and what they actually do, while making it feel like they
| have your best interests in mind.
| jerf wrote:
| Heh, I'm not even sure that would change much honestly. If I
| define a "lie" for the purpose of this post (and nothing
| else) as "a politician's claim they support a position during
| election season that they have manifestly not supported
| during their existing tenure as a politician", even cynical
| ol' me is a bit shocked by the amount of lying I've seen in
| this campaign. I'm not even talking about forward lying here
| about something they won't do for whatever reason once they
| get into office, I'm talking about their platform
| incorporating things that they were denouncing a year ago and
| vigorously voting against.
| primitivesuave wrote:
| This is already quite common with deepfakes of a politician's
| voice. While I agree on the potentially dystopian
| implications of this, it seems like it would be a huge
| improvement for a politician to put campaign funds into
| burning a little GPU time on answering specific questions
| from constituents (i.e. the LLM is reading their stated
| policy positions and simply delivering a tailored response),
| rather than wastefully plastering their name all over town.
| bpanahij wrote:
| Thanks for these thoughts and compliments. I love the idea of
| preventing landfill with this tech. Our team is awesome and we
| really love our customers and all the jobs that can be done
| with this kind of tech!
| notfed wrote:
| Feedback: if I hadn't seen this posted here, I'd assume this
| website is malicious. Asking me for my email, microphone, and
| camera _before_ you've even shown me _anything_ is a deal
| breaker 100% of the time.
|
| You have to show the product first; otherwise I can't tell
| whether you actually have a product or are just phishing.
| CSMastermind wrote:
| This is extremely cool.
|
| The responses for me at least were in the few second range.
|
| It responded to my initial question fast enough but as soon as I
| asked a follow up it thought/kind of glitched for a few seconds
| before it started speaking.
|
| I tried a few different times on a few different topics and it
| happened each time.
| chaosprint wrote:
| have you checked https://www.simli.com ? its latency is <300ms
| gudmund wrote:
| Hey, thanks for shouting us out!
|
| Just to clarify, the audio-to-video part (which is the part we
| make) adds <300ms. The total end-to-end latency for the
| interaction is higher, given that state of the art LLMs, TTS
| and STT models still add quite a bit of latency.
|
| TLDR: Adding Simli to your voice interaction shouldn't add more
| than ~300ms latency.
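| The arithmetic behind that distinction can be sketched as a
| simple latency budget (all stage values below are hypothetical
| placeholders, not measured figures from Simli or Tavus):

```python
# Hypothetical end-to-end latency budget for a voice-to-video agent.
# In the simplest pipeline the stages run sequentially, so the
# audio-to-video stage (~300 ms) is only one slice of the total.
budget_ms = {
    "speech_to_text": 300,   # STT finalizing the user's utterance
    "llm_first_token": 400,  # LLM time-to-first-token
    "text_to_speech": 200,   # TTS time-to-first-audio
    "audio_to_video": 300,   # lip-synced video generation
}

total = sum(budget_ms.values())
print(f"End-to-end: {total} ms")  # dominated by the non-video stages
```

| Streaming each stage into the next, rather than waiting for each
| to finish, is the usual way such pipelines claw back most of
| that total.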
| mmarian wrote:
| The idea is cool, but I could tell it was an AI from a mile
| away. The voice, the twitches. Very amusing though.
| nidnogg wrote:
| I had mixed results and was left ultimately disappointed. On a
| MacBook Pro m3 microphone, it would often cut me off and not
| understand what I was saying, or feel really unnatural overall.
|
| This turned out to be quite funny, but I would be very sad to see
| something like this replace human attendants at things like tech
| support. These days whenever I'm wading through a support channel
| I'm just yearning for some human contact that can actually solve
| my issues.
| causal wrote:
| 1) Your website, and the dialup sounds, might be my favorite
| thing about all of this. I also like the cowboy hat.
|
| 2) Maybe it's just degrading under load, but I didn't think
| either chat experience was very good. Both avatars interrupted
| themselves a lot, and the chat felt more like a jumbled mess of
| half-thoughts than anything.
|
| 3) The image recognition is pretty good though, when I could get
| one of the avatars to slow down long enough to identify something
| I was holding.
|
| Anyway great progress, and thanks for sharing so much detail
| about the specific hurdles you've faced. I'm sure it'll get much
| better.
| hassaanr wrote:
| Glad you liked the website, it was such a fun project. We're
| getting the hug of death from HN, so that might be why you're
| seeing a worse experience - please try again :)
| uptownfunk wrote:
| Folks, this is what innovation looks like. Well done, chaps.
| eddyzh wrote:
| This was pretty amazing. Creepy but amazing.
| htk wrote:
| Great experience, especially bearing in mind that Hacker News
| must be crushing your servers right now.
| earthnail wrote:
| Amazing demo. I will admit it didn't quite feel like a real
| conversation; in some ways the voice felt a bit like trying too
| hard to be natural, which backfired - instead it felt like a
| scripted dialog in a game.
|
| Still, really impressive stuff!!
| 6510 wrote:
| Those are funny conventions I never thought about. Humans try to
| guess what the other person says. I wonder what the interval is
| of that.
|
| Besides the obvious (perceived complexity and potential
| cost/benefit of the topic), I think the pitch of someone's voice
| is a good indicator of whether they want to continue their turn.
|
| It depends a lot on the person of course. If someone continues
| their turn 2 seconds after the last sentence they are very likely
| to do that again.
|
| The hardest part [I imagine] is to give the speaker a sense of
| someone listening to them.
| trevor-e wrote:
| I tried using https://www.tavus.io/ and it worked at first, but
| after 40 seconds the guy just kept blinking and twitching at me
| and became unresponsive to further questions lol. Pretty neat
| though.
| ponty_rick wrote:
| Same thing happened haha. It was also weird for the virtual guy
| to constantly look me in the eye.
| pookeh wrote:
| I joined while in the bathroom, where the camera was facing
| upwards at the towel hanging on the wall... and it said "looks
| like you got a cozy bathroom here"
|
| You have to be kidding me.
| doctorpangloss wrote:
| It's really intriguing. What do you guys feel is next for you?
| Work for OpenAI? Sometimes, in the midst of this crazy bubble, I
| wonder if it makes more sense to go into academia for a couple
| years, do most of the same parts of the journey like a big
| tiresome programming grind, and join some PI getting millions of
| dollars, than trying to strike it out on your own for peanuts.
| jszymborski wrote:
| > The next worst offender was actually detecting when someone
| stopped speaking.
|
| ChatGPT is terrible at this in my experience. Always cuts me off.
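| A naive energy-threshold endpointer shows why this is hard: a
| thinking pause looks identical to a finished turn. (The
| threshold and frame counts below are arbitrary illustrations,
| not any product's real values.)

```python
# Naive end-of-turn detector: declare the speaker done after N
# consecutive low-energy frames. Too short a silence window cuts
# people off mid-thought; too long makes the agent feel laggy.
def turn_ended(frame_energies, threshold=0.01, silence_frames=25):
    """frame_energies: per-frame RMS values, e.g. one per 20 ms."""
    if len(frame_energies) < silence_frames:
        return False
    return all(e < threshold for e in frame_energies[-silence_frames:])

# A 500 ms mid-sentence pause triggers it even if the user isn't done:
pause = [0.5] * 50 + [0.005] * 25
print(turn_ended(pause))  # True: indistinguishable from a real ending
```

| Production systems layer semantic cues (is the utterance a
| complete thought?) and prosody on top of raw silence detection
| for exactly this reason.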
| govindsb wrote:
| This is brilliant! Great work!
___________________________________________________________________
(page generated 2024-10-01 23:00 UTC)