[HN Gopher] Show HN: Infinity - Realistic AI characters that can...
       ___________________________________________________________________
        
       Show HN: Infinity - Realistic AI characters that can speak
        
       Hey HN, this is Lina, Andrew, and Sidney from Infinity AI
       (https://infinity.ai/). We've trained our own foundation video
       model focused on people. As far as we know, this is the first time
       someone has trained a video diffusion transformer that's driven by
       audio input. This is cool because it allows for expressive,
       realistic-looking characters that actually speak. Here's a blog
       with a bunch of examples: https://toinfinityai.github.io/v2-launch-
       page/  If you want to try it out, you can either (1) go to
       https://studio.infinity.ai/try-inf2, or (2) post a comment in this
       thread describing a character and we'll generate a video for you
       and reply with a link. For example: "Mona Lisa saying 'what the
       heck are you smiling at?'": https://bit.ly/3z8l1TM "A 3D pixar-
       style gnome with a pointy red hat reciting the Declaration of
       Independence": https://bit.ly/3XzpTdS "Elon Musk singing Fly Me To
       The Moon by Sinatra": https://bit.ly/47jyC7C  Our tool at Infinity
       allows creators to type out a script with what they want their
       characters to say (and eventually, what they want their characters
       to do) and get a video out. We've trained for about 11 GPU years
       (~$500k) so far and our model recently started getting good
       results, so we wanted to share it here. We are still actively
       training.  We had trouble creating videos of good characters with
       existing AI tools. Generative AI video models (like Runway and
       Luma) don't allow characters to speak. And talking avatar companies
       (like HeyGen and Synthesia) just do lip syncing on top of the
       previously recorded videos. This means you often get facial
       expressions and gestures that don't make sense with the audio,
       resulting in the "uncanny" look you can't quite put your finger on.
       See blog.  When we started Infinity, our V1 model took the lip
       syncing approach. In addition to mismatched gestures, this method
       had many limitations, including a finite library of actors (we had
       to fine-tune a model for each one with existing video footage) and
       an inability to animate imaginary characters.  To address these
       limitations in V2, we decided to train an end-to-end video
       diffusion transformer model that takes in a single image, audio,
       and other conditioning signals and outputs video. We believe this
       end-to-end approach is the best way to capture the full complexity
       and nuances of human motion and emotion. One drawback of our
       approach is that the model is slow despite using rectified flow
       (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).  Here
       are a few things the model does surprisingly well on: (1) it can
       handle multiple languages, (2) it has learned some physics (e.g. it
       generates earrings that dangle properly and infers a matching pair
       on the other ear), (3) it can animate diverse types of images
       (paintings, sculptures, etc) despite not being trained on those,
       and (4) it can handle singing. See blog.  Here are some failure
       modes of the model: (1) it cannot handle animals (only humanoid
       images), (2) it often inserts hands into the frame (very annoying
       and distracting), (3) it's not robust on cartoons, and (4) it can
       distort people's identities (noticeable on well-known figures). See
       blog.  Try the model here: https://studio.infinity.ai/try-inf2
       We'd love to hear what you think!
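
        For the technically curious, here is a toy illustration of what
        "audio-driven" means mechanically: cross-attention from video
        latent tokens to audio feature frames, so every generated frame
        can depend on the soundtrack. This is a sketch of the general
        technique only, not our actual architecture or code; all shapes
        and names below are made up.

            import numpy as np

            def softmax(x, axis=-1):
                x = x - x.max(axis=axis, keepdims=True)
                e = np.exp(x)
                return e / e.sum(axis=axis, keepdims=True)

            def audio_cross_attention(video_tokens, audio_tokens,
                                      Wq, Wk, Wv):
                Q = video_tokens @ Wq                    # queries: video
                K = audio_tokens @ Wk                    # keys: audio
                V = audio_tokens @ Wv                    # values: audio
                scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T_vid, T_aud)
                return softmax(scores) @ V               # audio-aware video

            rng = np.random.default_rng(0)
            d, dk = 64, 32
            video = rng.standard_normal((16, d))  # 16 video latent tokens
            audio = rng.standard_normal((50, d))  # 50 audio feature frames
            Wq, Wk, Wv = (0.1 * rng.standard_normal((d, dk))
                          for _ in range(3))
            print(audio_cross_attention(video, audio, Wq, Wk, Wv).shape)

        The point of the sketch is simply that the audio is an input to
        the denoiser itself (conditioning the diffusion transformer over
        the 3D VAE latent), rather than a post-hoc lip-syncing step.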
        
       Author : lcolucci
       Score  : 171 points
       Date   : 2024-09-06 16:47 UTC (6 hours ago)
        
       | DevX101 wrote:
       | Any details yet on pricing or too early?
        
         | lcolucci wrote:
         | It's free right now, and we'll try to keep it that way as long
         | as possible
        
       | sroussey wrote:
        | I look forward to movies that are dubbed by moving the face+lips
        | to match the dubbed dialogue, while also using the original
        | actor's voice.
        
         | lcolucci wrote:
         | agreed!
        
         | foreigner wrote:
         | Wow that would be very cool.
        
         | SwiftyBug wrote:
          | +1 for the lips matching the dubbed speech, but I'm not sure
          | about cloning the actor's voice. I really like dubbing actors'
          | unique voices and how they become the voice of certain
          | characters in their language.
        
         | schrijver wrote:
          | I thought the general public was starting to accept subtitles,
          | so I was rather hoping we'd see the end of dubbed movies!
        
       | dang wrote:
       | This is my favorite: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
        
         | lcolucci wrote:
         | Love this one as well. It's a painting of Trithemius, a German
         | monk, who actually said that
        
           | klipt wrote:
           | Although I assume he didn't say it in British English ;-)
        
             | lcolucci wrote:
             | No, probably not haha ;-)
        
       | LarsDu88 wrote:
       | Putting Drake as a default avatar is just begging to be sued.
       | Please remove pictures of actual people!
        
         | bongodongobob wrote:
         | Sounds like free publicity to me.
        
         | stevenpetryk wrote:
         | That would be ironic given how Drake famously performed
         | alongside an AI recreation of Pac.
        
         | sidneyprimas wrote:
          | Ya, this is tricky. Our stance is that people should be able
          | to make funny, parody videos with famous people.
        
       | aramndrt wrote:
       | Quick tangent: Does anybody know why many new companies have this
       | exact web design style? Is it some new UI framework or other
       | recent tool? The design looks sleek, but they all appear so
       | similar.
        
         | bearjaws wrote:
          | My sad millennial take is: we're in the brain-rot era. If a
          | piece of content doesn't have immediate animation / video and
          | that "wowww" sound bite, nobody pays attention.
         | 
         | https://www.youtube.com/watch?v=Xp2ROiFUZ6w
        
           | stevenpetryk wrote:
           | My happy millennial take is that browsers have made strides
           | in performance and flexibility, and people are utilizing that
           | to build more complex and dynamic websites.
           | 
           | Simplicity and stillness can be beautiful, and so can
           | animations. Enjoying smooth animations and colorful content
           | isn't brain rot imo.
        
             | whyslothslow wrote:
             | It may be unpopular, but my opinion is that web pages must
             | not have non-consensual movement.
             | 
             | I'll begrudgingly accept a default behavior of animations
             | turned on, but I want the ability to stop them. I want to
             | be able to look at something on a page without other parts
             | of the page jumping around or changing form while I'm not
             | giving the page any inputs.
             | 
             | For some of us, it's downright exhausting to ignore all the
             | motion and focus on the, you know, actual content. And I
             | hate that this seems to be the standard for web pages these
             | days.
             | 
             | I realize this isn't particularly realistic or enforceable.
             | But one can dream.
        
               | mnahkies wrote:
                | For sites that have paid enough attention to
                | accessibility, you might be able to configure your
                | browser/OS such that this media query applies:
                | https://developer.mozilla.org/en-
                | US/docs/Web/CSS/@media/pref... - it's designed to
                | encourage offering low-motion alternatives.
        
         | lcolucci wrote:
         | Do you mean on the infinity.ai site or studio.infinity.ai? On
         | infinity.ai we just wanted something fast and easy. This is
         | MagicUI
        
         | sidneyprimas wrote:
         | It's much easier to use standard CSS packages, and these come
         | with more standard styles. Our team doesn't have much
         | experience building websites, so we just went with the standard
         | styles. We used TailwindCSS.
        
         | ricardobeat wrote:
         | Designers today are largely driven by trends (just like
         | engineering?). Being cool = jumping on the latest bandwagon,
         | not being unique or better. The good news is this particular
         | style is pretty much restricted to tech companies, I think it
         | started with https://neon.tech a few years ago or a similar
         | startup.
         | 
         | Incidentally, the same behaviour is seen in academia. These
         | websites for papers are all copying this one from 2020:
         | https://nerfies.github.io/
        
       | ladidahh wrote:
        | I uploaded an image and then used text to image; in both cases
        | the videos were not animated, but the audio was included.
        
         | lcolucci wrote:
         | can you clarify? what image did you use? or send the link to
         | the resulting video
        
         | andrew-w wrote:
         | This can happen with non-humanoid images. The model doesn't
         | know how to animate them.
        
       | naveensky wrote:
       | Is there any limitation on the video length?
        
         | lcolucci wrote:
          | Our transformer model was trained to generate videos that are
          | up to 8s in length. However, we can make longer videos by
          | using it in an autoregressive manner, taking the last N frames
          | of output i to seed output (i+1). It is important to use more
          | than just 1 frame; otherwise, the direction of movement can
          | suddenly change, which looks very uncanny. Admittedly, the
          | autoregressive approach tends to accumulate errors with each
          | generation.
         | 
          | It is also possible to fine-tune the model so that single
          | generations (one forward pass of the model) are longer than
          | 8s, and we plan to do this. In practice, it just means our
          | batch sizes have to be smaller when training.
         | 
         | Right now, we've limited the public tool to only allow videos
         | up to 30s in length, if that is what you were asking.
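          | 
          | To make the recursive extension concrete, here's a rough
          | sketch of the loop. This is illustrative only: gen_chunk is a
          | stand-in for one forward pass of our model, the numbers are
          | made up, and the overlap handling is simplified.
          | 
          |   import numpy as np
          | 
          |   FPS, SECS, N = 24, 8, 8   # illustrative numbers
          | 
          |   def gen_chunk(seed_frames, audio):
          |       # Stand-in for one forward pass of the model.
          |       return np.zeros((FPS * SECS, 320, 320, 3), np.uint8)
          | 
          |   def gen_long(first_frame, audio_chunks):
          |       video = gen_chunk(first_frame[None], audio_chunks[0])
          |       for audio in audio_chunks[1:]:
          |           seed = video[-N:]   # last N frames, not just 1
          |           nxt = gen_chunk(seed, audio)
          |           # drop the re-generated seed frames
          |           video = np.concatenate([video, nxt[N:]])
          |       return video
          | 
          | The error accumulation comes from each new chunk being
          | conditioned on generated frames rather than real ones.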
        
           | naveensky wrote:
           | Thanks for answering this. I would love to use it when APIs
           | are available to integrate with my apps
        
           | leobg wrote:
           | Video compression algorithms use key frames. So can't you do
           | the same thing? Essentially, generate five seconds. Then pull
           | out the last frame. Use some other AI model to enhance it
           | (upscale, consistency with the original character, etc.).
           | Then use that as the input for the next five seconds?
        
             | andrew-w wrote:
             | This is a good idea. We have discussed incorporating an
             | additional "identity" signal to the conditioning, but
             | simply enforcing consistency with the original character as
             | a post-processing step would be a lot easier to try. Are
             | there any tools you know of that do that?
        
       | naveensky wrote:
        | Is it similar to https://loopyavatar.github.io/? I was reading
        | about it today, and even the videos are exactly the same.
        | 
        | I'm curious whether you're in any way related to that team?
        
         | vunderba wrote:
         | It was posted to hacker news as well within the last day.
         | 
         | https://news.ycombinator.com/item?id=41463726
         | 
         | Examples are very impressive, here's hoping we get an
         | implementation of it on huggingface soon so we can try it out,
         | and even potentially self-host it later.
        
         | aaroninsf wrote:
          | Either this is the commercialization of the work of that
          | project, by its authors or collaborators,
          | 
          | or it appears to be a straight-up grift, wrapping someone
          | else's work with a SPA website.
          | 
          | I don't see other possibilities.
        
           | sidneyprimas wrote:
           | We are not related to Loopy Avatar. We trained our own
           | models. It's a coincidence that they launched yesterday.
           | 
           | In the AI/research community, people often try to use the
           | same examples so that it's easier to compare performance
           | across different models.
        
             | echelon wrote:
             | You should watch out for Hedra and Sync. Plus a bunch of
             | Loopy activity on Discord.
        
           | zaptrem wrote:
           | I know these guys in real life, they've been working on this
           | for months and, unlike the ByteDance paper, have actually
           | shipped something you can try yourself.
        
           | ricardobeat wrote:
           | These papers are simply using each other's examples to make
           | performance comparisons possible.
           | 
           | This is EMO from 6 months ago:
           | https://humanaigc.github.io/emote-portrait-alive/
        
         | lcolucci wrote:
         | No, not related. We just took some of Loopy's demo images +
         | audios since they came out 2 days ago and people were aware of
         | them. We want to do an explicit side-by-side at some point, but
         | in the meantime people can make their own comparisons, i.e.
         | compare how the two models perform on the same inputs.
         | 
         | Loopy is a Unet-based diffusion model, ours is a diffusion
         | transformer. This is our own custom foundation model we've
         | trained.
        
           | Stevvo wrote:
           | So, you're saying you took Loopy's work and tried to pass it
           | off as your own, "because people were aware of it" ?
        
             | csallen wrote:
             | No
        
           | arcticfox wrote:
           | This took me a minute - your output demos are your own, but
           | you included some of their inputs, to make for an easy
           | comparison? Definitely thought you copied their outputs at
           | first and was baffled.
        
             | lcolucci wrote:
              | Exactly. Most talking avatar papers re-use each other's
              | images + audios in their demo clips. It's just a thing
              | everyone does... we never thought that people would think
              | it means we didn't train our own model!
              | 
              | Anyone who wants to can re-make all the videos themselves
              | with our model by extracting the 1st frame and audio.
        
             | sidneyprimas wrote:
             | Yes, exactly! We just wanted to make it easy to compare. We
             | also used some inputs from other famous research papers for
             | comparison (EMO and VASA). But all videos we show on our
             | website/blog are our own. We don't host videos from any
             | other model on our website.
             | 
             | Also, Loopy is not available yet (they just published the
             | research paper). But you can try our model today, and see
             | if it lives up to the examples : )
        
         | cchance wrote:
          | Holy shit, Loopy is good. I imagine it's another closed model;
          | open source never gets good shit like that :(
        
       | lofaszvanitt wrote:
       | Rudimentary, but promising.
        
       | ianbicking wrote:
       | The actor list you have is so... cringe. I don't know what it is
       | about AI startups that they seem to be pulled towards this kind
       | of low brow overly online set of personalities.
       | 
       | I get the benefit of using celebrities because it's possible to
       | tell if you actually hit the mark, whereas if you pick some
       | random person you can't know if it's correct or even stable. But
       | jeez... Andrew Tate in the first row? And it doesn't get better
       | as I scroll down...
       | 
       | I noticed lots of small clips so I tried a longer script, and it
       | seems to reset the scene periodically (every 7ish seconds). It
       | seems hard to do anything serious with only small clips...?
        
         | sidneyprimas wrote:
         | Thanks for the feedback! The good news is that the new V2 model
         | will allow people to create their own actors very easily, and
         | so we won't be restricted to the list. You can try that model
         | out here: https://studio.infinity.ai/
         | 
         | The rest of our website still uses the V1 model. For the V1
         | model, we had to explicitly onboard actors (by fine-tuning our
         | model for each new actor). So, the V1 actor list was just made
         | based on what users were asking for. If enough users asked for
         | an actor, then we would fine-tune a model for that actor.
         | 
         | And yes, the 7s limit on v1 is also a problem. V2 right now
         | allows for 30s, and will soon allow for over a minute.
         | 
         | Once V2 is done training, we will get it fully integrated into
         | the website. This is a pre-release.
        
           | ianbicking wrote:
           | Ah, I didn't realize I had happened upon a different model.
           | Your actor list in the new model is much more reasonable.
           | 
           | I do hope more AI startups recognize that they are projecting
           | an aesthetic whether they want to or not, and try to avoid
           | the middle school boy or edgelord aesthetic, even if that
           | makes up your first users.
           | 
            | Anyway, looking at V2 and seeing the female statue makes me
            | think about what it would be like to take all the dialog
            | from Galatea (https://ifdb.org/viewgame?id=urxrv27t7qtu52lb)
            | and put it through this. [time passes :)...] Trying what I
            | think is the actual statue from the story is not a great
            | fit; it feels too worn by time
           | (https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...). But with another statue I get something
           | much better: https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...
           | 
           | One issue I notice in that last clip, and some other clips,
           | is the abrupt ending... it feels like it's supposed to keep
           | going. I don't know if that's an artifact of the input audio
           | or what. But I would really like it if it returned to a kind
           | of resting position, instead of the sense that it will keep
           | going but that the clip was cut off.
           | 
           | On a positive note, I really like the Failure Modes section
           | in your launch page. Knowing where the boundaries are gives a
           | much better sense of what it can actually do.
        
             | andrew-w wrote:
             | Very creative use cases!
             | 
             | We are trying to better understand the model behavior at
             | the very end of the video. We currently extend the audio a
             | bit to mitigate other end-of-video artifacts
             | (https://news.ycombinator.com/item?id=41468520), but this
             | can sometimes cause uncanny behavior similar to what you
             | are seeing.
        
       | w10-1 wrote:
       | Breathtaking!
       | 
       | First, your (Lina's) intro is perfect in honestly and briefly
       | explaining your work in progress.
       | 
       | Second, the example I tried had a perfect interpretation of the
       | text meaning/sentiment and translated that to vocal and facial
       | emphasis.
       | 
       | It's possible I hit on a pre-trained sentence. With the default
       | manly-man I used the phrase, "Now is the time for all good men to
       | come to the aid of their country."
       | 
       | Third, this is a fantastic niche opportunity - a billion+ memes a
       | year - where each variant could require coming back to you.
       | 
       | Do you have plans to be able to start with an existing one and
       | make variants of it? Is the model such that your service could
       | store the model state for users to work from if they e.g., needed
       | to localize the same phrase or render the same expressivity on
       | different facial phenotypes?
       | 
       | I can also imagine your building different models for niches:
       | faces speaking, faces aging (forward and back); outside of
       | humans: cartoon transformers, cartoon pratfalls.
       | 
       | Finally, I can see both B2C and B2B, and growth/exit strategies
       | for both.
        
         | lcolucci wrote:
         | Thank you! You captured the things we're excited about really
         | well. And I'm glad your video was good! Honestly, I'd be
         | surprised if that sentence was in the training data... but that
         | default guy tends to always look good.
         | 
         | Yes, we plan on allowing people to store their generations,
         | make variations, mix-and-match faces with audios, etc. We have
         | more of an editor-like experience (script-to-video) in the rest
         | of our web app but haven't had time to move the new V2 model
         | there yet. Soon!
        
       | zaptrem wrote:
       | The e2e diffusion transformer approach is super cool because it
       | can do crazy emotions which make for great memes (like Joe Biden
       | at Live Aid! https://youtu.be/Duw1COv9NGQ)
       | 
       | Edit: Duke Nukem flubs his line: https://youtu.be/mcLrA6bGOjY
        
         | lcolucci wrote:
          | Nice :) It's been really cool to see the model get more and
          | more expressive over time
        
         | andrew-w wrote:
         | I don't think we've seen laughing quite that expressive before.
         | Good find!
        
       | dorianmariefr wrote:
       | quite slow btw
        
         | andrew-w wrote:
         | Yeah, it's about 5x slower than realtime with the current
         | configuration. The good news is that diffusion models and
         | transformers are constantly benefitting from new acceleration
         | techniques. This was a big reason we wanted to take a bet on
         | those architectures.
         | 
          | Edit: If we generate videos at a lower resolution and with
          | fewer diffusion steps than what's used in the public
          | configuration, we are able to generate videos at 20-23 fps,
          | which is just about real-time. Here is an example:
          | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...
        
           | lcolucci wrote:
           | Woah that's a good find Andrew! That low-res video looks
           | pretty good
        
           | ilaksh wrote:
           | Wowww.. can you buy more hardware and make a realtime
           | websocket API?
        
       | naveensky wrote:
       | For such models, is it possible to fine-tune models with multiple
       | images of the main actor?
       | 
        | Sorry if this question sounds dumb, but I am comparing it with
        | regular image models, where the more images you have, the
        | better the output images you can generate.
        
         | andrew-w wrote:
         | It is possible to fine-tune the model with videos of a specific
         | actor, but not images. You need videos to train the model.
         | 
         | We actually did this in early overfitting experiments (to
         | confirm our code worked!), and it worked surprisingly well.
          | This is exciting to us, because it means we can have actor-
          | specific models that learn the idiosyncratic gestures of a
          | particular person.
        
       | PerilousD wrote:
       | Damn - I took an (AI) image that I "created" a year ago that I
       | liked and then you animated it AND let it sing Amazing Grace.
        | Seeing IS believing, but this technology pretty much means video
        | evidence ain't necessarily so.
        
         | lcolucci wrote:
         | We're definitely moving into a world where seeing is no longer
         | believing
        
       | cchance wrote:
        | I tried the Drake avatar saying some stuff, and while it's cool,
        | it's still lacking; his teeth are partially disappearing :S
        
         | sidneyprimas wrote:
          | Agreed! The teeth can be problematic. The good news is we just
          | need to train at higher resolution (right now we are at
          | 320x320px), and that should resolve the teeth issue.
          | 
          | So far, we have purposely trained on low resolution to make
          | sure we get the gross expressions / movements right. The final
          | stage of training will be using higher-resolution training
          | data. Fingers crossed.
        
         | andrew-w wrote:
         | Thanks for the feedback. The current model was trained at
         | ~320x320 resolution. We believe going higher will result in
         | better videos with finer detail, which we plan to do soon.
        
       | nickfromseattle wrote:
       | I need to create a bunch of 5-7 minute talking head videos.
       | What's your timeline for capabilities that would help with this?
        
         | lcolucci wrote:
         | Our model can recursively extend video clips, so theoretically
         | we could generate your 5-7min talking head videos today. In
         | practice, however, error accumulates with each recursion and
         | the video quality gets worse and worse over time. This is why
         | we've currently limited generations to 30s.
         | 
         | We're actively working on improving stability and will
         | hopefully increase the generation length soon.
        
       | sharemywin wrote:
       | accidentally clicked the generate button twice.
        
       | Andrew_nenakhov wrote:
        | I wonder how long it would take for this technology to advance
        | to a point where nice people from /r/freefolk would be able to
        | remake seasons 7 and 8 of Game of Thrones to have a nice, proper
        | ending. 5 years? 10?
        
         | lcolucci wrote:
         | I'd say the 5 year ballpark is about right, but it'll involve
         | combining a bunch of different models and tools together. I
         | follow a lot of great AI filmmakers on Twitter. They typically
         | make ~1min long videos using 3-8 different tools... but even
         | those 1min videos were not possible 9 months ago! Things are
         | moving fast
        
         | andrew-w wrote:
         | Haha, wouldn't we all love that? In the long run, we will
         | definitely need to move beyond talking heads, and have tools
         | that can generate full actors that are just as expressive. We
         | are optimistic that the approach used in our V2 model will be
         | able to get there with enough compute.
        
         | squarefoot wrote:
         | In a few years we'll have entire shows made exclusively by AI.
        
       | RobinL wrote:
        | Have to say, whilst this tech has some creepy aspects, just
        | playing about with it my family have had a whole sequence of
        | laugh-out-loud moments - thank you!
        
         | lcolucci wrote:
         | I'm so glad! We're trying to increase the laugh out loud
         | moments in the world :)
        
         | sidneyprimas wrote:
         | This makes me so happy. Thanks for reporting back! Goal is to
         | reduce creepiness over time.
        
       | Andrew_nenakhov wrote:
       | I tried making this short clip [0] of Baron Vladimir Harkonnen
       | announcing the beginning of the clone war, and it's almost fine,
       | but the last frame somehow completely breaks.
       | 
       | [0]: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
        
         | lcolucci wrote:
         | This is a bug in the model we're aware of but haven't been able
         | to fix yet. It happens at the end of some videos but not all.
         | 
          | Our hypothesis is that the "breakdown" happens when there's a
          | sudden change in audio levels (from audio to silence at the
          | end). To handle this, we extend the end of the audio clip and
          | then cut the corresponding frames out of the video, but it's
          | not working well enough.
        
           | drhodes wrote:
           | just an idea, but what if the appended audio clip was
           | reversed to ensure continuity in the waveform? That is, if ><
           | is the splice point and CLIP is the audio clip, then the idea
           | would be to construct CLIP><PILC.
        
             | andrew-w wrote:
             | This is exactly what we do today! It seems to work better
             | the more you extend it, but extending it too much
             | introduces other side effects (e.g. the avatar will start
             | to open its mouth, as if it were preparing to talk).
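              | 
              | For anyone curious, the padding step itself is tiny. A
              | minimal sketch (illustrative numbers; the real pad length
              | is tuned):
              | 
              |   import numpy as np
              | 
              |   def reflect_pad(audio, pad):
              |       # Append a time-reversed copy of the tail so the
              |       # waveform is continuous at the splice (CLIP><PILC).
              |       return np.concatenate([audio, audio[-pad:][::-1]])
              | 
              |   sr = 16000
              |   padded = reflect_pad(np.zeros(5 * sr, np.float32), sr // 2)
              |   # generate on `padded`, then drop the frames that
              |   # correspond to the extra half second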
        
               | drhodes wrote:
               | Hmm, maybe adding white noise would work. -- OK, that's
               | quite enough unsolicited suggestions from me up in the
               | peanut gallery. Nice job on the website, it's impressive,
               | thank you for not requiring a sign up.
        
               | andrew-w wrote:
               | All for suggestions! We've tried white noise as well, but
               | it only works on plain talking samples (not music, for
               | example). My guess is that the most robust solution will
               | come from updating how it's trained.
        
       | sharemywin wrote:
        | you need a slider for how animated the facial expressions are.
        
         | lcolucci wrote:
          | That's a good idea! CFG is roughly correlated with
          | expressiveness, so we might expose that to the user at some
          | point.
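          | 
          | For context, classifier-free guidance mixes the model's
          | conditioned and unconditioned predictions, so an
          | expressiveness slider could more or less just expose that
          | scale. A sketch of the standard CFG formula (not our exact
          | code):
          | 
          |   import numpy as np
          | 
          |   def apply_cfg(cond, uncond, scale):
          |       # Larger scale pushes the output harder toward the
          |       # conditioning (here, the audio), which tends to read
          |       # as bigger, more exaggerated expressions.
          |       return uncond + scale * (cond - uncond)
          | 
          |   print(apply_cfg(np.ones(4), np.zeros(4), 3.0))  # [3. 3. 3. 3.]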
        
       | johnyzee wrote:
        | It's incredibly good - bravo. The only thing missing for this to
        | be immediately useful for content creation is more variety in
        | voices, or ideally a way to specify a template sound clip to
        | imitate.
        
         | andrew-w wrote:
         | Thanks for the feedback! We used to have more voices, but
         | didn't love the experience, since users had no way of knowing
         | what each voice sounded like without creating a clip
         | themselves. Probably having pre-generated samples for each one
         | would solve that. Let us know if you have any other ideas.
         | 
         | We're also very excited about the template idea! Would love to
         | add that soon.
        
       | artur_makly wrote:
       | oh this made my day: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
       | 
       | !NWSF --lyrics by Biggy$malls
        
         | lcolucci wrote:
         | that's a great one!
        
         | kelseyfrog wrote:
         | Big Dracula Flow energy which is not bad :)
        
         | knodi123 wrote:
         | So if we add autotune....
        
       | slt2021 wrote:
       | great job Andrew and Sidney!
        
       | billconan wrote:
       | can this achieve real-time performance or how far are we from a
       | real-time model?
        
         | andrew-w wrote:
         | The model configuration that is publicly available is about 5x
         | slower than real-time (~6fps). At lower resolution and with a
         | less conservative number of diffusion steps, we are able to
         | generate the video at 20-23 fps, which is just about real-time.
         | Here is an example:
         | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...
         | 
          | We use rectified flow for denoising, which is a (relatively)
          | recent advancement in diffusion models that allows them to run
          | a lot faster. We also use a 3D VAE that compresses the video
          | along both spatial and temporal dimensions. Temporal
          | compression also improves speed.
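          | 
          | To give a rough feel for why rectified flow helps: sampling
          | becomes integrating a learned velocity field along nearly
          | straight paths, so only a handful of Euler steps are needed.
          | A toy sketch, with a dummy velocity function standing in for
          | the real video DiT and a made-up latent shape (compressed in
          | space and time, as a 3D VAE would produce):
          | 
          |   import numpy as np
          | 
          |   def velocity(x, t):
          |       # Stand-in for the learned v_theta(x, t); a trained
          |       # model predicts the direction from noise toward data.
          |       return -x
          | 
          |   def sample(shape, steps=8, rng=np.random.default_rng(0)):
          |       x = rng.standard_normal(shape)   # pure noise at t = 1
          |       dt = 1.0 / steps
          |       for i in range(steps):
          |           t = 1.0 - i * dt
          |           x = x + velocity(x, t) * dt  # one Euler step
          |       return x
          | 
          |   latent = sample((24, 40, 40, 8))  # e.g. a compressed latent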
        
       | marginalia_nu wrote:
       | Tried my hardest to push this into the uncanny valley. I did, but
       | it was pretty hard. Seems robust.
       | 
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
         | lcolucci wrote:
         | Nice! Earlier checkpoints of our model would "gender swap" when
         | you had a female face and male voice (or vice versa). It's more
         | robust to that now, which is good, but we still need to improve
         | the identity preservation
        
         | layer8 wrote:
         | The jaw is particularly unsettling somehow.
        
         | klipt wrote:
         | It even works on animals:
         | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
           | lcolucci wrote:
           | I think you've made the 1st ever talking dog with our model!
           | I didn't know it could do that
        
         | trunch wrote:
         | Not robust enough to work against a sketch
         | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
         | 
         | though perhaps it rebelled against the message
        
           | marginalia_nu wrote:
           | https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...
           | 
           | xD
        
           | andrew-w wrote:
           | Cartoons are definitely a limitation of the current model.
        
       | bschmidt1 wrote:
       | Amazing work! This technology is only going to improve. Soon
       | there will be an infinite library of rich and dynamic games,
       | films, podcasts, etc. - a totally unique and fascinating
       | experience tailored to you that's only a prompt away.
       | 
       | I've been working on something adjacent to this concept with
       | Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but
       | focused not just on creating characters but producing creative
       | deliverables using them.
        
         | lcolucci wrote:
         | Very cool! If we release an API, you could use it across the
         | different Ragdoll experiences you're creating. I agree
         | personalized character experiences are going to be a huge
         | thing. FYI we plan to allow users to save their own characters
         | (an image + voice combo) soon
        
           | bschmidt1 wrote:
           | > If we release an API, you could use it
           | 
           | Absolutely, especially if the pricing makes sense! Would be
           | very nice to just focus on the creative suite which is the
           | real product, and less on the AI infra of hosting models,
           | vector dbs, and paying for GPU.
           | 
           | Curious if you're using providers for models or self-hosting?
        
       | jl6 wrote:
       | Say I'm a politician who gets caught on camera doing or saying
       | something shady. Will your service do anything to prevent me from
       | claiming the incriminating video was just faked using your
       | technology? Maybe logging perceptual hashes of every output could
       | prove that a video didn't come from you?
        
         | bee_rider wrote:
          | These sorts of models are probably going to end up released as
          | publicly available weights at some point, right? Or, if one
          | can be trained for $500k today, how much will it cost in a
          | couple of years? IMO we can't stuff this genie back in the
          | bottle, for better or worse. A video won't be solid evidence
          | of much within our lifetimes.
        
           | sidneyprimas wrote:
            | That's how I see it as well. Very soon, people will assume
            | most videos are AI generated, and the burden of proof will
            | be on people claiming videos are real. We plan to embed some
            | kind of hash to indicate our video is AI generated, but
            | people will be able to get around this. Google/Apple/Samsung
            | seem to be in the best place to solve this: whenever their
            | devices record a real video, they can generate a hash
            | directly in HW for that video, which can be used to verify
            | that it was actually recorded by that phone.
            | 
            | Also, I think it will cost around $100k to train a model at
            | this quality level within 1-2 years, and it will only go
            | down from there. So, the genie is out of the bottle.
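            | 
            | As a sketch of the hash-logging idea only (a plain
            | cryptographic hash, which breaks as soon as the file is
            | re-encoded; a perceptual hash like the parent suggests
            | would survive that better):
            | 
            |   import hashlib, hmac
            | 
            |   SECRET = b"server-side key"  # hypothetical, never shipped
            | 
            |   def provenance_record(video_bytes: bytes) -> dict:
            |       digest = hashlib.sha256(video_bytes).hexdigest()
            |       tag = hmac.new(SECRET, digest.encode(),
            |                      hashlib.sha256).hexdigest()
            |       # logged server-side at generation time
            |       return {"sha256": digest, "hmac": tag}
            | 
            |   # Anyone holding a file can recompute its sha256 later and
            |   # ask the service whether that digest is in the log.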
        
             | bee_rider wrote:
             | That makes sense. It isn't reasonable to expect malicious
             | users to helpfully set the "evil bit," but you can at least
             | add a little speedbump by hashing your own AI generated
             | content (and the presence of videos that _are_ verifiably
             | AI generated will at least probably catch some particularly
             | lazy /incompetent bad actors, which will destroy their
             | credibility and also be really funny).
             | 
              | In the end though, the incentive and the capability lie in
              | the hands of camera manufacturers. It is unfortunate that
              | videos from the pre-AI era had no real reason to be made
              | verifiable...
             | 
             | Anyway, recordings of politicians saying some pretty
             | heinous things haven't derailed some of their campaigns
             | anyway, so maybe none of this is really worth worrying
             | about in the first place.
        
           | sidneyprimas wrote:
            | Ya, it's only a matter of time until very high-quality video
            | models are open sourced.
        
       | deisteve wrote:
       | what is the TTS model you are using
        
         | lcolucci wrote:
         | We use more than one but ElevenLabs is a major one. The voice
         | names in the dropdown menu ("Amelia", "George", etc) come from
         | ElevenLabs
        
       | svieira wrote:
       | Quite impressive - I tried to confuse it with things it would not
       | generally see and it avoided all the obvious confabulations
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
         | andrew-w wrote:
         | Thank you! It has learned a surprising amount of world
         | knowledge.
        
         | lcolucci wrote:
         | Wow this worked so well! Sometimes with long hair and
         | paintings, it separates part of the hair from the head but not
         | here
        
       | ilaksh wrote:
       | It would be amazing to be able to drive this with an API.
        
         | sidneyprimas wrote:
         | We are considering it. Do you have anything specific you want
         | to use it for?
        
           | ilaksh wrote:
           | Basically as a more engaging alternative to Eleven Labs or
           | other TTS.
           | 
           | I am working on my latest agent (and character) framework and
           | I just started adding TTS (currently with the TTS library and
           | xtts_v2 which I think is maybe also called Style TTS.) By the
           | way, any idea what the license situation is with that?
           | 
           | Since it's driven by audio, I guess it would come after the
           | TTS.
        
       | archon1410 wrote:
        | The website is pretty lightweight and easy to use. The service
        | also holds up pretty well, especially if the source image is
        | high-enough resolution. The tendency to "break" at the last
        | frame seems to happen with low-resolution images.
       | 
       | My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
        
         | lcolucci wrote:
         | Thank you! It's interesting you've noticed the last frame
         | breakdown happening more with low-res images. This is a good
         | hypothesis that we should look into. We've been trying to debug
         | that issue
        
       | atum47 wrote:
       | This is super funny.
        
       | max4c wrote:
       | This is amazing and another moment where I question what the
       | future of humans will look like. So much potential for good and
       | evil! It's insane.
        
         | lcolucci wrote:
         | thank you! it's for sure an interesting time to be alive...
         | can't complain about it being boring
        
       | modeless wrote:
       | Won't be long before it's real time. The first company to launch
       | video calling with good AI avatars is going to take off.
        
         | andrew-w wrote:
         | Totally agree. We tweaked some settings after other commenters
         | asked about speed, and got it up to 23fps generation (at the
         | cost of lower resolution). Here is the example:
         | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...
        
       | squarefoot wrote:
       | Someone had to do that, so here it is:
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
       | yellowapple wrote:
       | As soon as I saw the "Gnome" face option I gnew exactly what I
       | gneeded to do: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
       | 
       | EDIT: looks like the model doesn't like Duke Nukem:
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
       | 
       | Cropping out his pistol only made it worse lol:
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
       | 
       | A different image works a little bit better, though:
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
         | andrew-w wrote:
         | This is why we do what we do lol
        
         | ainiriand wrote:
         | Haha I almost wake up my kid with my sudden laugh!
        
         | zaptrem wrote:
         | Fixed Duke Nukem: https://youtu.be/mcLrA6bGOjY
        
       | genaiguy wrote:
        | It's wild, this is like a 1:1 rip-off of Hedra...
       | 
       | "As far as we know, this is the first time someone has trained a
       | video diffusion transformer that's driven by audio input" did you
       | google?
        
         | lcolucci wrote:
         | We are big fans of Hedra. Do you know if they've publicly
         | commented on their model architecture? As far as we know, our
         | particular choice of an end-to-end diffusion + transformer is
         | novel.
         | 
         | We don't know what Hedra is doing. It could be the approach EMO
         | has taken (https://humanaigc.github.io/emote-portrait-alive/)
         | or VASA (https://www.microsoft.com/en-
         | us/research/project/vasa-1/) or Loopy Avatar
         | (https://loopyavatar.github.io/) or something else.
        
       | vessenes wrote:
       | Hi Lina, Andrew and Sidney, this is awesome.
       | 
        | My go-to for checking the edges of video and face identification
        | LLMs is Personas right now -- they're rendered faces done in a
        | painterly style, and can be really hard to parse.
       | 
       | Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-
       | storage.com/inf2...
       | 
       | Source image from: https://personacollective.ai/persona/1610
       | 
       | Overall, crazy impressive compared to competing offerings. I
       | don't know if the mouth size problems are related to the race of
       | the portrait, the style, the model, or the positioning of the
       | head, but I'm looking forward to further iterations of the model.
       | This is already good enough for a bunch of creative work, which
       | is rad.
        
         | lcolucci wrote:
         | I didn't know about Persona Collective - very cool!
         | 
         | I think the issues in your video are more related to the style
         | of the image and the fact that she's looking sideways than the
         | race. In our testing so far, it's done a pretty good job across
         | races. The stylized painting aesthetic is one of the harder
         | styles for the model to do well on. I would recommend trying
         | with a straight on portrait (rather than profile) and shorter
         | generations as well... it might do a bit better there.
         | 
         | Our model will also get better over time, but I'm glad it can
         | already be useful to you!
        
           | vessenes wrote:
           | It's not portrait orientation or gender specific or length
           | related: https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...
           | 
           | It's not stylization (alone): here's a short video using the
           | same head proportions as the original video, but the photo
           | style is a realistic portrait. I'd say the mouth is still
           | overly wide. https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...
           | 
            | I tentatively think it _might_ be race related -- this is
            | one done with a subject of a different race. Her mouth might
            | also be too wide? But it stands out a bit less to me.
           | https://6ammc3n5zzf5ljnz.public.blob.vercel-
           | storage.com/inf2...
           | 
            | p.s. happy to post to a bug tracker / github / whatever if
            | you prefer. I'm also happy to license over the Persona
            | Collective images if you want to pull them in for training /
            | testing -- feel free to email me. There's a move away from
            | 'painterly' style support in the current crop of diffusion
            | models (flux for instance absolutely CANNOT do painting
            | styles), and I think that's a shame.
           | 
           | Anyway, thanks! I really like this.
        
       | WaffleIronMaker wrote:
        | Does anybody know about the legality of using Eminem's "Godzilla"
       | as promotional material[1] for this service?
       | 
       | I thought you had to pay artists for a license before using their
       | work in promotional material.
       | 
       | [1] https://infinity.ai/videos/setA_video3.mp4
        
       | bufferoverflow wrote:
       | It completely falls apart on longer videos for me, unusable over
       | 10 seconds.
        
         | lcolucci wrote:
         | This is a good observation. Can you share the videos you're
         | seeing this with? For me, normal talking tends to work well
         | even on long generations. But singing or expressive audio
         | starts to devolve with more recursions (1 forward pass = 8
         | sec). We're working on this.
        
       | zach_miller wrote:
       | Tried to make this meme [1] a reality and the source image was
       | tough for it.
       | 
       | Heads up, little bit of language in the audio.
       | 
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
       | 
       | [1] https://i.redd.it/uisn2wx2ol0d1.jpeg
        
         | andrew-w wrote:
         | I see a lot of potential in animating memes and making them
         | more fun to share with friends. Hopefully, we can do better on
         | orcs soon!
        
       | guessmyname wrote:
       | Is this the original?
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
         | andrew-w wrote:
         | No, it's just a hallucination of the model. The audio in your
         | clip is synthetic and doesn't reflect any video in the real
         | world.
         | 
         | Hopefully we can animate your bear cartoon one day!
        
       | eth0up wrote:
       | Lemming overlords
       | 
       | https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
        
         | andrew-w wrote:
         | I know what will be in my nightmares tonight...
        
       | doctorpangloss wrote:
       | If you had a $500k training budget, why not buy 2 DGX machines?
        
       | toisanji wrote:
       | can we choose our own voices?
        
       ___________________________________________________________________
       (page generated 2024-09-06 23:00 UTC)