[HN Gopher] Audio2Photoreal
       ___________________________________________________________________
        
       Audio2Photoreal
        
       Author : wildpeaks
       Score  : 66 points
       Date   : 2024-01-04 18:03 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ilaksh wrote:
       | That's amazing. It's a non-commercial license though.
       | 
        | How feasible would it be to replicate what this model and
        | codebase are doing, in order to use it in a commercial capacity?
       | 
       | Did they release the dataset?
       | 
       | It would also be nice if Facebook would consider making an API to
       | give Heygen and Diarupt some competition, if they aren't going to
       | allow commercial use.
       | 
       | Although there will probably be a bunch of people who become
       | millionaires using this for their porn gf bot service who just
       | don't care about license restrictions.
        
       | pseudosavant wrote:
       | Like the rest of Facebook's AI research... I find this
       | underwhelming. Not even good enough to trigger uncanny valley
       | issues.
        
         | dtauzell wrote:
         | Are there some similar models that are currently better?
        
           | pseudosavant wrote:
            | I don't know, but I can't imagine having this as a feature in
            | any app (Zoom, etc.) and leaving it on. That is how most of
           | FB's AI research seems. Not good enough to make into a real
           | product or feature.
        
             | TaylorAlexander wrote:
              | The nature of this type of research is that there are
              | long-term goals which are currently unachievable, with no
              | clear concept for how to approach them, so researchers need
              | to start putting small pieces together and work out how to
              | make it all work smoothly as a single system. It looks
             | like someone had a neural network for mouth movement.
              | Someone had one for body movement, etc. Composing multiple
              | systems into one teaches us how to approach more complex
              | problems and how to tie things together better than simply
              | feeding the output of one into the input of another.
             | 
              | Long term, this type of work helps solve big problems even
              | if the intermediate steps don't produce exciting results.
             | 
             | As an example, early image generators were pretty
             | uninteresting but today they are widely utilized and
              | generally considered impressive. What researchers in the
              | field know that the public doesn't is that there are a
              | hundred boring steps before the exciting release, and some
              | of those boring steps are very exciting on a technical
              | level. Those intermediate achievements represent 99% of
              | what machine learning research actually is, and others in
              | the field appreciate that work.
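              | 
              | As a rough illustration of that compositional idea (this is
              | not Audio2Photoreal's actual code; the module names, feature
              | shapes, and use of PyTorch here are all assumptions), a
              | shared audio encoder feeding separate face and body heads
              | might look like:
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   class AudioEncoder(nn.Module):
              |       def __init__(self, n_mels=80, hidden=256):
              |           super().__init__()
              |           self.gru = nn.GRU(n_mels, hidden,
              |                             batch_first=True)
              | 
              |       def forward(self, mel):       # (B, T, n_mels)
              |           feats, _ = self.gru(mel)  # (B, T, hidden)
              |           return feats
              | 
              |   class MotionHead(nn.Module):
              |       """Maps shared audio features to pose parameters."""
              |       def __init__(self, hidden=256, out_dim=104):
              |           super().__init__()
              |           self.proj = nn.Linear(hidden, out_dim)
              | 
              |       def forward(self, feats):
              |           return self.proj(feats)   # (B, T, out_dim)
              | 
              |   class AvatarDriver(nn.Module):
              |       """One model with two heads, rather than piping one
              |       model's output into another's input."""
              |       def __init__(self):
              |           super().__init__()
              |           self.encoder = AudioEncoder()
              |           self.face = MotionHead(out_dim=50)   # face codes
              |           self.body = MotionHead(out_dim=104)  # joint angles
              | 
              |       def forward(self, mel):
              |           feats = self.encoder(mel)
              |           return self.face(feats), self.body(feats)
              | 
              |   mel = torch.randn(1, 200, 80)  # ~2 s of fake features
              |   face, body = AvatarDriver()(mel)
              |   print(face.shape, body.shape)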
        
         | echelon wrote:
         | Also CC-NC. They want free feedback, but won't let you use it
         | to make anything yourself.
        
         | smusamashah wrote:
          | This would be amazing in games. A game designer could easily
          | create realistic body movement using just audio.
        
       | ArekDymalski wrote:
        | Impressive. Even in its current state it would make RPGs like
        | Fallout or Skyrim sooo much more alive...
        
       | aantix wrote:
       | Why would we want an avatar vs a real video stream of the actual
       | person?
        
         | kuschku wrote:
          | Being able to have an avatar that fits your voice, without
          | having to actually look like it, has many applications.
         | 
         | Whether you're trans or you just want to join a video call
         | early in the morning without dressing up, the applications are
         | endless.
         | 
         | In many situations we demand that people dress or present a
         | certain way, just out of bullshit social expectations. This is
         | one way to eat your cake and have it too.
        
           | zamadatix wrote:
           | For those use cases you should be able to get much more
            | accurate results using a base video stream. This is a better
            | fit for cases where you're lacking a video stream entirely,
            | not just ones where you don't want to turn it on.
        
             | kridsdale1 wrote:
             | A video stream isn't volumetric.
             | 
             | This is for the metaverse.
        
         | bigfishrunning wrote:
         | You could generate the avatar clientside and save a ton of
         | bandwidth vs a compressed video stream...
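          | 
          | As a back-of-envelope sketch of that trade-off (all bitrates
          | below are assumed ballpark figures, not measurements), driving
          | an avatar from audio plus a small pose stream comes out roughly
          | an order of magnitude cheaper than conferencing-quality video:
          | 
          |   # Assumed ballpark bitrates, not measurements.
          |   video_kbps = 1500                 # ~720p conferencing video
          |   audio_kbps = 32                   # Opus-style speech audio
          |   pose_kbps = 104 * 30 * 16 / 1000  # 104 params/frame, 30 fps,
          |                                     # 16 bits each ~= 50 kbps
          |   avatar_kbps = audio_kbps + pose_kbps
          |   print(f"video:  {video_kbps} kbps")
          |   print(f"avatar: {avatar_kbps:.0f} kbps, "
          |         f"~{video_kbps / avatar_kbps:.0f}x less")
          | 
          | And if the avatar is synthesized purely from the audio on the
          | receiving end, as in this work, the pose stream could be
          | dropped entirely.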
        
         | esafak wrote:
         | Old recordings of people without pictures, for one!
        
         | zamadatix wrote:
          | Given it's by Meta, I'm guessing it's related to their metaverse
         | goals.
        
         | RobCodeSlayer wrote:
          | I'm imagining video game applications where the avatars are
         | controlled by both online users and LLMs
        
         | plaguuuuuu wrote:
          | Either games, or it's just interesting research that mostly
          | ties in with what FB is doing. There are real problems here:
          | e.g., imagine the bandwidth requirement of streaming 3D copies
          | of 20 people in a room.
          | 
          | It's simply not possible in the near future; even today,
          | Zoom/Teams video conferencing is highly compressed and shit
          | quality with just low-res 2D video.
        
       | leshokunin wrote:
       | Pretty cool. It's going to take a while to make it into a usable
       | product though. Having conversations with people flailing their
       | hands algorithmically is going to feel weird until it gets more
       | natural. Right now it feels like those "blink every n" scripts.
        
         | kridsdale1 wrote:
         | Every video game NPC is basically following such an algorithm.
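          | 
          | A minimal sketch of the kind of "blink every n" idle script
          | being described (the timings here are invented): fire the
          | blink after a randomized interval so it doesn't look
          | metronomic.
          | 
          |   import random
          |   import time
          | 
          |   def idle_blinks(duration_s=10.0, mean_gap_s=4.0):
          |       t = 0.0
          |       while t < duration_s:
          |           # Jitter the gap so blinks aren't perfectly periodic.
          |           gap = random.uniform(0.5 * mean_gap_s,
          |                                1.5 * mean_gap_s)
          |           time.sleep(gap)
          |           t += gap
          |           print(f"{t:5.1f}s  blink")  # trigger blink animation
          | 
          |   idle_blinks()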
        
       | CrzyLngPwd wrote:
       | It's really impressive.
       | 
       | I wonder where it is headed.
        
       | aaroninsf wrote:
       | Below the right wing, the world famous Uncanny Valley of Menlo
       | Park, one of the seven blunders of the natural world.
        
       | kridsdale1 wrote:
       | Goddamn that's cool.
       | 
        | End-state for Winamp visualizers: synthesize an entire living
       | world from the audio alone.
        
       ___________________________________________________________________
       (page generated 2024-01-04 23:00 UTC)