[HN Gopher] Audiobox: Meta's new foundation research model for a...
       ___________________________________________________________________
        
       Audiobox: Meta's new foundation research model for audio generation
        
       Author : reqo
       Score  : 204 points
       Date   : 2023-12-07 09:57 UTC (3 days ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | nuz wrote:
       | VR is gonna get wild in like 5 years if they keep this up
        
         | spaceman_2020 wrote:
         | Have you seen some of the demos people have been building with
         | Unreal 5.3? Insane stuff. In a decade, this stuff will be hard
         | to tell from reality.
        
           | SheinhardtWigCo wrote:
           | Sounds cool - got any specific ones to share?
        
             | dvh wrote:
             | https://youtu.be/A7tp4eg0ax8
        
               | jl6 wrote:
               | The geometry and lighting are amazing but I couldn't
               | detect any animation, like gentle motion due to wind.
        
               | flyaway123 wrote:
               | Some motion due to wind, in another video:
               | https://youtu.be/_B9hkn6wgNA?feature=shared&t=24
        
               | pants2 wrote:
               | That's really good, though I also want to point out some
               | of the amazing graphics that modders accomplished in
               | Crysis (2008): https://youtu.be/3w6COXBfIY4
        
               | jay-barronville wrote:
               | Incredible.
        
               | Racing0461 wrote:
               | The new Avatar (blue skin, not arrow) game looks like this
               | demo with some characters tossed in to control.
        
             | prakhar897 wrote:
             | https://www.youtube.com/watch?v=IK76q13Aqt0
        
         | nerdix wrote:
         | This is why I'm high on the metaverse long term. In ten years,
         | there will be a $500 (or whatever the 2033 inflation adjusted
         | value is) VR headset that blows the Apple Vision Pro out of the
         | water in terms of optics, will run a highly optimized version
         | of the latest revision of Llama locally (and it will be much
         | better than anything we currently have today), will come with
         | wifi 8 (so it will have multigigabit per second real world
         | performance), and will be capable of rendering graphics that
         | look much more realistic than Unreal Engine 5 (with high frame
         | rates due to AI upscaling and frame generation).
         | 
         | There will be people that will spend almost every waking hour
         | with one of those things attached to their face if they can
         | also make this device lightweight and comfortable.
        
           | doublerabbit wrote:
           | And yet I'll still be waiting five weeks to just download the
           | world because I'm stuck on 2Mb/s ADSL.
        
             | morbusfonticuli wrote:
             | Germany? :-)
        
       | 9dev wrote:
       | If I shut down every voice other than the optimist's one in my
       | head, this, along with other recent AI research, will mark the
       | advent of never-seen-before role play game possibilities. If the
       | current pace of progress continues, we'll see games with complete
       | narrative freedom for players, where you aren't limited to pre-
       | written answers anymore, but can actually talk to in-game
       | characters with your actual voice, goals, and motivations. And
       | those virtual conversation participants can talk back to you,
       | react to your words and actions in a believable, fully immersive
       | manner. That's a dream come true for every gamer on the face of
       | this earth, I believe.
       | 
       | The more rational voices in my mind, though, become more and more
       | afraid of a world where the only thing you can trust is people
       | sitting right in front of you. That makes the world of
       | information pretty small again.
        
         | logicchains wrote:
         | >If the current pace of progress continues, we'll see games
         | with complete narrative freedom for players, where you aren't
         | limited to pre-written answers anymore, but can actually talk
         | to in-game characters with your actual voice, goals, and
         | motivations. And those virtual conversation participants can
         | talk back to you, react to your words and actions in a
         | believable, fully immersive manner. That's a dream come true
         | for every gamer on the face of this earth, I believe.
         | 
         | It's basically Dungeons and Dragons with an AI dungeon master
         | who can generate video in realtime. Which would be awesome, but
         | like Dungeons and Dragons it wouldn't be easy to keep the
         | player on track.
        
           | 9dev wrote:
           | I'd imagine everyone else has an agenda, a schedule and
           | ordinary life which is pre-written, so the game actually
           | _wants_ to tell an interesting story, but as a player, you
           | can choose to listen or just do all shenanigans in the
           | context of the game you can come up with. If you want to be a
           | farmer instead of defeating the dark overlord, so be it --
           | until the world ends because the overlord has achieved his
           | goals with no resistance, or maybe someone else becomes a
           | hero instead. ...Gosh, the more I think about it the more
           | awesome it gets.
           | 
           | Edit: just one more! Imagine actually having to complete
           | quests in a given amount of time, because the rest of the
           | world continues to revolve. People being mad at you because
           | you left their children to die in the dungeon after arriving
           | a day too late, because you were busy running an errand for
           | someone else.
        
             | spacemanspiff01 wrote:
             | What I have been thinking would be really cool and fun to
             | try would be to remake Zork with LLMs, speech to text and
             | text to speech.
             | 
             | I think you might be able to do it with a lot of prompting,
             | and having a database that functions like a wiki for the
             | current and past states of the game world.
             | 
             | If you got really fancy, you could also make it a pseudo
             | MMO where the content of the story you create with a
             | character could be used as the basis for a NPC plotline in
             | other people's worlds, possibly reducing the amount of
             | content needed to be written.
             | 
             | If it got popular you could also use it as a research tool,
             | where you could force some subset of the player population
             | to have some interaction, and be able to test and get a
             | dataset for counterfactual reasoning.
             | 
             | The next 5 years will be wild.
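The "database that functions like a wiki" idea above could be sketched roughly as follows. This is a minimal sketch and not something from the thread: `WorldWiki`, `record`, `relevant`, and `build_prompt` are hypothetical names, the keyword lookup stands in for real retrieval, and no LLM or speech API is called. The returned string is only what might be handed to whichever model is used.

```python
from dataclasses import dataclass, field


@dataclass
class WorldWiki:
    """Hypothetical wiki-like store of current and past game-world state."""
    # entity name -> list of (turn, fact) entries, oldest first
    pages: dict = field(default_factory=dict)
    turn: int = 0

    def record(self, entity: str, fact: str) -> None:
        """Append a fact about an entity, stamped with the current turn."""
        self.turn += 1
        self.pages.setdefault(entity.lower(), []).append((self.turn, fact))

    def current(self, entity: str):
        """Return the most recent fact for an entity, or None if unknown."""
        history = self.pages.get(entity.lower())
        return history[-1][1] if history else None

    def relevant(self, player_input: str) -> list:
        """Naive retrieval: entities whose name appears in the input."""
        text = player_input.lower()
        return sorted(name for name in self.pages if name in text)

    def build_prompt(self, player_input: str) -> str:
        """Assemble narrator instructions, matching facts, and the input."""
        lines = ["You are the narrator of a text adventure.", "Known facts:"]
        for name in self.relevant(player_input):
            for turn, fact in self.pages[name]:
                lines.append(f"- [turn {turn}] {name}: {fact}")
        lines.append(f"Player says: {player_input}")
        return "\n".join(lines)


wiki = WorldWiki()
wiki.record("lantern", "The brass lantern is on the trophy case.")
wiki.record("lantern", "The player picked up the lantern.")
wiki.record("troll", "A troll guards the east passage.")
prompt = wiki.build_prompt("light the lantern")
print(prompt)
```

Keeping the full history per entity, rather than only the latest fact, is what lets the prompt carry "past states" as well as the current one.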
        
         | spaceman_2020 wrote:
         | I'm an amateur music producer and vocals are by far the
         | toughest part of making music. I have to find a singer,
         | convince them to work with me (I am an amateur and not
         | particularly good tbh), and book studio space because it's very
         | tough to get a clean recording at home.
         | 
         | I'm hoping that like digital instruments, I'll be able to
         | splice in digital voices instead of finding singers.
        
           | CuriouslyC wrote:
           | Save your money and record at home under a blanket.
        
             | EligibleDecoy wrote:
             | Closet full of jackets works too. That's what a lot of
             | podcasters do when traveling
        
               | QuantumGood wrote:
               | They also do pillow forts, e.g.
               | http://PillowFortStudios.com/
        
             | j45 wrote:
             | Layered towels are surprisingly capable too
        
               | devmor wrote:
               | Or if you want to feel a little fancy, a 3-sided folding
               | project board with foam glued to it.
        
           | grumpymouse wrote:
           | This is already somewhat available (check out Dreamtonics
           | Synthesizer V and the voices like Solaria etc)
        
           | MattRix wrote:
           | This already exists, ex. Audimee: https://audimee.com/
        
             | echelon wrote:
             | And the millions of other RVC websites. Musicfy, Uberduck,
             | Coversai, Kitsai, FakeYou, Voicemyai, Voicify, Bangerapp,
             | Tryreplay, Weightsgg ...
             | 
             | RVC is so easy anyone can spin up a website for it. No
             | moat. Over a hundred thousand trained weights files in the
             | open, so it's easy to bootstrap.
        
               | PaulMest wrote:
               | For anybody else who also hadn't come across the term RVC
               | before:
               | 
               | "The RVC model is a Retrieval-based Voice Conversion
               | system using AI for high-quality voice cloning. It
               | utilizes artificial intelligence to modify or clone
               | voices in real-time." Source:
               | https://speechify.com/blog/rvc-vocal-models/
        
               | greesil wrote:
               | Wow. I'm going to give each family member a password and
               | make them prove that they're real when they call me :)
        
               | sorokod wrote:
               | or go pro and issue a One Time Pad to each of them.
               | 
               | Would make a nice vignette in a film about a dystopian
               | future where video can be generated cheaply and of
               | sufficient quality.
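The family-password idea above can be sketched with single-use codes. Note this is a simpler cousin of a true one-time pad (which XORs a message with random key material), swapped in because it fits a phone call better. `make_pad` and `CallerCheck` are hypothetical names, and handing out the codes securely in person is assumed.

```python
import hmac
import secrets


def make_pad(n_codes: int = 10) -> list:
    """Generate random single-use codes to hand to one family member."""
    return [secrets.token_hex(4) for _ in range(n_codes)]


class CallerCheck:
    """Hypothetical verifier: each code on the pad is accepted at most once."""

    def __init__(self, pad):
        self._unused = dict(enumerate(pad))  # index -> code not yet spent

    def verify(self, index: int, spoken_code: str) -> bool:
        # pop() spends the slot whether or not the answer is right,
        # so a wrong guess also burns that code.
        expected = self._unused.pop(index, None)
        if expected is None:
            return False
        # Constant-time comparison of the two codes.
        return hmac.compare_digest(expected.encode(), spoken_code.encode())


pad = make_pad(3)
check = CallerCheck(pad)
first = check.verify(0, pad[0])    # correct code, first use
replay = check.verify(0, pad[0])   # same code again is rejected
print(first, replay)
```

A recorded or cloned voice repeating an old code fails the check, which is the property a fixed family password lacks.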
        
               | esafak wrote:
               | Great, so the competition is going to yield cheap
               | services with a nice UX.
        
               | causality0 wrote:
               | Now if only the prices would drop and I could start
               | making custom audiobooks for the price of regular
               | audiobooks.
        
             | spaceman_2020 wrote:
             | There's an English language and accent bias in most current
             | models
        
           | tomduncalf wrote:
           | https://voice-swap.ai/ may be of interest, you can convert
           | your rough singing to use a real singer's voice (apparently!
           | I've not tried it)
        
         | anonylizard wrote:
         | It'll affect linear entertainment far before it impacts games
         | seriously (Especially the 'full immersive' games where you chat
         | with AI agents)
         | 
         | AI is still too expensive and performance intensive to run in
         | games cost effectively, and truly powerful AI is probably
         | another 10-100x cost increase.
         | 
         | On the other hand, novels will be rapidly replaced by visual
         | novels. The cost of having a novel fully illustrated and voiced
         | will go down 1000x. A high quality illustration used to cost
         | $500-$1000 (A day's work from a high-tier commercial artist),
         | soon it will be about $0.5. I'm not counting in the author's
         | time to prompt the images, because it would have cost them
         | way more time to communicate with the illustrator anyways.
         | 
         | The entire boundary between novels, comics, cartoons etc will
         | blur. Like if a newly written Harry Potter can have thousands
         | of illustrations set in Hogwarts and be fully voiced, the
         | standards for a movie adaptation will be astronomically high,
         | which will in turn drive AI use in movie production just to
         | keep up.
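As a quick check on the figures quoted above, using only the numbers in the comment (`cost_reduction` is a hypothetical helper, and voicing costs are not covered): going from $500-$1000 to about $0.5 per illustration is a 1000x to 2000x reduction.

```python
def cost_reduction(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    return old_cost / new_cost


low = cost_reduction(500, 0.5)    # cheaper end of the quoted range
high = cost_reduction(1000, 0.5)  # pricier end of the quoted range
print(low, high)
```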
        
           | dragonwriter wrote:
           | > It'll affect linear entertainment far before it impacts
           | games seriously
           | 
           | Maybe. But I think you make the mistake of considering games
           | that combine existing AAA features + AI as where it will
           | first impact games, where I think it will first make its mark
           | in games that _don't_ use hardware heavily for 3d rendering
           | by opening up new modes of gaming.
           | 
           | > novels will be rapidly replaced by visual novels. The cost
           | of having a novel fully illustrated and voiced will go down
           | 1000x.
           | 
           | The cost of having art made isn't the only reason novels
           | aren't fully illustrated now, and AI doesn't impact any of
           | the others.
        
         | codetrotter wrote:
         | > react to your words and actions in a believable, fully
         | immersive manner
         | 
         | "I'm sorry, but as an ethically trained AI I cannot engage in
         | this sword fight. Violence is never the answer."
         | 
         | Yeah. It's gonna be very immersive :P
        
           | dvngnt_ wrote:
           | you can use an uncensored model. not everything will be
           | connected to open ai.
           | 
           | I could see Microsoft making the first move next generation
           | since they're knee deep in it.
        
         | ekianjo wrote:
         | Back to Descartes
        
         | molave wrote:
         | > The more rational voices in my mind, though, become more and
         | more afraid of a world where the only thing you can trust is
         | people sitting right in front of you. That makes the world of
         | information pretty small again.
         | 
         | Gives me cosmological analogies: in the far enough future, the
         | only things you can see in the night sky are the members of the
         | local supercluster.
        
         | chunky1994 wrote:
         | I used to be an LLM until I took an arrow to the knee. Aside
         | from the joke, I think the barrier would definitely reduce and
         | in-game characters would be contextually far more aware but how
         | do you enforce plot progression in such a truly open world? Can
         | you control the boundary of LLM expression?
        
           | dragonwriter wrote:
           | > but how do you enforce plot progression in such a truly
           | open world?
           | 
           | I'd imagine a mixture of general prompting/training the LLM
           | on techniques like those used for plot progression by human
           | game masters in TTRPGs and guidance via systems tracking
           | progress and injecting contextual prompts based on mechanisms
           | like those used for tracking and guiding plot progression in
           | GM-less/GM-replacement systems (e.g., the Mythic Game Master
           | Emulator) for TTRPGs.
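The "systems tracking progress and injecting contextual prompts" part could look roughly like this. It is a minimal sketch under stated assumptions, not the Mythic Game Master Emulator's actual mechanics: `PlotTracker`, its keyword matching, and the `patience` threshold are all hypothetical, and the returned string is merely what would be appended to the model's instructions.

```python
from dataclasses import dataclass


@dataclass
class PlotTracker:
    beats: list           # ordered keywords for plot beats the story should reach
    patience: int = 3     # idle turns tolerated before nudging the model
    position: int = 0     # index of the next beat to reach
    idle_turns: int = 0   # turns since the last beat was reached

    def observe(self, event: str) -> None:
        """Advance if the event mentions the next beat, else count an idle turn."""
        reached = (self.position < len(self.beats)
                   and self.beats[self.position] in event.lower())
        if reached:
            self.position += 1
            self.idle_turns = 0
        else:
            self.idle_turns += 1

    def injected_prompt(self) -> str:
        """Return extra guidance for the model, or '' if none is needed."""
        if self.position >= len(self.beats):
            return ""  # story finished, nothing left to steer toward
        if self.idle_turns >= self.patience:
            return f"Steer the scene toward: {self.beats[self.position]}."
        return ""


tracker = PlotTracker(beats=["map", "castle"], patience=2)
tracker.observe("The player finds a torn map.")
tracker.observe("The player goes fishing.")
tracker.observe("The player takes up farming.")
nudge = tracker.injected_prompt()
print(nudge)
```

The nudge only appears after the player has wandered for a while, so the model is free to improvise until the plot actually stalls.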
        
         | huytersd wrote:
         | It would be fantastic to put a bunch of different LLMs in a
         | game map with "senses" fulfilled by multimodal inputs and
         | agency to carry out actions within the game's universe. With a
         | goal such as make the most money or rule the most kingdoms, it
         | would be super interesting to see how it self organizes.
        
         | vlovich123 wrote:
         | The amount of context needed would require quite a bit of novel
         | R&D that doesn't exist on the horizon yet. I think it's more
         | realistic that it'll be a mixture of real & fake in the interim
         | (e.g. the LLM will record important game state changes /
         | information & then use that as context but it'll still forget a
         | bunch of things you'd expect a human to).
        
         | nrjames wrote:
         | Part of me looks forward to these experiences. A larger part of
         | me already mourns the decline in purely human creativity that
         | they suggest. Perhaps AI models will make the perfect game or
         | music or write the perfect novel. I'm certain they'll be
         | programmed to reproduce the funky jank that humans bring to
         | everything they create. At some level, though, we each have our
         | own story that we want to tell and when there are so many
         | automated voices in the room, it's going to be harder to tell
         | those stories.
        
         | holoduke wrote:
         | Definitely agree with your last point, but regarding games with
         | endless possibilities. Mm, I'd rather have a game that is
         | created by someone with a clear goal in mind. One with
         | boundaries. A good single player experience. Like for example
         | Alan Wake 2.
        
       | nathanfig wrote:
       | Multi input? Infilling? First generative audio model I've seen
       | that starts to close the gap with image models.
        
       | novolunt wrote:
       | I think that for artificial intelligence to become like humans,
       | it should be treated under the same conditions as natural humans.
       | It should be able to see the surrounding environment, listen to
       | the surrounding sounds, smell the surrounding smells, and taste
       | the surrounding food. It should be given parents and relatives,
       | and should be given its own partner and its own country. In this
       | way, the artificial intelligence trained in the environment will
       | naturally be more like human beings and have their own emotions.
        
         | kevindamm wrote:
         | What you're looking for is embodiment, and actually this was
         | explicitly left aside in a recent paper that attempts to give
         | measurement criteria for AGI[0]. But I agree with you that the
         | entire lived sensation is critical to approaching any objective
         | involving alignment.
         | 
         | [0] https://arxiv.org/abs/2311.02462
        
         | empath75 wrote:
         | What if the "environment" for an AI is just "the internet".
         | 
         | A long time ago, there was a great story in a Shadowrun
         | supplement of all things about a hacker that got trained to
         | teach an "ai" how to break into computers. It was basically at
         | a child's level, emotionally, and the only world it's ever
         | known was "the matrix" (yes, really -- and written almost a
         | decade before The Matrix came out). Eventually it turns out
         | that it's not an ai, but a corporation was stealing kids and
         | sticking them in Virtual Reality at birth to train a team of
         | super hackers.
        
       | youssefabdelm wrote:
       | Have weights been released?
       | 
       | Edit: nvm, seems not from this line "In the coming weeks, we will
       | be opening up the application here, along with an interactive
       | demo that will showcase Audiobox's capabilities."
        
       | maroonblazer wrote:
       | > We're inviting researchers and institutions who have been
       | previously involved in speech research, and who want to pursue
       | responsibility and safety research on the latest Audiobox models,
       | to apply.
       | 
       | It's not clear as to what the expected outcome of this
       | 'responsibility and safety research' effort is. Is the idea to
       | nerf the tech such that it can't be used for morally/ethically
       | nefarious purposes? If so, then is the "speech research"
       | community the group best fit to do that work?
        
       | david_draco wrote:
       | Admirable, how they release the training data openly, after they
       | put so much effort into recording these audio data themselves.
       | Not.
        
       | varunytoons wrote:
       | This is a fantastic new development in the AI Audio space! However,
       | it's quite disappointing that the model is closed source.
       | Nonetheless, Alibaba's equivalent was released earlier in Nov and
       | it's open-sourced! https://github.com/QwenLM/Qwen-Audio
       | 
       | Does anyone have suggestions for how to integrate this into your
       | tech stack via an internal API? Interested to hear the varying
       | thoughts on this. From what I loosely understand, the model
       | weights have to be swapped or altered per se to be able to
       | commercially reuse this. Correct me if I'm wrong.
        
         | two_in_one wrote:
         | Thanks for the link. License is clear: "Researchers and
         | developers are free to use the codes and model weights of both
         | Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial
         | use."
         | 
         | And, importantly, if you have more than 100m active users :)
         | "4. Restrictions If you are commercially using the Materials,
         | and your product or service has more than 100 million monthly
         | active users, You shall request a license from Us"
         | 
         | So, looks like it's absolutely fine to use, except for IT
         | behemoths.
         | 
         | As for how to use, API, I think. Interesting applications are
         | possible. Like interactive mobile robots. Assistants for people
         | with disabilities, both software and wearable.
         | 
         | Interesting times... this will be called AI revolution
         | probably. It's already not a joke, after several ups and downs.
        
       | mmaunder wrote:
       | I think the release of closed source models right now is a net
       | negative and worth opposing. Right now we're building a future
       | where the very wealthy and powerful will control access to AI on
       | ethical grounds, while they have uncensored access to the latest
       | and most powerful models. Innovation, high frequency trading,
       | medical breakthroughs, creative output - all of these and more
       | will be enhanced by AI, and you'll be eating leftovers and paying
       | a fortune for them, wondering why you can't keep up - unless we
       | enable a vibrant open source ecosystem, and force big tech to
       | release models into that ecosystem.
       | 
       | Support open source models by celebrating their release and
       | pressuring companies to release them, and oppose closed source AI
       | or face a very bleak future for you and your descendants.
       | 
       | You may be having fun with "Open" AI's API today, but you're
       | supporting and celebrating the collapse of society into megacap
       | AI elites and a majority paying for metered access to old
       | technology.
        
         | s3p wrote:
         | I mean sure.. but imagine if this were open sourced as is. This
         | is new tech that has barely had time to mature. The
         | possibilities for abuse are endless. I for one am happy that
         | this model isn't being open sourced. This is an excellent way
         | for people to generate all kinds of disturbing and fake audio
         | clips.
        
           | mmaunder wrote:
           | Same logic could be applied to Linux by Microsoft in the 90s.
           | In fact, the "It's for your safety" has been applied to some
           | of the worst things humans have perpetrated including
           | apartheid (which I lived through) and the holocaust. And it's
           | always those that claim to keep us safe doing the worst. And
           | it continues with perceived dangers providing pretext and
           | moral authority to do bad things.
        
             | beebeepka wrote:
             | "If we were to release our kettle and chickens, they would
             | go extinct within a few years."
             | 
             | "I just hold on to all the money, 'cause bitches can't be
             | trusted with it. We pool all the kissing money together,
             | see? But if you wanna buy anything, you just talk to the
             | bottom bitch, and then the bottom bitch talks to me.
             | 
             | Do you know what I am saying?"
        
               | mmaunder wrote:
               | Choose OSS, or put that mouth to work on someone's AI
               | API.
        
           | Pugpugpugs wrote:
           | Uh yeah, who cares? Give an example of something this audio
           | generator could make that is dangerous.
        
         | axpy906 wrote:
         | A utopian future would be FOSS LLMs that are private and run
         | locally. The opposite of one where models are public,
         | proprietary - running on your data and owned by just a few
         | large entities.
        
       | petarb wrote:
       | What's the best way to try these models out?
       | 
       | Does Meta usually provide a web interface for them or do you have
       | to download and run locally?
        
       ___________________________________________________________________
       (page generated 2023-12-10 23:00 UTC)