[HN Gopher] Audiobox: Meta's new foundation research model for a...
___________________________________________________________________
Audiobox: Meta's new foundation research model for audio generation
Author : reqo
Score : 204 points
Date : 2023-12-07 09:57 UTC (3 days ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| nuz wrote:
| VR is gonna get wild in like 5 years if they keep this up
| spaceman_2020 wrote:
| Have you seen some of the demos people have been building with
| Unreal 5.3? Insane stuff. In a decade, this stuff will be hard
| to tell from reality.
| SheinhardtWigCo wrote:
| Sounds cool - got any specific ones to share?
| dvh wrote:
| https://youtu.be/A7tp4eg0ax8
| jl6 wrote:
| The geometry and lighting are amazing but I couldn't
| detect any animation, like gentle motion due to wind.
| flyaway123 wrote:
| Some motion due to wind, in another video:
| https://youtu.be/_B9hkn6wgNA?feature=shared&t=24
| pants2 wrote:
| That's really good, though I also want to point out some
| of the amazing graphics that modders accomplished in
| Crysis (2008): https://youtu.be/3w6COXBfIY4
| jay-barronville wrote:
| Incredible.
| Racing0461 wrote:
| The new avatar (blue skin not arrow) game looks like this
| demo with some characters tossed in to control.
| prakhar897 wrote:
| https://www.youtube.com/watch?v=IK76q13Aqt0
| nerdix wrote:
| This is why I'm high on the metaverse long term. In ten years,
| there will be a $500 (or whatever the 2033 inflation adjusted
| value is) VR headset that blows the Apple Vision Pro out of the
| water in terms of optics, will run a highly optimized version
| of the latest revision of Llama locally (and it will be much
| better than anything we currently have today), come with wifi 8
| (so it will have multigigabit per second real world
| performance), capable of rendering graphics that look much more
| realistic than Unreal Engine 5 (with high frame rates due to AI
| upscaling and frame generation).
|
| There will be people that will spend almost every waking hour
| with one of those things attached to their face if they can
| also make this device lightweight and comfortable
| doublerabbit wrote:
| And yet I'll still be waiting five weeks to just download the
| world because I'm stuck on 2Mb/s ADSL.
| morbusfonticuli wrote:
| Germany? :-)
| 9dev wrote:
| If I shut down every voice other than the optimist's one in my
| head, this, along with other recent AI research, will mark the
| advent of never-seen-before role play game possibilities. If the
| current pace of progress continues, we'll see games with complete
| narrative freedom for players, where you aren't limited to pre-
| written answers anymore, but can actually talk to in-game
| characters with your actual voice, goals, and motivations. And
| those virtual conversation participants can talk back to you,
| react to your words and actions in a believable, fully immersive
| manner. That's a dream come true for every gamer on the face of
| this earth, I believe.
|
| The more rational voices in my mind, though, become more and more
| afraid of a world where the only thing you can trust is people
| sitting right in front of you. That makes the world of
| information pretty small again.
| logicchains wrote:
| >If the current pace of progress continues, we'll see games
| with complete narrative freedom for players, where you aren't
| limited to pre-written answers anymore, but can actually talk
| to in-game characters with your actual voice, goals, and
| motivations. And those virtual conversation participants can
| talk back to you, react to your words and actions in a
| believable, fully immersive manner. That's a dream come true
| for every gamer on the face of this earth, I believe.
|
| It's basically Dungeons and Dragons with an AI dungeon master
| who can generate video in realtime. Which would be awesome, but
| like Dungeons and Dragons it wouldn't be easy to keep the
| player on track.
| 9dev wrote:
| I'd imagine everyone else has an agenda, a schedule and
| ordinary life which is pre-written, so the game actually
| _wants_ to tell an interesting story, but as a player, you
| can choose to listen or just do all shenanigans in the
| context of the game you can come up with. If you want to be a
| farmer instead of defeating the dark overlord, so be it --
| until the world ends because the overlord has achieved his
| goals with no resistance, or maybe someone else becomes a
| hero instead. ...Gosh, the more I think about it the more
| awesome it gets.
|
| Edit: just one more! Imagine actually having to complete
| quests in a given amount of time, because the rest of the
| world continues to revolve. People being mad at you because
| you left their children to die in the dungeon after arriving
| a day too late, because you were busy running an errand for
| someone else.
| spacemanspiff01 wrote:
| What I have been thinking would be really cool and fun to
| try would be to remake zork with llms, speech to text and
| text to speech.
|
| I think you might be able to do it with a lot of prompting,
| and having a database that functions like a wiki for the
| current and past states of the game world.
|
| If you got really fancy, you could also make it a pseudo
| MMO where the content of the story you create with a
| character could be used as the basis for a NPC plotline in
| other people's worlds, possibly reducing the amount of
| content needed to be written.
|
| If it got popular you could also use it as a research tool,
| where you could force some subset of the player population
| to have some interaction, and be able to test and get a
| dataset for counterfactual reasoning.
|
| The next 5 years will be wild.
| spaceman_2020 wrote:
| I'm an amateur music producer and vocals are by far the
| toughest part of making music. I have to find a singer,
| convince them to work with me (I am an amateur and not
| particularly good tbh), and book studio space because it's very
| tough to get a clean recording at home.
|
| I'm hoping that like digital instruments, I'll be able to
| splice in digital voices instead of finding singers.
| CuriouslyC wrote:
| Save your money and record at home under a blanket.
| EligibleDecoy wrote:
| Closet full of jackets works too. That's what a lot of
| podcasters do when traveling
| QuantumGood wrote:
| They also do pillow forts, e.g.
| http://PillowFortStudios.com/
| j45 wrote:
| Layered towels are surprisingly capable too
| devmor wrote:
| Or if you want to feel a little fancy, a 3-sided folding
| project board with foam glued to it.
| grumpymouse wrote:
| This is already somewhat available (check out Dreamtonics
| Synthesizer V and the voices like Solaria etc)
| MattRix wrote:
| This already exists, ex. Audimee: https://audimee.com/
| echelon wrote:
| And the millions of other RVC websites. Musicfy, Uberduck,
| Coversai, Kitsai, FakeYou, Voicemyai, Voicify, Bangerapp,
| Tryreplay, Weightsgg ...
|
| RVC is so easy anyone can spin up a website for it. No
| moat. Over a hundred thousand trained weights files in the
| open, so it's easy to bootstrap.
| PaulMest wrote:
| For anybody else who also hadn't come across the term RVC
| before:
|
| "The RVC model is a Retrieval-based Voice Conversion
| system using AI for high-quality voice cloning. It
| utilizes artificial intelligence to modify or clone
| voices in real-time." Source:
| https://speechify.com/blog/rvc-vocal-models/
| greesil wrote:
| Wow. I'm going to give each family member a password and
| make them prove that they're real when they call me :)
| sorokod wrote:
| or go pro and issue a One Time Pad to each of them.
|
| Would make a nice vignette in a film about a dystopian
| future where video can be generated cheaply and of
| sufficient quality.
| esafak wrote:
| Great, so the competition is going to yield cheap
| services with a nice UX.
| causality0 wrote:
| Now if only the prices would drop and I could start
| making custom audiobooks for the price of regular
| audiobooks.
| spaceman_2020 wrote:
| An English language and accent bias in most current models
| tomduncalf wrote:
| https://voice-swap.ai/ may be of interest, you can convert
| your rough singing to use a real singer's voice (apparently!
| I've not tried it)
| anonylizard wrote:
| It'll affect linear entertainment far before it impacts games
| seriously (Especially the 'full immersive' games where you chat
| with AI agents)
|
| AI is still too expensive and performance intensive to run in
| games cost effectively, and truly powerful AI is probably
| another 10-100x cost increase.
|
| On the other hand, novels will be rapidly replaced by visual
| novels. The cost of having a novel fully illustrated and voiced
| will go down 1000x. A high quality illustration used to cost
| $500-$1000 (A day's work from a high-tier commercial artist),
| soon it will be about $0.5. I'm not counting in the author's
| time to prompt the images, because it would have cost them
| way more time to communicate with the illustrator anyways.
|
| The entire boundary between novels, comics, cartoons etc will
| blur. Like if a newly written Harry Potter can have thousands
| of illustrations set in Hogwarts and be fully voiced, the
| standards for a movie adaptation will be astronomically high,
| which will in turn drive AI use in movie production just to
| keep up.
| dragonwriter wrote:
| > It'll affect linear entertainment far before it impacts
| games seriously
|
| Maybe. But I think you make the mistake of considering games
| that combine existing AAA features + AI as where it will
| first impact games, where I think it will first make its mark
| in games that _don't_ use hardware heavily for 3d rendering
| by opening up new modes of gaming.
|
| > novels will be rapidly replaced by visual novels. The cost
| of having a novel fully illustrated and voiced will go down
| 1000x.
|
| The cost of having art made isn't the only reason novels
| aren't fully illustrated now, and AI doesn't impact any of
| the others.
| codetrotter wrote:
| > react to your words and actions in a believable, fully
| immersive manner
|
| "I'm sorry, but as an ethically trained AI I cannot engage in
| this sword fight. Violence is never the answer."
|
| Yeah. It's gonna be very immersive :P
| dvngnt_ wrote:
| you can use an uncensored model. not everything will be
| connected to open ai.
|
| I could see Microsoft making the first move next generation
| since they're knee deep in it.
| ekianjo wrote:
| Back to Descartes
| molave wrote:
| > The more rational voices in my mind, though, become more and
| more afraid of a world where the only thing you can trust is
| people sitting right in front of you. That makes the world of
| information pretty small again.
|
| Gives me cosmological analogies: in the far enough future, the
| only things you can see in the night sky are the members of the
| local supercluster.
| chunky1994 wrote:
| I used to be an LLM until I took an arrow to the knee. Aside
| from the joke, I think the barrier would definitely reduce and
| in-game characters would be contextually far more aware but how
| do you enforce plot progression in such a truly open world? Can
| you control the boundary of LLM expression?
| dragonwriter wrote:
| > but how do you enforce plot progression in such a truly
| open world?
|
| I'd imagine a mixture of general prompting/training the LLM
| on techniques like those used for plot progression by human
| game masters in TTRPGs and guidance via systems tracking
| progress and injecting contextual prompts based on mechanisms
| like those used for tracking and guiding plot progression in
| GM-less/GM-replacement systems (e.g., the Mythic Game Master
| Emulator) for TTRPGs.
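
A minimal sketch of the "system tracking progress and injecting contextual prompts" idea from the comment above. Everything here is an assumption for illustration: the `PlotBeat` and `PlotTracker` names, the `nudge_after` threshold, and the prompt wording are hypothetical, and no real LLM API is called. The code only assembles the text a game might send to whatever model it uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlotBeat:
    """One story milestone (hypothetical structure)."""
    name: str
    hint: str            # nudge injected into the prompt while this beat is pending
    done: bool = False


@dataclass
class PlotTracker:
    beats: list          # ordered list of PlotBeat objects
    idle_turns: int = 0  # turns since the player last advanced the plot
    nudge_after: int = 3 # assumed threshold; a real game would tune this

    def record_turn(self, completed: Optional[str] = None) -> None:
        """Mark a beat complete, or count one more turn without progress."""
        for beat in self.beats:
            if beat.name == completed and not beat.done:
                beat.done = True
                self.idle_turns = 0
                return
        self.idle_turns += 1

    def next_beat(self) -> Optional[PlotBeat]:
        """Return the first unfinished beat, or None if the story is over."""
        return next((b for b in self.beats if not b.done), None)

    def build_prompt(self, player_line: str) -> str:
        """Assemble the prompt text, adding a steering hint only when stalled."""
        parts = ["You are the narrator of a fantasy game."]
        beat = self.next_beat()
        if beat is not None and self.idle_turns >= self.nudge_after:
            parts.append(f"Gently steer the scene toward: {beat.hint}")
        parts.append(f"Player says: {player_line}")
        return "\n".join(parts)
```

The design choice in this sketch is that the player is left alone while the story is moving, and the steering hint appears only after several turns without progress, roughly how a human game master might nudge a wandering party.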
| huytersd wrote:
| It would be fantastic to put a bunch of different LLMs in a
| game map with "senses" fulfilled by multimodal inputs and
| agency to carry out actions within the game's universe. With a
| goal such as make the most money or rule the most kingdoms, it
| would be super interesting to see how it self organizes.
| vlovich123 wrote:
| The amount of context needed would require quite a bit of novel
| R&D that doesn't exist on the horizon yet. I think it's more
| realistic that it'll be a mixture of real & fake in the interim
| (e.g. the LLM will record important game state changes /
| information & then use that as context but it'll still forget a
| bunch of things you'd expect a human to).
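
The interim approach described above (record important state changes, feed them back as context, and accept that older details get forgotten) can be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's actual implementation: `GameMemory`, `max_events`, and the `important` flag are hypothetical names, and deciding what counts as important is left to the caller.

```python
from collections import deque


class GameMemory:
    """Keeps only the most recent important events; older ones are forgotten."""

    def __init__(self, max_events: int = 3):
        # A bounded deque silently drops the oldest entry once it is full,
        # which mimics the "forgetting" mentioned in the comment above.
        self.events = deque(maxlen=max_events)

    def record(self, event: str, important: bool) -> None:
        """Store an event only if the caller flagged it as important."""
        if important:
            self.events.append(event)

    def as_context(self) -> str:
        """Render remembered events as a bullet list for an LLM prompt."""
        return "\n".join(f"- {e}" for e in self.events)


memory = GameMemory(max_events=2)
memory.record("Player stole the baker's bread", important=True)
memory.record("Player walked north", important=False)
memory.record("Player befriended the blacksmith", important=True)
memory.record("Player burned the bridge", important=True)
print(memory.as_context())
```

With `max_events=2`, the theft is dropped once two newer important events arrive, so a character prompted with this context would no longer "remember" it.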
| nrjames wrote:
| Part of me looks forward to these experiences. A larger part of
| me already mourns the decline in purely human creativity that
| they suggest. Perhaps AI models will make the perfect game or
| music or write the perfect novel. I'm certain they'll be
| programmed to reproduce the funky jank that humans bring to
| everything they create. At some level, though, we each have our
| own story that we want to tell and when there are so many
| automated voices in the room, it's going to be harder to tell
| those stories.
| holoduke wrote:
| Definitely agree with your last point, but regarding games with
| endless possibilities: I'd rather have a game that is created
| by someone with a clear goal in mind. One with boundaries. A
| good single player experience. Like for example Alan Wake 2.
| nathanfig wrote:
| Multi input? Infilling? First generative audio model I've seen
| that starts to close the gap with image models.
| novolunt wrote:
| I think that for artificial intelligence to become like humans,
| it should be treated under the same conditions as natural humans.
| It should be able to see the surrounding environment, listen to
| the surrounding sounds, smell the surrounding smells, and taste
| the surrounding food. It should be given parents and relatives,
| and should be given its own partner and its own country. In this
| way, the artificial intelligence trained in the environment will
| naturally be more like human beings and have their own emotions.
| kevindamm wrote:
| What you're looking for is embodiment, and actually this was
| explicitly left aside in a recent paper that attempts to give
| measurement criteria for AGI[0]. But I agree with you that the
| entire lived sensation is critical to approaching any objective
| involving alignment.
|
| [0] https://arxiv.org/abs/2311.02462
| empath75 wrote:
| What if the "environment" for an AI is just "the internet".
|
| A long time ago, there was a great story in a Shadowrun
| supplement of all things about a hacker that got trained to
| teach an "ai" how to break into computers. It was basically at
| a child's level, emotionally, and the only world it's ever
| known was "the matrix" (yes, really -- and written almost a
| decade before The Matrix came out). Eventually it turns out
| that it's not an ai, but a corporation was stealing kids and
| sticking them in Virtual Reality at birth to train a team of
| super hackers.
| youssefabdelm wrote:
| Have weights been released?
|
| Edit: nvm, seems not from this line "In the coming weeks, we will
| be opening up the application here, along with an interactive
| demo that will showcase Audiobox's capabilities."
| maroonblazer wrote:
| > We're inviting researchers and institutions who have been
| previously involved in speech research, and who want to pursue
| responsibility and safety research on the latest Audiobox models,
| to apply.
|
| It's not clear as to what the expected outcome of this
| 'responsibility and safety research' effort is. Is the idea to
| nerf the tech such that it can't be used for morally/ethically
| nefarious purposes? If so, then is the "speech research"
| community the group best fit to do that work?
| david_draco wrote:
| Admirable, how they release the training data openly, after they
| put so much effort into recording these audio data themselves.
| Not.
| varunytoons wrote:
| This is a fantastic new development in the AI Audio space! However,
| it's quite disappointing that the model is closed source.
| Nonetheless, Alibaba's equivalent was released earlier in Nov and
| it's open-sourced! https://github.com/QwenLM/Qwen-Audio
|
| Does anyone have suggestions for how to integrate this into your
| tech stack via an internal API? Interested to hear the varying
| thoughts on this. From what I understand, the model weights
| have to be swapped or altered before this can be reused
| commercially. Correct me if I'm wrong.
| two_in_one wrote:
| Thanks for the link. License is clear: Researchers and
| developers are free to use the codes and model weights of both
| Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial
| use.
|
| And, importantly, if you have more than 100m active users :)
| "4. Restrictions: If you are commercially using the Materials,
| and your product or service has more than 100 million monthly
| active users, You shall request a license from Us"
|
| So, looks like it's absolutely fine to use, except for IT
| behemoths.
|
| As for how to use, API, I think. Interesting applications are
| possible. Like interactive mobile robots. Assistants for people
| with disabilities, both software and wearable.
|
| Interesting times... this will be called AI revolution
| probably. It's already not a joke, after several ups and downs.
| mmaunder wrote:
| I think the release of closed source models right now is a net
| negative and worth opposing. Right now we're building a future
| where the very wealthy and powerful will control access to AI on
| ethical grounds, while they have uncensored access to the latest
| and most powerful models. Innovation, high frequency trading,
| medical breakthroughs, creative output - all of these and more
| will be enhanced by AI, and you'll be eating leftovers and paying
| a fortune for them, wondering why you can't keep up - unless we
| enable a vibrant open source ecosystem, and force big tech to
| release models into that ecosystem.
|
| Support open source models by celebrating their release and
| pressuring companies to release them, and oppose closed source AI
| or face a very bleak future for you and your descendants.
|
| You may be having fun with "Open" AI's API today, but you're
| supporting and celebrating the collapse of society into megacap
| AI elites and a majority paying for metered access to old
| technology.
| s3p wrote:
| I mean sure.. but imagine if this were open sourced as is. This
| is new tech that has barely had time to mature. The
| possibilities for abuse are endless. I for one am happy that
| this model isn't being open sourced. This is an excellent way
| for people to generate all kinds of disturbing and fake audio
| clips.
| mmaunder wrote:
| Same logic could be applied to Linux by Microsoft in the 90s.
| In fact, the "It's for your safety" has been applied to some
| of the worst things humans have perpetrated including
| apartheid (which I lived through) and the holocaust. And it's
| always those that claim to keep us safe doing the worst. And
| it continues with perceived dangers providing pretext and
| moral authority to do bad things.
| beebeepka wrote:
| "If we were to release our kettle and chickens, they would
| go extinct within a few years."
|
| "I just hold on to all the money, 'cause bitches can't be
| trusted with it. We pool all the kissing money together,
| see? But if you wanna buy anything, you just talk to the
| bottom bitch, and then the bottom bitch talks to me.
|
| Do you know what I am saying?"
| mmaunder wrote:
| Choose OSS, or put that mouth to work on someone's AI
| API.
| Pugpugpugs wrote:
| Uh yeah, who cares? Give an example of something this audio
| generator could make that is dangerous.
| axpy906 wrote:
| A utopian future would be FOSS LLMs that are private and run
| locally. The opposite of one where models are public,
| proprietary - running on your data and owned by just a few
| large entities.
| petarb wrote:
| What's the best way to try these models out?
|
| Does Meta usually provide a web interface for them or do you have
| to download and run locally?
___________________________________________________________________
(page generated 2023-12-10 23:00 UTC)