[HN Gopher] Chameleon: Meta's New Multi-Modal LLM
___________________________________________________________________
Chameleon: Meta's New Multi-Modal LLM
Author : gabrielbirnbaum
Score : 202 points
Date : 2024-05-21 01:37 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| elijahbenizzy wrote:
| What's cool (and tough to keep up with) about this wave of tech
| is just how quickly it moves.
|
| On the plus side there are a lot of interesting things, and it's
| generally easy to follow/figure out what they did.
|
| On the minus side it's a little exhausting, and there's so much
| money in it that it feels like the vast majority of it is
| grifting. To add to that, the people who are trying to catalog
| it (the AI guy on LI) are the griftiest of them all.
|
| I've found the best way to keep up is to find one topic you want
| to learn about, deep-dive, and read all the related papers, then
| explore breadth-first from there until you find another topic...
| randmeerkat wrote:
| > On the minus side it's a little exhausting, and there's so
| much money in it that it feels like the vast majority of it is
| grifting.
|
| It is all grifting. The moment someone creates something that
| can improve upon itself, there will be an intelligence
| explosion, and it won't need press releases or debates about its
| intelligence. The current path of research will not lead to
| that; if there were something to be discovered here, it would
| have been discovered already. It's just new to the general
| consumer, and there's a wow factor associated with it, like
| crypto and NFTs before it. The truth is tech has lost its
| momentum and is desperate to find a new trick.
| elijahbenizzy wrote:
| I think the opposite. There is value in intelligent software,
| but IMO we're a long way from AGI. So lots of grifting but
| some gold along the way. And it's intellectually
| interesting/nuanced (cool math, interesting infra), unlike
| crypto, which was more of a massive energy-burning waste than
| anyone likes to admit.
| sanxiyn wrote:
| If our standard is cool math, crypto is also full of cool
| math. (Have you read Vitalik's explanation of Quadratic
| Arithmetic Programs?) Our standard can't be that low.
|
| https://medium.com/@VitalikButerin/quadratic-arithmetic-
| prog...
| Kuinox wrote:
| The rate of improvement is important. If it's only as
| intelligent as the average human, the rate of improvement will
| be slow, very slow compared to what researchers can do
| currently.
| 361994752 wrote:
| It's still in the early phase where the bubble is building up.
| This is necessary if we want a prosperous market. Hopefully,
| after the bubble bursts, some good companies will remain (which
| is very likely).
| polskibus wrote:
| How do you keep your knowledge after a deep dive? Do you try to
| use it somehow? I've found that reading a lot usually does not
| contribute to long-term proficiency in a given topic unless it's
| followed by a non-trivial amount of practice.
| ljlolel wrote:
| It also matches or exceeds the performance of much larger models,
| including Gemini Pro and GPT-4V, according to human judgments on
| a new long-form mixed-modal generation evaluation, where either
| the prompt or outputs contain mixed sequences of both images and
| text. Chameleon marks a significant step forward in a unified
| modeling of full multimodal documents.
| krasin wrote:
| Relevant thread on /r/LocalLLaMA ([1]). A notable quote from the
| comments:
|
| > There's a Twitter thread from one of the authors ([2]). This
| part seems pretty important: "The models in this paper were done
| training 5 months ago. We've progressed significantly since
| then."
|
| 1.
| https://www.reddit.com/r/LocalLLaMA/comments/1ctsala/newly_p...
|
| 2. https://x.com/ArmenAgha/status/1791275549815648473
| vessenes wrote:
| There's some pretty nice fundamental research in here, and I
| appreciate the publication very much. What stood out to me is
| their discussion of the difficulties of using softmax across
| different tokenization spaces; super interesting analysis (they
| say different modalities compete by upping their own strength
| relative to other modalities, leading to divergence), and the
| ultimate fix (I can't remember it right now, and leave it as a
| tease for the interested paper reader).
|
| They also noted the problem was most pronounced once they got up
| to 34B size. It's a good reminder that training large-scale
| models leads to interesting new problems. I imagine a lot of
| techniques and know-how are not published; all those little bits
| of experience add up to a lot of competitive advantage in many
| venues, so once again, thanks to Zuck and co for publishing.
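|
| To make the competition concrete, here's a toy sketch (mine, not
| from the paper) of how one modality's growing logit scale crowds
| the other out of a shared softmax:
|
|     import torch
|
|     # as one modality's logits grow in scale, it absorbs nearly
|     # all of the shared softmax's probability mass, starving the
|     # other modality's tokens of gradient signal
|     torch.manual_seed(0)
|     text_logits = torch.randn(1000)
|     image_logits = torch.randn(1000)
|     for scale in [1.0, 4.0, 16.0, 64.0]:
|         merged = torch.cat([scale * text_logits, image_logits])
|         text_mass = merged.softmax(dim=-1)[:1000].sum().item()
|         print(f"scale={scale:>5}: text mass={text_mass:.4f}")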
| md_rumpf wrote:
| the modality competition was one of my favorite insights, too!
| md_rumpf wrote:
| every 3rd sentence is "the model was not trained on data from
| Meta's products"
| keyle wrote:
| That makes sense. You probably don't want to train your LLM on
| your uncle's dubious claims about the government, flat-earther
| content, and the like :)
| gdiamos wrote:
| Does Meta plan to open source these models?
| msoad wrote:
| Compared to Mirasol3B [1], this doesn't support audio as a
| modality. What Google has done with Mirasol3B made the "Astra"
| demo at Google I/O possible. They do a little cheating by
| converting audio to images (spectrograms) and video to ~25
| frames, with some sort of attention mechanism over the things
| that change across those frames. So the tokenizer is basically
| the same for audio, video, and images.
|
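| Roughly, the audio half looks like this sketch (mine, not
| Mirasol's exact pipeline; the file name is a placeholder):
|
|     import torchaudio
|
|     # "clip.wav" is a placeholder mono clip
|     waveform, sr = torchaudio.load("clip.wav")
|     # log-mel spectrogram: the audio becomes a 2D, image-like
|     # tensor that an image-style patch tokenizer can consume
|     mel = torchaudio.transforms.MelSpectrogram(
|         sample_rate=sr, n_mels=128
|     )(waveform)
|     img = mel.clamp(min=1e-5).log()
|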
| I believe Meta is going in this direction with multimodality as
| well. The new GPT voice mode is probably using the same
| architecture.
|
| What's mind-boggling is that models perform better at the same
| parameter size with a new modality added to them!
|
| It seems obvious that 3D is the next modality.
|
| [1] https://arxiv.org/pdf/2311.05698
| kriro wrote:
| Only browsed, but this is really interesting and I'm glad it was
| published.
|
| I understand why a unified model is an interesting thing to work
| on, but doesn't the discovery of "modality competition" suggest
| that, at least short-term, it might be even better to train
| specialized models for each modality plus some sort of modality
| supervisor (a glue-code model; see the toy sketch below)?
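|
| Purely illustrative (the modality detection and the specialist
| models are stand-ins):
|
|     from typing import Callable, Dict
|
|     def detect_modality(x: bytes) -> str:
|         # placeholder heuristic; a real supervisor might be a
|         # small learned classifier instead of magic bytes
|         if x[:4] == b"\x89PNG":
|             return "image"
|         if x[:4] == b"RIFF":
|             return "audio"
|         return "text"
|
|     specialists: Dict[str, Callable[[bytes], str]] = {
|         "text": lambda x: "text-model output",
|         "image": lambda x: "image-model output",
|         "audio": lambda x: "audio-model output",
|     }
|
|     def supervise(x: bytes) -> str:
|         # the "glue": route each input to its specialist model
|         return specialists[detect_modality(x)](x)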
| dankle wrote:
| Are they downloadable?
| mjburgess wrote:
| Am I reading this correctly:
|
| Training time was 4,282,407 GPU-hours. At, conservatively, 200 W
| per GPU, that's (4282407 * 200) / 1e9 GWh ~= 0.86 GWh, call it 1
| GWh. At 10c/kWh that's ~$100,000?
|
| So if you have a single equivalent GPU at home, that's ~500
| years of training time and $100k in energy costs. Or, in
| practice, ~3,000 GPUs for 2 months.
|
| The AI industry has to hope the world doesn't change fast enough
| for these models to be useless.
|
| EDIT: price is $100k
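|
| Sanity check, in Python:
|
|     gpu_hours = 4_282_407           # training time from the paper
|     watts = 200                     # conservative per-GPU draw
|     kwh = gpu_hours * watts / 1000  # ~856,481 kWh ~= 0.86 GWh
|     cost = kwh * 0.10               # ~$85,648, i.e. roughly $100k
|     print(f"{kwh:,.0f} kWh -> ${cost:,.0f}")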
| TaylorAlexander wrote:
| Thanks for the figures. I suppose with expenses like that, they
| will be motivated to research methods of updating models which
| have already been trained.
|
| Edit: I see the price was updated
| hackerlight wrote:
| 1 GWh is 1 million kWh; multiplied by $0.10/kWh that should give
| $100k in energy costs?
| mjburgess wrote:
| Yes, thanks. I had assumed I was off by a factor somewhere.
| Still, $100k seems small -- the total cost of production is in
| the $10M+ range.
| Etheryte wrote:
| $100k is small, but you only get away with $100k if you nail
| everything perfectly the first time around -- something that we
| all know does not really happen. I think compiling is a good
| parallel to training: imagine if compiling your whole software
| project from scratch cost $100k. Sure, there are incremental
| builds etc., but the cost is steep no matter which way you look
| at it.
| jsheard wrote:
| Numbers like these really don't bode well for the longer-term
| prospects of open-source models; I doubt the current strategy of
| waiting expectantly for a corporation to spoonfeed us yet
| another $100,000 model for free is going to work forever.
|
| That $100k is conservative, too: it doesn't include the cost of
| buying/renting the hardware, the compute time spent on
| experimental training runs, the cost of data acquisition,
| labeling, and cleaning, or the cost of RLHF fine-tuning.
| mikehollinger wrote:
| > Numbers like these really don't bode well for the longer-term
| prospects of open-source models; I doubt the current strategy
| of waiting expectantly for a corporation to spoonfeed us yet
| another $100,000 model for free is going to work forever.
|
| I would add "in their current form" and agree. There are three
| things that can change here:
|
| 1. Moore's law: the worldwide economy is built around the steady
| progression of cheaper compute. Give it 36 months and your
| $100,000 problem becomes a $25,000 problem.
|
| 2. Quantization and smaller models: there'll likely be
| specializations of the various models (is this the beginning of
| the "monolith vs. microservices" debate?).
|
| 3. E2E training isn't for everyone: fine-tunes and alignment are
| more important than an end-to-end training run, IF we can coerce
| the behaviors we want into the models by fine-tuning them. That,
| along with quantized models, (imho) unlocked vision models,
| which are now in the "plateau of productivity" on the Gartner
| hype cycle compared to a few years ago.
|
| So as an example today, I can grab a backbone and pretrained
| weights for an object detector, and with relatively little data
| (from a few lines to a few tens of lines of code, and 50 to 500
| images) and relatively little wall-clock time and energy (say 5
| to 15 minutes) on a PC, I can create a customized object
| detector that detects -my- specific objects pretty well (see the
| sketch below). I might need to revise it a few times, but it'll
| work pretty well.
|
| Why would we not see the same sort of progression with
| transformer architectures? It hinges on someone creating the
| model weights for the "greater good," or on us figuring out how
| to do distributed training for open source in a "seti@home"
| style (long live the blockchain, anyone?).
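|
| The sketch (torchvision; the data loader and class count are
| placeholders you'd supply):
|
|     import torch
|     import torchvision
|     from torchvision.models.detection.faster_rcnn import (
|         FastRCNNPredictor,
|     )
|
|     # start from pretrained COCO weights, swap the box head for
|     # my own classes
|     model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
|         weights="DEFAULT"
|     )
|     num_classes = 3  # background + my object types (placeholder)
|     in_features = \
|         model.roi_heads.box_predictor.cls_score.in_features
|     model.roi_heads.box_predictor = FastRCNNPredictor(
|         in_features, num_classes
|     )
|
|     # short fine-tune; data_loader yields (images, targets) for
|     # ~50-500 labeled images and is assumed, not shown
|     optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
|                                 momentum=0.9)
|     model.train()
|     for images, targets in data_loader:
|         loss_dict = model(images, targets)  # detection losses
|         loss = sum(loss_dict.values())
|         optimizer.zero_grad()
|         loss.backward()
|         optimizer.step()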
| jsheard wrote:
| Yeah, there's no accounting for breakthroughs in training
| efficiency. I wouldn't count on Moore's Law though; the amount
| of compute you can put into these problems is effectively
| unbounded, so more efficient silicon just means those with money
| can train even bigger models. 3D rendering is a decent analogy:
| Moore's Law has made it easy to render something comparable to
| the first Toy Story movie, but Pixar poured those gains back
| into more compute and is using it to do things you definitely
| can't afford to.
___________________________________________________________________
(page generated 2024-05-21 12:00 UTC)