[HN Gopher] Chameleon: Meta's New Multi-Modal LLM
       ___________________________________________________________________
        
       Chameleon: Meta's New Multi-Modal LLM
        
       Author : gabrielbirnbaum
       Score  : 202 points
       Date   : 2024-05-21 01:37 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | elijahbenizzy wrote:
       | What's cool (and tough to keep up with) with this wave of tech is
       | just how quickly it moves.
       | 
        | On the plus side there are a lot of interesting things, and it
        | is generally easy to follow/figure out what they did.
        | 
        | On the minus side it's a little exhausting, and there's so much
        | money in it that it feels like the vast majority of it is
        | grifting. To add to that, the people who are trying to catalog
        | it (the AI guy on LI) are the griftiest of them all.
        | 
        | I've found the best way to keep up is to find one topic you want
        | to learn about, deep-dive, and read all the related papers, then
        | explore breadth-first from there until you find another topic...
        
         | randmeerkat wrote:
          | > On the minus side it's a little exhausting, and there's so
          | much money in it that it feels like the vast majority of it is
          | grifting.
         | 
          | It is all grifting. The moment someone creates something that
          | can improve upon itself, there will be an intelligence
          | explosion, and it won't need press releases or debates about
          | its intelligence. The current path of research will not lead
          | to that; if there were something to be discovered here, it
          | would have been discovered already. It's just new to the
          | general consumer, and there's a wow factor associated with it,
          | like crypto and NFTs before it. The truth is tech has lost its
          | momentum and is desperate to find a new trick.
        
           | elijahbenizzy wrote:
            | I think the opposite. There is value in intelligent software,
            | but IMO we're a long way from AGI. So lots of grifting, but
            | some gold along the way. And it's intellectually
            | interesting/nuanced (cool math, interesting infra), unlike
            | crypto, which was more of a massive energy-burning waste than
            | anyone likes to admit.
        
             | sanxiyn wrote:
             | If our standard is cool math, crypto is also full of cool
             | math. (Have you read Vitalik's explanation of Quadratic
             | Arithmetic Programs?) Our standard can't be that low.
             | 
             | https://medium.com/@VitalikButerin/quadratic-arithmetic-
             | prog...
        
           | Kuinox wrote:
            | The rate of improvement is important. If it's only as
            | intelligent as the average human, its rate of self-improvement
            | will be slow, very slow compared to what researchers can do
            | currently.
        
         | 361994752 wrote:
         | It's still in the early phase where the bubble is building up.
         | This is necessary if we want a prosperous market. Hopefully,
         | after the bubble bursts, some good companies will remain (which
         | is very likely).
        
         | polskibus wrote:
          | How do you keep your knowledge after a deep dive? Do you try
          | to use it somehow? I've found that reading a lot usually does
          | not contribute to long-term proficiency in a given topic
          | unless it is followed by a non-trivial amount of practice.
        
       | ljlolel wrote:
       | It also matches or exceeds the performance of much larger models,
       | including Gemini Pro and GPT-4V, according to human judgments on
       | a new long-form mixed-modal generation evaluation, where either
       | the prompt or outputs contain mixed sequences of both images and
       | text. Chameleon marks a significant step forward in a unified
       | modeling of full multimodal documents.
        
       | krasin wrote:
       | Relevant thread on /r/locallama ([1]). A relevant quote from the
       | comments:
       | 
       | > There's a Twitter thread from one of the authors ([2]). This
       | part seems pretty important: "The models in this paper were done
       | training 5 months ago. We've progressed significantly since
       | then."
       | 
       | 1.
       | https://www.reddit.com/r/LocalLLaMA/comments/1ctsala/newly_p...
       | 
       | 2. https://x.com/ArmenAgha/status/1791275549815648473
        
       | vessenes wrote:
        | There's some pretty nice fundamental research in here, and I
        | appreciate the publication very much. What stood out to me is
        | their discussion of the difficulties of using softmax across
        | different tokenization spaces; super interesting analysis (they
        | say different modalities compete by upping their own strength
        | relative to other modalities, leading to divergence), and the
        | ultimate fix (I can't remember it right now, and leave this as a
        | tease to the interested paper reader).
        | 
        | They also noted the problem was most pronounced once they got up
        | to the 34B size. It's a good reminder that training large-scale
        | models leads to new and interesting problems. I imagine a lot of
        | techniques and know-how are not published; all those little bits
        | of experience add up to a lot of competitive advantage in many
        | venues, so once again, thanks to Zuck and co for publishing.
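        | 
        | Edit: for the curious, I believe the fix centers on QK-Norm
        | (layer-normalizing the queries and keys inside attention so no
        | modality can win the softmax by inflating its logit norms), plus
        | some reordering of the layer norms at the 34B scale. A rough
        | PyTorch sketch of the QK-Norm idea, illustrative only and not
        | their actual code:
        | 
        |     import torch.nn.functional as F
        |     
        |     def qk_norm_attention(q, k, v):
        |         # Normalize q and k per head before the dot product so
        |         # no modality can blow up the attention logits.
        |         d = q.shape[-1]
        |         q = F.layer_norm(q, (d,))
        |         k = F.layer_norm(k, (d,))
        |         logits = q @ k.transpose(-2, -1) / d ** 0.5
        |         return F.softmax(logits, dim=-1) @ v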
        
         | md_rumpf wrote:
         | the modality competition was one of my favorite insights, too!
        
       | md_rumpf wrote:
        | every 3rd sentence is "the model was not trained on data from
        | meta's products"
        
         | keyle wrote:
          | That makes sense. You probably don't want to train your LLM
          | on your uncle's dubious claims about the government, flat
          | earthers, and the like :)
        
       | gdiamos wrote:
       | Does Meta plan to open source these models?
        
       | msoad wrote:
        | Compared to Mirasol3B [1], this does not support audio as a
        | modality. What Google has done with Mirasol3B is what made the
        | "Astra" demo at Google I/O possible. They cheat a little by
        | converting audio to images (spectrograms) and video to 25
        | frames, with some sort of attention mechanism over whatever
        | changes across those frames. So the tokenizer is basically the
        | same for audio, video, and images.
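        | 
        | To make the spectrogram trick concrete: roughly something like
        | this (a toy sketch, not Mirasol's actual pipeline; the file name
        | and settings are made up):
        | 
        |     import librosa
        |     import numpy as np
        |     
        |     # Turn raw audio into a 2-D "image" so the same image
        |     # tokenizer can be reused for audio.
        |     y, sr = librosa.load("clip.wav", sr=16_000)
        |     mel = librosa.feature.melspectrogram(
        |         y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)
        |     mel_db = librosa.power_to_db(mel, ref=np.max)
        |     # mel_db has shape (128, frames); patchify it and feed it
        |     # to the image tokenizer exactly like a photo.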
       | 
        | I believe Meta is going in this direction with multimodality as
        | well. The new GPT voice mode is probably using the same
        | architecture.
        | 
        | What's mind-boggling is that models perform better at the same
        | parameter size when a new modality is added to them!
       | 
       | It seems obvious that 3D is the next modality.
       | 
       | [1] https://arxiv.org/pdf/2311.05698
        
       | kriro wrote:
       | Only browsed but this is really interesting and I'm glad it was
       | published.
       | 
        | I understand why a unified model is an interesting thing to work
        | on, but doesn't the discovery of "modality competition" suggest
        | that, at least in the short term, it might be even better to
        | train specialized models for each modality plus some sort of
        | modality supervisor (a glue-code model)?
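        | 
        | Roughly the shape of the "glue" I have in mind, as a toy sketch
        | (hypothetical names only, nothing like a real design):
        | 
        |     # Per-modality experts plus a small supervisor that fuses
        |     # their outputs into one answer.
        |     def respond(segments, experts, supervisor):
        |         # segments: list of (modality, payload) pairs
        |         parts = [experts[mod](x) for mod, x in segments]
        |         return supervisor(parts)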
        
       | dankle wrote:
       | Are they downloadable?
        
       | mjburgess wrote:
       | Am I reading this correctly:
       | 
        | Training time was 4,282,407 GPU-hours. At, conservatively, 200 W
        | per GPU, that's (4282407 h * 200 W) / 1e9 ~= 0.86 GWh, call it
        | 1 GWh. At 10c/kWh that's roughly $100,000?
        | 
        | So if you have a single equivalent GPU at home, that's ~500
        | years of training time and ~$100k in energy costs. Or, in
        | practice, ~3000 GPUs for two months.
        | 
        | The AI industry has to hope the world doesn't change so fast
        | that these models become useless.
       | 
       | EDIT: price is $100k
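        | 
        | EDIT 2: the back-of-the-envelope in Python, for anyone checking
        | my numbers (assumes 200 W per GPU and $0.10/kWh; ignores
        | cooling, hardware, and failed runs):
        | 
        |     gpu_hours = 4_282_407          # GPU-hours from the paper
        |     watts_per_gpu = 200            # conservative average draw
        |     kwh = gpu_hours * watts_per_gpu / 1000   # ~856,481 kWh
        |     gwh = kwh / 1_000_000                    # ~0.86 GWh
        |     cost_usd = kwh * 0.10                    # ~$85,648
        |     years_on_one_gpu = gpu_hours / (24 * 365)    # ~489 years
        |     gpus_for_two_months = gpu_hours / (24 * 60)  # ~2,974 GPUs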
        
         | TaylorAlexander wrote:
         | Thanks for the figures. I suppose with expenses like that, they
         | will be motivated to research methods of updating models which
         | have already been trained.
         | 
         | Edit: I see the price was updated
        
         | hackerlight wrote:
          | 1 GWh is 1 million kWh; multiplied by $0.10/kWh, that should
          | give $100k in energy costs?
        
           | mjburgess wrote:
            | Yes, thanks. I had assumed I had been off by a factor
            | somewhere. Still, $100k seems small -- the total cost of
            | production is in the $10M+ range.
        
             | Etheryte wrote:
              | $100k is small, but you only get away with $100k if you
              | nail everything perfectly the first time around --
              | something that we all know does not really happen. I think
              | compiling is a good parallel to training: imagine if
              | compiling your whole software project from scratch cost
              | $100k. Sure, there are incremental builds etc., but the
              | cost is steep no matter which way you look at it.
        
         | jsheard wrote:
          | Numbers like these really don't bode well for the longer-term
          | prospects of open source models; I doubt the current strategy
          | of waiting expectantly for a corporation to spoonfeed us yet
          | another $100,000 model for free is going to work forever.
          | 
          | That $100k is conservative too: it doesn't include the cost of
          | buying/renting the hardware, or the compute time spent on
          | experimental training runs, or the cost of data acquisition,
          | labeling, and cleaning, or the cost of RLHF fine-tuning.
        
           | mikehollinger wrote:
            | > Numbers like these really don't bode well for the
            | longer-term prospects of open source models; I doubt the
            | current strategy of waiting expectantly for a corporation to
            | spoonfeed us yet another $100,000 model for free is going to
            | work forever.
           | 
            | I would add "in their current form" and agree. There are
            | three things that can change here:
            | 
            | 1. Moore's law: the worldwide economy is built around the
            | steady progression of cheaper compute. Give it 36 months and
            | your $100,000 problem becomes a $25,000 problem.
            | 
            | 2. Quantization and smaller models: there will likely be
            | specializations of the various models (is this the beginning
            | of the "Monolith vs. Microservices" debate?).
            | 
            | 3. E2E training isn't for everyone: finetunes and alignment
            | are more important than an end-to-end training run, IF we
            | can coerce the behaviors we want into the models by
            | finetuning them. That, along with quantized models, (imho)
            | unlocked vision models, which are now in the "plateau of
            | productivity" on the Gartner hype cycle compared to a few
            | years ago.
           | 
            | So as an example today, I can grab a backbone and pretrained
            | weights for an object detector, and with relatively little
            | code (a few lines to a few tens of lines), relatively little
            | data (50 to 500 images), and relatively little wall-clock
            | time and energy (say 5 to 15 minutes) on a PC, I can create
            | a customized object detector that detects -my- specific
            | objects. I might need to revise it a few times, but it'll
            | work pretty well.
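            | 
            | Concretely, with torchvision the whole loop really is a
            | dozen or so lines. A rough sketch (the dataloader and class
            | count are stand-ins for your own data, not a recipe):
            | 
            |     import torch
            |     from torchvision.models.detection import (
            |         fasterrcnn_resnet50_fpn)
            |     from torchvision.models.detection.faster_rcnn import (
            |         FastRCNNPredictor)
            |     
            |     # Pretrained detector; swap the head for my classes.
            |     model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
            |     head = model.roi_heads.box_predictor
            |     in_feats = head.cls_score.in_features
            |     model.roi_heads.box_predictor = FastRCNNPredictor(
            |         in_feats, num_classes=3)  # 2 classes + background
            |     
            |     opt = torch.optim.SGD(model.parameters(), lr=0.005,
            |                           momentum=0.9)
            |     model.train()
            |     for images, targets in my_loader:  # hypothetical data
            |         losses = model(images, targets)  # dict of losses
            |         opt.zero_grad()
            |         sum(losses.values()).backward()
            |         opt.step()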
           | 
           | Why would we not see the same sort of progression with
           | transformer architectures? It hinges on someone creating the
           | model weights for the "greater good," or us figuring out how
           | to do distributed training for open source in a "seti@home"
           | style (long live the blockchain, anyone?).
        
             | jsheard wrote:
              | Yeah, there's no accounting for breakthroughs in training
              | efficiency. I wouldn't count on Moore's Law though: the
              | amount of compute you can put into these problems is
              | effectively unbounded, so more efficient silicon just means
              | those with money can train even bigger models. 3D rendering
              | is a decent analogy: Moore's Law has made it easy to render
              | something comparable to the first Toy Story movie, but
              | Pixar poured those gains back into more compute and is
              | using it to do things you definitely can't afford to.
        
       ___________________________________________________________________
       (page generated 2024-05-21 12:00 UTC)