[HN Gopher] LLaVA-O1: Let Vision Language Models Reason Step-by-Step
       ___________________________________________________________________
        
       LLaVA-O1: Let Vision Language Models Reason Step-by-Step
        
       Author : lnyan
       Score  : 124 points
       Date   : 2024-11-18 09:44 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Wilsoniumite wrote:
        | That first-page graph has a very interesting choice of x-axis.
        
         | jerpint wrote:
          | Sadly this is seen at so many prestigious ML conferences: a
          | trimmed x-axis that makes performance gains seem significant
          | when they're sometimes merely incremental.
        
           | exe34 wrote:
            | I think it's acceptable if you're trying to show subtle
            | differences - but I would probably show the whole plot and
            | then the zoomed version, clearly labelled as "zoomed in
            | for highlighting <.....>"
        
             | nerdponx wrote:
             | You don't need to include 0 on every axis.
             | 
             | In this case they really made the numbers smaller than they
             | should be, so it's hard to see that the scale is on the
             | order of single digits. It looks like this is about a 3-4%
             | improvement over GPT-4o-mini and Gemini Pro 1.5.
             | 
             | The bigger problem here is not the axis baseline, but the
             | fact that I have no idea (as a non-AI-researcher) what
             | benchmark this is, or if 0 is even the natural minimum. The
             | caption should at least mention what the natural range of
             | the x-axis is.
        
               | Ukv wrote:
               | > the fact that I have no idea (as a non-AI-researcher)
               | what benchmark this is
               | 
                | The figure labels it as "average score [on] 6
               | multimodal reasoning benchmarks", and the caption notes
               | that the full results are in table 7 - which lists those
               | benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista,
               | AI2D, Hallusion
               | 
               | I think it's mostly fine as a lead diagram giving an
               | overview before going into detail.
        
               | nerdponx wrote:
                | Right, I don't need to know what they are, I just need
                | to know what "64" means. Is the baseline actually 0?
                | Knowing that detail would be enough to justify not
                | actually drawing 0 on the axis.
        
         | jdonaldson wrote:
         | "Convincing you is more important than informing you"
         | 
          | Always a pass from me; it gets things off on the wrong foot
          | right away.
        
         | llm_nerd wrote:
          | What's wrong with it? Among the graphed cohort the average
          | benchmark score was between 56 and 66, so they scaled the
          | axis to 55-67. Such a strategy to differentiate is
          | completely normal, and it's weird how often this is called
          | out as being deceptive.
          | 
          | Further, this is a paper on arXiv, so the idea by some that
          | it's meant to deceive -- as if the target audience isn't
          | going to immediately look at the axis labels, and dig
          | further into what the benchmarks even were -- is not
          | convincing.
         | 
         | I'd hold more criticism for the fact that their lead graphic
         | specifically excludes options which beat it (e.g. GPT-4o,
         | Sonnet), though these details can be found in the chart below.
         | 
         | Still interesting. And this "structuring AI" approach is how
         | the next evolution in AI is happening.
        
           | mdp2021 wrote:
            | > _What's wrong with it_
            | 
            | Unfortunately, the practice of showing only a zoomed-in
            | slice looks just like that of showing the whole bars, so a
            | better convention to distinguish the two would be
            | beneficial.
            | 
            | For example, "breaking" the bars (on the left side),
            | similarly to when some bars run too far off the right
            | side. I.e.:
            | 
            |     | ==//====|
            |     | ==//========|
            |     | ==//===|
            |     +----------------
            | 
            | ...which is not uncommon practice already.
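            | 
            | A minimal sketch of that convention in matplotlib, for the
            | curious (the model names and scores here are made up for
            | illustration, not taken from the paper):
            | 
            |     import matplotlib.pyplot as plt
            |     
            |     models = ["Model A", "Model B", "Model C"]
            |     scores = [57.4, 63.2, 60.1]
            |     
            |     # Two panels sharing the y-axis: a narrow stub near
            |     # the true zero on the left, the zoomed range on the
            |     # right.
            |     fig, (ax_stub, ax_main) = plt.subplots(
            |         1, 2, sharey=True,
            |         gridspec_kw={"width_ratios": [1, 10]})
            |     for ax in (ax_stub, ax_main):
            |         ax.barh(models, scores)
            |     ax_stub.set_xlim(0, 2)    # keeps the baseline visible
            |     ax_main.set_xlim(55, 67)  # zoomed region of interest
            |     ax_stub.spines["right"].set_visible(False)
            |     ax_main.spines["left"].set_visible(False)
            |     ax_main.tick_params(left=False)
            |     
            |     # Diagonal tick marks signalling the "break".
            |     d = 0.015
            |     kw = dict(transform=ax_stub.transAxes, color="k",
            |               clip_on=False)
            |     ax_stub.plot((1 - d, 1 + d), (-d, +d), **kw)
            |     ax_stub.plot((1 - d, 1 + d), (1 - d, 1 + d), **kw)
            |     plt.savefig("broken_bars.png")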
        
       | tucnak wrote:
        | The o1 connection is made through "Evaluation of OpenAI o1:
        | Opportunities and Challenges of AGI"[63] -- a paper-mill
        | product with 50 or so authors. They produced that 280-page
        | monstrosity within two weeks of the o1 release. Did I miss
        | something? AFAIK, there's no published literature from OpenAI
        | on o1, and nobody knows what o1 is doing exactly, but it seems
        | the Chinese have figured it out in a matter of days... They
        | say their model performs well on visual benchmarks, but I
        | suspect that probably owes to them overfitting on these
        | benchmarks in the first place.
       | 
       | Consider their Proposed Method:
       | 
       | "Each stage is initiated at the model's discretion, without
       | external prompt engineering frameworks or additional prompting.
       | Specifically, we provide the model with four pairs of special
       | tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>,
       | <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.
       | 
       | These tags correspond to summarizing the response approach,
       | describing relevant image content, conducting reasoning, and
       | preparing a final answer, respectively. Upon training, the model
       | autonomously selects these tags as needed, activating each stage
       | based on its own judgment.
       | 
       | As with OpenAI o1 [63], all stages are completed by the model in
       | a single inference pass."
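        | 
        | To make the tag format concrete, here's a rough sketch of what
        | consuming that single-pass output could look like (the example
        | output text is invented, not from the paper):
        | 
        |     import re
        |     
        |     # Hypothetical single-pass output in the four-tag format.
        |     output = (
        |         "<SUMMARY>I will count the apples.</SUMMARY>"
        |         "<CAPTION>Three red apples on a table.</CAPTION>"
        |         "<REASONING>Each apple is distinct: "
        |         "1 + 1 + 1 = 3.</REASONING>"
        |         "<CONCLUSION>There are 3 apples.</CONCLUSION>"
        |     )
        |     
        |     STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]
        |     
        |     def split_stages(text):
        |         """Return {stage: content} for whichever tags the
        |         model chose to emit."""
        |         stages = {}
        |         for tag in STAGES:
        |             m = re.search(rf"<{tag}>(.*?)</{tag}>", text,
        |                           re.DOTALL)
        |             if m:
        |                 stages[tag] = m.group(1).strip()
        |         return stages
        |     
        |     print(split_stages(output)["CONCLUSION"])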
       | 
       | [63]: https://arxiv.org/pdf/2409.18486
        
       | startupsfail wrote:
        | Generating data with an OpenAI model AND copying the approach
        | of an OpenAI model. This is a bit unsatisfactory; it's like
        | saying you wrote some working code when in fact you've
        | decompiled the binary and then compiled it again.
        
         | exe34 wrote:
          | Well, if you have working code at the end, you made
          | progress. closedAI can pull any model at any time for a
          | profit.
        
       | yalok wrote:
        | This quote summarizes the main secret sauce for me - once the
        | model generates a wrong token/phrase, the whole answer goes
        | south - and it basically explains why the whole CoT approach
        | works: prevent the LLM from committing to a wrong answer with
        | two tricks: 1) ask the LLM explicitly to generate intermediate
        | steps instead of a final answer, and 2) use beam search
        | (filtering among several candidates at each stage) to reduce
        | the risk of picking a wrong answer even further.
        | 
        | Quote from this paper: "Moreover, they (VLM) frequently
        | deviate from a logical reasoning toward conclusions, instead
        | presenting a conclusion prematurely and subsequently
        | attempting to justify it. Given that language models generate
        | responses token-by-token, once an erroneous conclusion is
        | introduced, the model typically continues along a flawed
        | reasoning path."
        
       | Jackson__ wrote:
       | Figure 2 in the paper shows what I really dislike about a lot of
       | vision model benchmarks.
       | 
       | I care about whether these VLMs can accurately _see_ and
       | _describe_ things in a picture. Meanwhile the vision part of
       | these benchmarks are a lot of extremely basic OCR that any VLMs
       | of the past year can do. The gains in score come from the LM
       | improving logic skills not from the actual vision ability
       | improving.
        
       | resource_waste wrote:
       | What are options to fine tune?
       | 
        | For instance, if I have a CAD model of a screw fastened to a
        | wall, can I teach it that it's a screw fastened to a wall?
       | 
       | I have years worth of potential training data.
       | 
       | Consider this a multi-million dollar problem.
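        | 
        | Concretely, I'm imagining turning our CAD renders into
        | image/text pairs for supervised fine-tuning, something like
        | this (paths and labels are made up):
        | 
        |     import json
        |     
        |     # Each CAD scene rendered to a PNG, paired with a
        |     # description derived from the assembly metadata.
        |     examples = [
        |         {"image": "renders/scene_0001.png",
        |          "text": "A screw fastened to a wall."},
        |         {"image": "renders/scene_0002.png",
        |          "text": "A bolt passing through a bracket."},
        |     ]
        |     
        |     with open("finetune_data.jsonl", "w") as f:
        |         for ex in examples:
        |             f.write(json.dumps(ex) + "\n")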
        
         | abdussamit wrote:
          | This is quite an interesting problem. Hope you find a
          | solution to it; I wish I had the right knowledge to work on
          | it.
        
       | a3w wrote:
        | PSA: LLMs don't reason; they pattern match. Renaming things
        | yields different results, since the pattern is no longer
        | matched. This has even been shown for OpenAI o1.
        
         | mptest wrote:
          | Has it been shown for o1 conclusively? I'd love to read the
          | paper. I recall the Apple paper that caught a lot of
          | traction, which argued against reasoning because perturbed
          | question data caused performance degradation, but IIRC o1's
          | performance was pretty resilient compared to previous
          | models. To be clear, I agree with your sentiment. I just
          | have yet to see definitive data showing that o1 is not
          | fundamentally more resilient to the types of test we use to
          | discern "reasoning" from "pattern matching".
          | 
          | I watched a professor's lecture on the likely candidates for
          | what the open-source LLM community thinks is going on in
          | o1[0], and I'm not convinced it's still simple pattern
          | matching.
          | 
          | [0] https://youtu.be/6PEJ96k1kiw
        
         | SubiculumCode wrote:
         | Can you provide an example or link?
         | 
          | I'm not so confident that humans reason in a fundamentally
          | different way than pattern matching. Perhaps paradigms
          | focused on predicting the next token are too limiting.
          | Reasoning plausibly involves pattern-matching relevant
          | schema representations, then executing along that schema.
          | The ability to intuit that an existing schema is applicable
          | to a certain situation is a good measure of intelligence,
          | IMO. It could even make a good LLM metric.
        
           | mdp2021 wrote:
           | > _humans reason in a fundamentally different way_
           | 
            | After having formulated an idea, do you put it on your
            | intellectual bench and re-examine it, purposefully,
            | analytically? Well, that is more than plain pattern
            | matching over intellectual keys - it is procedural.
            | 
            | And what about those intellectual keys, or <<schemas>> -
            | how are they generated? Through a verification and
            | consolidation that go beyond the original (pattern-
            | matching) intuition.
        
         | blixt wrote:
          | I don't completely disagree, but I believe it's a bit
          | fuzzier than that. From what I understand, the models learn
          | a very compressed version of what they receive as input and
          | produce as output. While not sufficient to generalize, you
          | could say they memorize some very high-dimensional functions
          | that cause the expected text to be produced, and they can
          | turn on and combine multiple of these functions (multiply by
          | non-zero, sum, etc.). So on some level an LLM can kind of
          | perform logic on the input, even if it has a slightly novel
          | pattern. But at the same time, no model has been shown to
          | completely generalize the way a human would.
          | 
          | And let's also be fair: it would take a lot of effort for a
          | human to generalize to a previously unseen pattern as well,
          | so I always wonder just how useful it is to make such binary
          | statements as "models don't reason" or they're "stochastic
          | parrots". But maybe it's to counterbalance the claims that
          | they are sentient, AGI is here, etc.
        
         | blovescoffee wrote:
         | You're going to PSA an opinion?
        
       | snats wrote:
        | This paper doesn't compare against Molmo or Qwen, so I would
        | take it with a grain of salt.
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:00 UTC)