[HN Gopher] LLaVA-O1: Let Vision Language Models Reason Step-by-...
___________________________________________________________________
LLaVA-O1: Let Vision Language Models Reason Step-by-Step
Author : lnyan
Score : 124 points
Date : 2024-11-18 09:44 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Wilsoniumite wrote:
| That first page graph has a very interesting choice of x-axis.
| jerpint wrote:
| Sadly this is seen at so many prestigious ML conferences: a
| trimmed x-axis which makes performance seem significant when
| it's sometimes merely incremental.
| exe34 wrote:
| I think it's acceptable if you're trying to show subtle
| differences - but I would probably put the whole plot and
| then the zoomed version and clearly label it as "zoomed in
| for highlighting <.....>"
| nerdponx wrote:
| You don't need to include 0 on every axis.
|
| In this case they really made the numbers smaller than they
| should be, so it's hard to see that the scale is on the
| order of single digits. It looks like this is about a 3-4%
| improvement over GPT-4o-mini and Gemini Pro 1.5.
|
| The bigger problem here is not the axis baseline, but the
| fact that I have no idea (as a non-AI-researcher) what
| benchmark this is, or if 0 is even the natural minimum. The
| caption should at least mention what the natural range of
| the x-axis is.
| Ukv wrote:
| > the fact that I have no idea (as a non-AI-researcher)
| what benchmark this is
|
| The figure labels it as "average score [on] 6
| multimodal reasoning benchmarks", and the caption notes
| that the full results are in table 7 - which lists those
| benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista,
| AI2D, Hallusion
|
| I think it's mostly fine as a lead diagram giving an
| overview before going into detail.
| nerdponx wrote:
| Right, I don't need to know what they are, I just need to
| know what "64" means. Is the baseline actually 0? That
| detail is enough to avoid actually drawing 0 on the axis.
| jdonaldson wrote:
| "Convincing you is more important than informing you"
|
| Always a pass from me, gets things off on the wrong foot right
| away.
| llm_nerd wrote:
| What's wrong with it? Among the graphed cohort the average
| benchmark scores were between 56 and 66, so they scaled the
| axis to 55-67. Scaling to differentiate is completely normal,
| and it's weird how often this is called out as being deceptive.
|
| Further, this is a paper on arXiv, so the idea by some that it's
| meant to deceive -- as if the target audience isn't going to
| immediately look at the axis labels, and dig further into what
| the benchmarks even were -- is not convincing.
|
| I'd hold more criticism for the fact that their lead graphic
| specifically excludes options which beat it (e.g. GPT-4o,
| Sonnet), though these details can be found in the chart below.
|
| Still interesting. And this "structuring AI" approach is how
| the next evolution in AI is happening.
| mdp2021 wrote:
| > _What's wrong with it_
|
| Unfortunately the practice of showing only a slice runs
| alongside that of showing the whole bars, so a better
| convention to distinguish the two would be beneficial.
|
| For example, "breaking" the bars (on the left side),
| similarly to when some bars run too far on the right side.
| I.e.:
| 
|       | ==//====|
|       | ==//========|
|       | ==//===|
|       +----------------
|
| ...which is not uncommon practice already.
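| 
| A rough sketch of that break convention in matplotlib (the model
| names and scores are invented for illustration, and the break
| marks are just short line segments drawn by hand):
| 
|     import matplotlib.pyplot as plt
| 
|     # Invented example data, only to show trimmed horizontal bars
|     # with an explicit break mark near the origin.
|     models = ["Model A", "Model B", "Model C"]
|     scores = [58.2, 61.5, 64.3]
| 
|     fig, ax = plt.subplots()
|     ax.barh(models, scores)
|     ax.set_xlim(55, 67)  # trimmed axis, as in the paper's figure
|     ax.set_xlabel("average benchmark score")
| 
|     # small slanted marks at the lower-left corner to signal the cut
|     ax.plot([-0.005, 0.005], [-0.02, 0.02], transform=ax.transAxes,
|             color="k", clip_on=False)
|     ax.plot([0.015, 0.025], [-0.02, 0.02], transform=ax.transAxes,
|             color="k", clip_on=False)
|     plt.show()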
| tucnak wrote:
| The o1 connection is made through "Evaluation of openai o1:
| Opportunities and challenges of AGI"[63]--a paper mill product
| with 50 or so authors. They produced that 280-page monstrosity
| within two weeks of the o1 release. Did I miss something?
| AFAIK, there's no published literature from OpenAI on o1, and
| nobody knows what o1 is doing exactly, but it seems the Chinese
| have figured it out in a matter of days... They say their model
| performs well on visual benchmarks, but I suspect that mostly
| owes to overfitting on those benchmarks in the first place.
|
| Consider their Proposed Method:
|
| "Each stage is initiated at the model's discretion, without
| external prompt engineering frameworks or additional prompting.
| Specifically, we provide the model with four pairs of special
| tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>,
| <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.
|
| These tags correspond to summarizing the response approach,
| describing relevant image content, conducting reasoning, and
| preparing a final answer, respectively. Upon training, the model
| autonomously selects these tags as needed, activating each stage
| based on its own judgment.
|
| As with OpenAI o1 [63], all stages are completed by the model in
| a single inference pass."
|
| [63]: https://arxiv.org/pdf/2409.18486
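| 
| To make the quoted scheme concrete, here is a minimal sketch (my
| own illustration in Python, not the authors' code) of pulling
| the four stage blocks back out of a decoded response:
| 
|     import re
| 
|     STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]
| 
|     def split_stages(response: str) -> dict:
|         """Return whichever of the four stage blocks appear."""
|         stages = {}
|         for tag in STAGES:
|             m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
|             if m:
|                 stages[tag.lower()] = m.group(1).strip()
|         return stages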
| startupsfail wrote:
| Generating data with OpenAI model AND copying the approach from
| OpenAI model. This is a bit unsatisfactory; it's like saying you
| wrote some working code when in fact you've decompiled the
| binary and then compiled it again.
| exe34 wrote:
| well if you have working code at the end, you made progress.
| closedAI can pull any model at any time for a profit.
| yalok wrote:
| This quote summarizes the main secret sauce to me - once the
| model generates a wrong token/phrase, the whole answer goes
| south - and it basically explains why the whole CoT approach
| works: prevent the LLM from generating a wrong answer with 2
| tricks: 1) ask the LLM explicitly to generate intermediate steps
| instead of a final answer, and 2) use beam search (filtering
| from several candidates at each stage) to reduce the risk of
| picking a wrong answer even further.
|
| Quote from this paper: "Moreover, they (VLMs) frequently deviate
| from logical reasoning toward conclusions, instead presenting a
| conclusion prematurely and subsequently attempting to justify
| it. Given that language models generate responses
| token-by-token, once an erroneous conclusion is introduced, the
| model typically continues along a flawed reasoning path."
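| 
| As a rough sketch of trick 2 (placeholder generate/score
| functions of my own, not the paper's actual inference code): at
| each stage, sample several candidates, keep the best-scoring
| one, and only then move on.
| 
|     # Stage-level best-of-N search; the caller supplies
|     # generate(context, stage) -> str and score(text) -> float.
|     def staged_search(prompt, stages, generate, score, n_candidates=4):
|         context = prompt
|         for stage in stages:
|             candidates = [generate(context, stage) for _ in range(n_candidates)]
|             best = max(candidates, key=score)
|             context += best  # commit the chosen stage before continuing
|         return context
| 
|     # e.g. stages = ["summary", "caption", "reasoning", "conclusion"]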
| Jackson__ wrote:
| Figure 2 in the paper shows what I really dislike about a lot of
| vision model benchmarks.
|
| I care about whether these VLMs can accurately _see_ and
| _describe_ things in a picture. Meanwhile the vision part of
| these benchmarks is a lot of extremely basic OCR that any VLM
| of the past year can do. The gains in score come from the LM
| improving its logic skills, not from the actual vision ability
| improving.
| resource_waste wrote:
| What are options to fine tune?
|
| For instance, if I have a CAD model of a screw fastened to a
| wall, can I teach it that it's a screw fastened to a wall?
|
| I have years' worth of potential training data.
|
| Consider this a multi-million dollar problem.
| abdussamit wrote:
| This is quite an interesting problem. Hope you find a solution
| to this, and wish I had the right knowledge to work on it
| a3w wrote:
| PSA: LLMs don't reason, they pattern match. Renaming stuff means
| different results, since the pattern is not matched. This has
| even been shown for OpenAI o1.
| mptest wrote:
| Has it been shown for o1 conclusively? I'd love to read the
| paper. I recall that Apple paper arguing against reasoning,
| where fuzzed question data caused performance degradation; it
| caught a lot of traction, but IIRC o1 had pretty resilient
| performance compared to previous models. To be clear, I agree
| with your sentiment. I just have yet to see definitive data
| showing that o1 is not fundamentally more resilient to the
| types of test we use to discern "reasoning" from "pattern
| matching".
|
| I watched a professor's lecture on the likely candidates for
| what the open-source LLM community thinks is going on in o1 [0],
| and I'm not convinced it's still simple pattern matching.
| 
| [0] https://youtu.be/6PEJ96k1kiw
| SubiculumCode wrote:
| Can you provide an example or link?
|
| I'm not so confident that humans reason in a fundamentally
| different way than pattern matching. Perhaps paradigms focused
| on predicting the next token are too limiting. Reasoning
| plausibly involves pattern matching relevant schema
| representations, then executing along that schema. The ability
| to intuit that an existing schema is applicable to a certain
| situation is a good measure of intelligence, IMO. Could even
| make a good llm metric.
| mdp2021 wrote:
| > _humans reason in a fundamentally different way_
|
| After having formulated an idea, do you put it on your
| intellectual bench and re-examine it, purposefully,
| analytically? Well, that is more than plain pattern matching
| over intellectual keys - it is procedural.
|
| And what about those intellectual keys, or <<schemas>>: how
| are they generated? Through a verification and consolidation
| that go beyond the original (pattern-matching) intuition.
| blixt wrote:
| I don't completely disagree but I believe it's a bit more fuzzy
| than that. From what I understand, the models learn a very
| compressed version of what they receive as input and produce as
| output. While not sufficient to generalize, you could say they
| memorize some very high-dimensional function to cause the
| expected text to be produced, and they can turn on and combine
| multiple of these functions (multiply by non-zero, sum, etc).
| So on some level an LLM can kind of perform logic on the input,
| even if it has a slightly novel pattern. But at the same time,
| no model has been shown to completely generalize the way a
| human would.
|
| And let's also be fair: it would take a lot of effort for a
| human to generalize to a previously unseen pattern as well, so
| I always wonder just how useful it is to make such binary
| statements as "models don't reason" or "they're stochastic
| parrots". But maybe it's to counterbalance the claims that
| they are sentient, that AGI is here, and so on?
| blovescoffee wrote:
| You're going to PSA an opinion?
| snats wrote:
| This paper does not compare against MOLMO or Qwen, so I would
| take it with a grain of salt.
___________________________________________________________________
(page generated 2024-11-18 23:00 UTC)