[HN Gopher] Can Large Language Models Understand Symbolic Graphi...
___________________________________________________________________
Can Large Language Models Understand Symbolic Graphics Programs?
Author : belter
Score : 62 points
Date : 2024-08-16 16:40 UTC (6 hours ago)
(HTM) web link (www.arxiv.org)
(TXT) w3m dump (www.arxiv.org)
| Lichtso wrote:
| What an awful paper title, saying "Symbolic Graphics Programs"
| when they just mean "vector graphics". I don't understand why
| they cannot just use the established term instead. Also, there
| is no "program" here, in the same way that coding HTML is not
| programming, as vector graphics are not supposed to be Turing
| complete. And where they pulled the "symbolic" from is completely
| beyond me.
| tines wrote:
| > Also, there is no "program" here, in the same way that coding
| HTML is not programming, as vector graphics are not supposed to
| be Turing complete. And where they pulled the "symbolic" from
| is completely beyond me.
|
| Aren't HTML and vector graphics descriptions both data
| structures that could be interpreted via a Turing-complete
| interpreter? I don't see the difference between HTML and a C
| AST in this regard.
| jlarocco wrote:
| There's a slippery slope there. Is a Word document a program?
| Is a PNG file?
|
| A computer program is a data structure, but data structures
| are not necessarily computer programs.
| tines wrote:
| True, I suppose HTML is the better example: it's a tree
| description language, whereas PNG files, Word documents,
| etc. aren't.
| justsomehnguy wrote:
| > Is a Word document a program?
|
| Even if not, there is always OLE embedding.
| merlincorey wrote:
| I'm more curious how they think LLMs can imagine things:
|
| > To understand symbolic programs, LLMs may need to possess the
| ability to imagine how the corresponding graphics content would
| look without directly accessing the rendered visual content
|
| To my understanding, LLMs are predictive engines based upon
| their tokens and embeddings without any ability to "imagine"
| things.
|
| As such, an LLM might be able to tell you that the following
| SVG is a black circle because it is in Mozilla
| documentation[0]:
      <svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">
        <circle cx="50" cy="50" r="50" />
      </svg>
|
| However, I highly doubt any LLM could tell you the following is
| a "Hidden Mickey" or "Mickey Mouse Head Silhouette":
      <svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
        <circle cx="100" cy="100" r="50" />
        <circle cx="50" cy="50" r="40" />
        <circle cx="150" cy="50" r="40" />
      </svg>
|
| - [0] https://developer.mozilla.org/en-US/docs/Web/SVG/Element/cir...
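|
| (As a minimal sketch of how one could test this claim, assuming
| the OpenAI Python client and a gpt-4o-style chat model; the
| model name and prompt wording are illustrative, not from the
| thread:)
      # Sketch: ask a chat model, from source alone, what an SVG depicts.
      # Assumes the OpenAI Python client; the model name is illustrative.
      from openai import OpenAI

      svg_source = """<svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
        <circle cx="100" cy="100" r="50" />
        <circle cx="50" cy="50" r="40" />
        <circle cx="150" cy="50" r="40" />
      </svg>"""

      client = OpenAI()  # reads OPENAI_API_KEY from the environment
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[{
              "role": "user",
              "content": "Without rendering it, what does this SVG "
                         "depict?\n\n" + svg_source,
          }],
      )
      print(response.choices[0].message.content)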
| randomdata wrote:
| _> without any ability to "imagine" things._
|
| What's imagining, then? The way LLMs explore different
| predictive branches in order to find an optimal solution
| doesn't seem all that different from what I consider
| imagining: thinking about what could be and considering
| different variations on that idea.
|
| An LLM isn't a brain, so the word isn't meant in the truest
| human sense, but it seems like a decent analogy to me.
| rel_ic wrote:
| Check out Act One of this This American Life episode
| https://www.thisamericanlife.org/803/transcript
|
| TLDR: it seems like an LLM might be able to tell you your SVG
| is a "Mickey Mouse Head Silhouette"
| kgen wrote:
| I was just about to post the same thing -- quite a
| fascinating test of gpt's capabilities
| montebicyclelo wrote:
| Chat GPT:
|
| > Given the arrangement of three overlapping circles, it
| resembles the classic depiction of a *Mickey Mouse* head
| silhouette:
|
| > The two smaller circles represent Mickey's ears.
|
| > The larger circle represents his head.
|
| > This is a stylized version of the iconic Mickey Mouse logo.
|
| Imo: in order to predict the next token for non-trivial
| tokens (of which there are many in the training data), you
| do have to do some more complex thinking/reasoning than just
| a lookup of past training data.
| Lichtso wrote:
| Can you challenge it and e.g. double the radius of the last
| circle? That should make the rendered output
| unintelligible. Let's see if it still thinks it represents
| the iconic logo.
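|
| (For reference, a minimal sketch of how to produce such a
| perturbed variant programmatically, using Python's standard
| library ElementTree; the doubling is the only change:)
      # Sketch: double the radius of the last <circle> in the SVG above.
      # Uses only the Python standard library.
      import xml.etree.ElementTree as ET

      SVG_NS = "http://www.w3.org/2000/svg"
      ET.register_namespace("", SVG_NS)  # keep plain tag names on output

      svg_source = """<svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
        <circle cx="100" cy="100" r="50" />
        <circle cx="50" cy="50" r="40" />
        <circle cx="150" cy="50" r="40" />
      </svg>"""

      root = ET.fromstring(svg_source)
      circles = root.findall(f"{{{SVG_NS}}}circle")
      last = circles[-1]
      last.set("r", str(int(last.get("r")) * 2))  # r=40 -> r=80

      print(ET.tostring(root, encoding="unicode"))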
| jaymzcampbell wrote:
| Changing the last one to r=80 instead of r=40, GPT4o ends
| up saying:
|
| > The circles overlap in certain areas, depending on
| their size and position, creating a layered visual effect
| where the largest circle (third one) dominates most of
| the canvas space.
|
| Though tbf to it, I'm not sure I'd say this looks like MM
| either: https://i.imgur.com/0VHdocf.png (unless I knew
| this was the intent prior).
| montebicyclelo wrote:
| It passes the test you've proposed. For the below
| results, the prompt is the same for A and B, except that
| in B the radius of the last circle is twice as big, to
| make the rendered output unintelligible.
|
| A. Valid Mickey is detected by the model. "...This
| arrangement might resemble a basic version of a Mickey
| Mouse shape, where the two smaller circles represent the
| ears and the larger circle represents the head...",
| https://chatgpt.com/share/3999859a-b6db-4671-8b69-0ec6a5bac3...
|
| B. Invalid Mickey is not misclassified as Mickey by the
| model and is correctly described. "...these circles will
| overlap, creating a pattern where the largest circle
| (Circle 3) dominates the right side of the canvas, with
| the other two smaller circles overlapping it and each
| other in the middle...",
| https://chatgpt.com/share/df3c57ac-495b-4e4c-b00c-bae31781c4...
| jaymzcampbell wrote:
| GPT4o certainly gave it a good go (prompt: "What cartoon
| character would you say this SVG code renders most closely
| to?"). That's maybe a little too "prompting" tbf:
|
| """ The SVG code you provided creates an image with three
| circles:
|
| A large circle at the center (at coordinates 100, 100) with a
| radius of 50. Two smaller circles at the top left (at
| coordinates 50, 50) and top right (at coordinates 150, 50)
| with a radius of 40 each. This configuration closely
| resembles the head and ears of Mickey Mouse. The larger
| circle represents the head, and the two smaller circles
| represent the ears. """
|
| Slightly more generic prompt: "What is this code drawing? Can
| you suggest what it might be representing?"
|
| """ What It Might Represent: Mickey Mouse Head: The
| arrangement resembles a simplified "Mickey Mouse" head, where
| the large central circle represents the head, and the two
| smaller circles represent the ears.
|
| Abstract Art: Alternatively, it could be an abstract design
| with three overlapping circles of different sizes, suggesting
| some form of simple geometric art.
|
| The circles are positioned such that the smaller ones are
| symmetrically placed above the larger central one, which is a
| typical representation of a head with ears. """
|
| And more generic "I've been sent this code, what is it
| drawing?" ends with:
|
| """ You will see three overlapping circles. The largest
| circle is near the center, while two smaller circles are
| positioned above and slightly to the left and right of the
| center. The arrangement may resemble a basic representation
| of a Mickey Mouse head shape, with the large circle as the
| face and the two smaller circles as ears. """
| CaptainFever wrote:
| I'm impressed. How does it do that?
| kevindamm wrote:
| Were those all within the same conversation, or the
| beginnings of separate conversations?
| jaymzcampbell wrote:
| Separate conversations; I wanted to avoid any prehistory.
| They were all under my logged-in user, though, so I can't
| exclude any possible inference that way.
| westurner wrote:
| If the LLM saves the SVG vector graphic to a raster image
| like a PNG and prompts with that instead, it will have no
| trouble labeling what's depicted in the SVG.
|
| So, the task is "describe what an SVG depicts without saving
| it to a raster image and prompting with that"?
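|
| (A minimal sketch of that rasterize-then-ask pipeline, assuming
| cairosvg for rendering and an OpenAI-style vision endpoint; the
| model name, prompt, and input filename are illustrative:)
      # Sketch: rasterize the SVG, then ask a vision model about the PNG.
      # Assumes cairosvg and the OpenAI Python client; names are illustrative.
      import base64
      import cairosvg
      from openai import OpenAI

      svg_source = open("mickey.svg").read()  # hypothetical input file
      png_bytes = cairosvg.svg2png(bytestring=svg_source.encode("utf-8"))
      png_b64 = base64.b64encode(png_bytes).decode("ascii")

      client = OpenAI()
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "What does this image depict?"},
                  {"type": "image_url",
                   "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
              ],
          }],
      )
      print(response.choices[0].message.content)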
| ctoth wrote:
| > However, I highly doubt
|
| I never understand why people don't just test this?
|
| Claude 3.5 sonnet, first test, prompt: Any idea what this SVG
| is describing?:
|
| This SVG describes a simple geometric composition consisting
| of three circles:
|
| 1. The largest circle is centered at coordinates (100, 100)
| with a radius of 50 units.
|
| 2. A smaller circle is positioned at the top-left, centered
| at (50, 50) with a radius of 40 units.
|
| 3. Another smaller circle is positioned at the top-right,
| centered at (150, 50), also with a radius of 40 units.
|
| The overall effect, if rendered, would resemble a basic
| face-like structure or a simplified Mickey Mouse silhouette,
| with two smaller circles (potentially representing ears or
| eyes) above a larger central circle (potentially representing
| a face or head). The SVG uses a viewBox of "0 0 175 175",
| which defines the coordinate system and aspect ratio for the
| image. Would you like me to create a visual representation of
| this SVG for you?
| smallnamespace wrote:
| > I never understand why people don't just test this?
|
| One reason might be that you get the benefit of feeling
| superior to LLMs without any cost or risk of finding
| yourself mistaken.
| jchw wrote:
| > Also, there is no "program" here, in the same way that coding
| HTML is not programming, as vector graphics are not supposed to
| be Turing complete.
|
| I think the reason we don't view HTML as a programming
| language is that it is explicitly designed to be a markup
| language that _declares_ content rather than a series of
| instructions that is interpreted as a program. A program
| needn't demonstrate Turing completeness to be a "computer
| program"; it just needs to be a sequence of instructions that
| a computer executes. To me, that suggests there's a degree of
| abstractness and subjectivity involved. For example, any SVG
| document could be rewritten 1:1, with no loss in fidelity, as
| a series of commands that has the same effect, as can pretty
| much any declarative markup language; what actually happens
| during parsing is hard to distinguish from an interpreter.
| Humans can "know it when they see it", but I doubt there's an
| exact criterion that goes along with the human "feel" of what
| makes a program, a program.
| brap wrote:
| Crazy that LLMs are able to do that when they're not even trained
| on visual data.
| skybrian wrote:
| It's difficult to tell what they're proposing from the title or
| abstract, but asking questions about an SVG image, given only the
| source code, seems like an interesting challenge.
|
| There are a lot of ways to generate the same image from
| different source code, and a slight change to the source can
| result in a very different image. This could be used to train
| a deeper understanding of program-like text.
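|
| (A small sketch of what such pairs could look like; the
| transform-based equivalence and the perturbation below are
| illustrative, not taken from the paper:)
      # Sketch: two SVG sources that render identically, plus a tiny
      # textual perturbation that changes the image a lot.

      # Variant A: a circle placed directly at (50, 50).
      variant_a = """<svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">
        <circle cx="50" cy="50" r="40" />
      </svg>"""

      # Variant B: the same circle, positioned via a translate() transform;
      # it renders identically to variant A.
      variant_b = """<svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">
        <circle cx="0" cy="0" r="40" transform="translate(50 50)" />
      </svg>"""

      # Perturbation: a small edit (r=40 -> r=4) that yields a very
      # different rendering from nearly identical source code.
      perturbed = variant_a.replace('r="40"', 'r="4"')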
| slashdave wrote:
| In the title and abstract there is a proposal that LLMs
| "understand", "reason", and "imagine." It's off putting to me,
| since LLMs are not capable of any of these.
| mcbuilder wrote:
| Not in the traditional sense, but how are we to possibly gauge
| an LLM's "intelligence" without framing questions like this?
| Kuinox wrote:
| Can you come up with a proof for your bold statements?
| Currently, leading researchers on the subject have no idea
| whether that's the case.
| slashdave wrote:
| Hard to prove a negative. On the other hand, relying on a
| poorly defined idea of "emergent behavior" when there is no
| architectural rationale is rather poor form, IMO. It's wordier
| but safer to qualify these terms appropriately. That
| is, to test the "appearance" or "simulation" of human-like
| abilities.
| warkdarrior wrote:
| That's all I'm hoping for, an ML model that appears to be
| as good as any human in existence, for all human skills.
|
| To be fair, this is what we do with humans as well, we test
| their skills. We (rarely!) take humans apart, inspect their
| brains, and declare that they have reasoning or
| understanding capabilities.
| abernard1 wrote:
| Since people are not reading the paper, here are its
| conclusions:
|
| > Is it really easy to answer semantic reasoning questions over
| symbolic graphics programs? We provide an intriguing experiment
| to demonstrate that SVG programs can be quite difficult for LLMs
| to understand such that even if the corresponding rendered images
| are fairly easy for humans to recognize, all these powerful LLMs
| still fail dramatically, only reaching a chance-level accuracy.
|
| They then fine-tune the model for this problem, and it achieves
| better results.
| magicalhippo wrote:
| I was recently playing with some LLMs regarding a relay driver
| circuit. ChatGPT generated a circuit diagram for me in ASCII art
| style, but it contained some errors. For fun I tried to ask some
| others if they could identify the errors.
|
| Here's the diagram (formatting errors were from ChatGPT):
      [ASCII-art schematic, layout lost in transcription: +5V to the
       relay coil, a diode, a MOSFET with drain/gate/source labels, a
       220Ω resistor from the gate to a GPIO (3.3V MCU), a 10kΩ
       resistor, and GND]
|
| Here's what Llama 3.1 70B said:
|
| _After reviewing the diagram, I've found a few potential
| issues:_
|
| _- The diode is not connected to the relay coil in a way that
| would prevent back-EMF from damaging the MOSFET. Typically,
| the diode is connected in parallel with the relay coil, with
| the anode connected to the positive terminal of the coil and
| the cathode connected to the negative terminal. In this
| diagram, the diode appears to be connected in series with the
| relay coil, which is not correct._
|
| _- The MOSFET is not properly biased. The gate of the MOSFET
| is connected to the GPIO pin through a 220Ω resistor, but
| there is no pull-down resistor to ensure the gate is at a
| known state when the GPIO pin is not driving it. The 10kΩ
| resistor connected to GND is likely intended to be a pull-down
| resistor, but it's not clear if it's connected to the gate or
| not._
|
| _- The relay coil is not specified, but it's likely that it
| requires a higher current than the MOSFET can handle.
| Typically, a relay driver circuit would include a
| current-limiting resistor or a transistor to handle the high
| current required by the relay coil._
|
| So it correctly identified one of the main issues, that the
| flyback diode is wrong, but failed to correctly identify the
| issue with the MOSFET gate being directly tied to ground. I
| admit it's not a typical way to draw things, so fair enough;
| after all, it did mention there was something fishy with the
| pull-down.
|
| Claude 3 Haiku provided a very similar answer. It was a bit more
| explicit and included more details. For instance it thought there
| might be an issue driving the MOSFET from 3.3V, though failed to
| suggest a logic-level MOSFET, instead suggesting a level-shifter.
| It also suggested the 220 Ohm gate resistor might not be enough
| to protect the microcontroller. Though it also tried to correct
| the diagram but failed: it changed the diode to ->|<-, and didn't
| put it in parallel with the coil.
|
| All in all, somewhat impressed.
|
| However, they get easily confused. I tried to improve the diagram,
| and asked again. Here's the updated diagram:
      [ASCII-art schematic, layout lost in transcription: revised
       version with the diode placed in parallel with the relay coil,
       the MOSFET's drain/gate/source labelled, a 220Ω gate resistor
       to the GPIO (3.3V MCU), a 10kΩ pull-down, and the source and
       pull-down sharing the GND point]
|
| And Llama 3.1 now says:
|
| _After examining the diagram, I notice a few potential issues:_
|
| _- The diode is not connected to the relay coil in a way that
| would prevent back-EMF damage to the MOSFET. Typically, a
| flyback diode is connected in parallel with the relay coil,
| with the anode connected to the coil's positive terminal and
| the cathode connected to the coil's negative terminal. In this
| diagram, the diode appears to be connected in series with the
| relay coil, which is not correct._
|
| _- The MOSFET's source (S) is not connected to GND. In a
| typical relay driver circuit, the source of the MOSFET is
| connected to GND to provide a return path for the current._
|
| _- The 10kΩ resistor is not connected to anything. It's
| possible that it's intended to be a pull-down resistor for the
| MOSFET's gate, but it's not clear where it's supposed to be
| connected._
|
| So it totally fails to "see" that the 10k is now correctly
| connected to ground, and that it shares the ground point (in the
| diagram) with the source of the MOSFET. At least it figures it's
| meant to be a pull-down.
|
| However, it _did_ notice the diode is now correctly placed in
| parallel with the coil, and again correctly saw that the diode
| was the wrong way around. So that's something.
|
| edit: I changed the diode direction and asked if that fixed it.
| Llama 3.1 did not "see" that and suggested a change.
|
| Claude 3 did much better and noticed the direction change and was
| happy.
|
| As I said, I was just fooling around with this and thought I'd
| share. They're confident, but sometimes oh-so-wrong. Still, using
| a service where you can ask the same question to multiple models
| is nice. That way it can be easier to see when they're
| confabulating.
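|
| (A minimal sketch of that kind of side-by-side check, assuming
| an OpenAI-compatible aggregator endpoint; the base URL, model
| identifiers, and input file are illustrative assumptions, not
| the service used above:)
      # Sketch: send the same question to several models through one
      # OpenAI-compatible endpoint. Base URL and model names are
      # assumptions for illustration, not the service used above.
      import os
      from openai import OpenAI

      client = OpenAI(
          base_url="https://openrouter.ai/api/v1",  # assumed aggregator
          api_key=os.environ["OPENROUTER_API_KEY"],
      )

      question = ("Review this relay driver circuit diagram for errors:\n"
                  + open("circuit.txt").read())  # hypothetical file

      for model in ["meta-llama/llama-3.1-70b-instruct",  # assumed names
                    "anthropic/claude-3-haiku"]:
          reply = client.chat.completions.create(
              model=model,
              messages=[{"role": "user", "content": question}],
          )
          print(f"--- {model} ---")
          print(reply.choices[0].message.content)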
___________________________________________________________________
(page generated 2024-08-16 23:00 UTC)