[HN Gopher] Can Large Language Models Understand Symbolic Graphi...
       ___________________________________________________________________
        
       Can Large Language Models Understand Symbolic Graphics Programs?
        
       Author : belter
       Score  : 62 points
       Date   : 2024-08-16 16:40 UTC (6 hours ago)
        
 (HTM) web link (www.arxiv.org)
 (TXT) w3m dump (www.arxiv.org)
        
       | Lichtso wrote:
       | What an awful paper title, saying "Symbolic Graphics Programs"
       | when they just mean "vector graphics". I don't understand why
       | they can not just use the established term instead. Also, there
       | is no "program" here, in the same way that coding HTML is not
       | programming, as vector graphics are not supposed to be Turing
       | complete. And where they pulled the "symbolic" from is completely
       | beyond me.
        
         | tines wrote:
         | > Also, there is no "program" here, in the same way that coding
         | HTML is not programming, as vector graphics are not supposed to
         | be Turing complete. And where they pulled the "symbolic" from
         | is completely beyond me.
         | 
         | Aren't HTML and vector graphics descriptions both data
         | structures that could be interpreted via a Turing-complete
         | interpreter? I don't see the difference between HTML and a C
         | AST in this regard.
        
           | jlarocco wrote:
           | There's a slippery slope there. Is a Word document a program?
           | Is a PNG file?
           | 
           | A computer program is a data structure, but data structures
           | are not necessarily computer programs.
        
             | tines wrote:
             | True, I suppose HTML is the better example, it's a tree
             | description language, whereas PNG files, Word documents,
             | etc. aren't.
        
             | justsomehnguy wrote:
             | > Is a Word document a program?
             | 
             | Even if not, there is always OLE embedding.
        
         | merlincorey wrote:
         | I'm more curious how they think LLMs can imagine things:
         | 
         | > To understand symbolic programs, LLMs may need to possess the
         | ability to imagine how the corresponding graphics content would
         | look without directly accessing the rendered visual content
         | 
         | To my understanding, LLMs are predictive engines based upon
         | their tokens and embeddings without any ability to "imagine"
         | things.
         | 
         | As such, an LLM might be able to tell you that the following
         | SVG is a black circle because it is in Mozilla
         | documentation[0]:
         | 
         |     <svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">
         |       <circle cx="50" cy="50" r="50" />
         |     </svg>
         | 
         | However, I highly doubt any LLM could tell you the following is
         | a "Hidden Mickey" or "Mickey Mouse Head Silhouette":
         | 
         |     <svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
         |       <circle cx="100" cy="100" r="50" />
         |       <circle cx="50" cy="50" r="40" />
         |       <circle cx="150" cy="50" r="40" />
         |     </svg>
         | 
         | - [0] https://developer.mozilla.org/en-US/docs/Web/SVG/Element/cir...
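What the second SVG in the comment above looks like can be checked without a browser. The following is a minimal sketch, not a real SVG renderer: it assumes the document contains only filled `<circle>` elements, and the helper names `parse_circles` and `render_ascii` are made up for this illustration. It samples the viewBox on a coarse grid and prints `#` wherever a sample point falls inside any circle.

```python
import xml.etree.ElementTree as ET

# The three-circle SVG quoted in the comment above.
SVG = """<svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
  <circle cx="100" cy="100" r="50" />
  <circle cx="50" cy="50" r="40" />
  <circle cx="150" cy="50" r="40" />
</svg>"""

NS = "{http://www.w3.org/2000/svg}"


def parse_circles(svg_text):
    """Return (viewBox, [(cx, cy, r), ...]) for the <circle> elements."""
    root = ET.fromstring(svg_text)
    view_box = tuple(float(v) for v in root.get("viewBox").split())
    circles = [
        (float(c.get("cx")), float(c.get("cy")), float(c.get("r")))
        for c in root.iter(NS + "circle")
    ]
    return view_box, circles


def render_ascii(svg_text, cols=35, rows=18):
    """Sample the viewBox on a cols x rows grid; '#' marks a covered sample."""
    (min_x, min_y, width, height), circles = parse_circles(svg_text)
    lines = []
    for row in range(rows):
        y = min_y + (row + 0.5) * height / rows
        line = ""
        for col in range(cols):
            x = min_x + (col + 0.5) * width / cols
            inside = any(
                (x - cx) ** 2 + (y - cy) ** 2 <= r * r for cx, cy, r in circles
            )
            line += "#" if inside else "."
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_ascii(SVG))
```

Anything drawn past the viewBox edge (the right ear reaches x = 190) simply falls outside the sampled grid, which roughly mimics clipping.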
        
           | randomdata wrote:
           | _> without any ability to "imagine" things._
           | 
           | What's imagining, then? The way LLMs explore different
           | predictive branches in order to find an optimal solution
           | doesn't seem all that different than what I consider
           | imagining: Thinking about what could be and considering
           | different variations on that idea.
           | 
           | An LLM isn't a brain, so there is no implication of it being
           | said in the truest human sense, but it seems like a decent
           | analogy to me.
        
           | rel_ic wrote:
           | Check out Act One of this This American Life episode
           | https://www.thisamericanlife.org/803/transcript
           | 
           | TLDR: it seems like an LLM might be able to tell you your SVG
           | is a "Mickey Mouse Head Silhouette"
        
             | kgen wrote:
             | I was just about to post the same thing -- quite a
             | fascinating test of gpt's capabilities
        
           | montebicyclelo wrote:
           | Chat GPT:
           | 
           | > Given the arrangement of three overlapping circles, it
           | resembles the classic depiction of a *Mickey Mouse* head
           | silhouette:
           | 
           | > The two smaller circles represent Mickey's ears.
           | 
           | > The larger circle represents his head.
           | 
           | > This is a stylized version of the iconic Mickey Mouse logo.
           | 
           | Imo: In order to predict the next token for non-trivial
           | tokens (of which there are many in the training data), you
           | do have to do some more complex thinking/reasoning than just
           | a lookup of past training data.
        
             | Lichtso wrote:
             | Can you challenge it and e.g. double the radius of the last
             | circle? That should make the rendered output
             | unintelligible. Let's see if it still thinks it represents
             | the iconic logo.
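The effect of the proposed change can be worked out from the coordinates alone. This is a small sketch under the assumption that only the three circles from the thread's SVG matter; the constant and function names are invented for the example. It checks whether an ear swallows the centre of the head and how far each circle spills past the viewBox.

```python
import math

VIEW_BOX = (0, 0, 175, 175)  # min-x, min-y, width, height from the thread's SVG
HEAD = (100, 100, 50)        # (cx, cy, r)
LEFT_EAR = (50, 50, 40)
RIGHT_EAR = (150, 50, 40)


def contains_point(circle, x, y):
    """True if (x, y) lies inside or on the circle."""
    cx, cy, r = circle
    return math.hypot(x - cx, y - cy) <= r


def circles_overlap(c1, c2):
    """True if the two discs share at least one point."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]


def overflow(circle, view_box=VIEW_BOX):
    """How far the circle sticks out past each viewBox edge (0 if inside)."""
    cx, cy, r = circle
    min_x, min_y, w, h = view_box
    return {
        "left": max(0, min_x - (cx - r)),
        "top": max(0, min_y - (cy - r)),
        "right": max(0, (cx + r) - (min_x + w)),
        "bottom": max(0, (cy + r) - (min_y + h)),
    }


def scaled(circle, factor):
    """Same centre, radius multiplied by factor."""
    cx, cy, r = circle
    return (cx, cy, r * factor)


BIG_EAR = scaled(RIGHT_EAR, 2)  # r=80, the variant proposed above

if __name__ == "__main__":
    print("original ear covers head centre:", contains_point(RIGHT_EAR, 100, 100))
    print("doubled ear covers head centre: ", contains_point(BIG_EAR, 100, 100))
    print("doubled ear touches left ear:   ", circles_overlap(BIG_EAR, LEFT_EAR))
    print("doubled ear overflow:           ", overflow(BIG_EAR))
```

With the radius doubled, the third circle covers the centre of the head and reaches the other ear, so the three shapes no longer read as a head with two separate ears.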
        
               | jaymzcampbell wrote:
               | Changing the last one to r=80 instead of r=40, GPT4o ends
               | up saying:
               | 
               | > The circles overlap in certain areas, depending on
               | their size and position, creating a layered visual effect
               | where the largest circle (third one) dominates most of
               | the canvas space.
               | 
               | Though tbf to it, I'm not sure I'd say this looks like MM
               | either: https://i.imgur.com/0VHdocf.png (unless I knew
               | this was the intent prior).
        
               | montebicyclelo wrote:
               | It passes the test you've proposed. For the results below,
               | the prompt is the same for A and B, except that in B the
               | radius of the last circle is twice as big, to make the
               | rendered output unintelligible.
               | 
               | A. Valid Mickey is detected by the model. "...This
               | arrangement might resemble a basic version of a Mickey
               | Mouse shape, where the two smaller circles represent the
               | ears and the larger circle represents the head...",
               | https://chatgpt.com/share/3999859a-b6db-4671-8b69-0ec6a5bac3...
               | 
               | B. Invalid Mickey is not misclassified as Mickey by the
               | model and is correctly described. "...these circles will
               | overlap, creating a pattern where the largest circle
               | (Circle 3) dominates the right side of the canvas, with
               | the other two smaller circles overlapping it and each
               | other in the middle...",
               | https://chatgpt.com/share/df3c57ac-495b-4e4c-b00c-bae31781c4...
        
           | jaymzcampbell wrote:
           | GPT4o certainly gave it a good go (prompt: "What cartoon
           | character would you say this SVG code renders most closely
           | to?"). That's maybe a little too "prompting" tbf:
           | 
           | """ The SVG code you provided creates an image with three
           | circles:
           | 
           | A large circle at the center (at coordinates 100, 100) with a
           | radius of 50. Two smaller circles at the top left (at
           | coordinates 50, 50) and top right (at coordinates 150, 50)
           | with a radius of 40 each. This configuration closely
           | resembles the head and ears of Mickey Mouse. The larger
           | circle represents the head, and the two smaller circles
           | represent the ears. """
           | 
           | Slightly more generic prompt: "What is this code drawing? Can
           | you suggest what it might be representing?"
           | 
           | """ What It Might Represent: Mickey Mouse Head: The
           | arrangement resembles a simplified "Mickey Mouse" head, where
           | the large central circle represents the head, and the two
           | smaller circles represent the ears.
           | 
           | Abstract Art: Alternatively, it could be an abstract design
           | with three overlapping circles of different sizes, suggesting
           | some form of simple geometric art.
           | 
           | The circles are positioned such that the smaller ones are
           | symmetrically placed above the larger central one, which is a
           | typical representation of a head with ears. """
           | 
           | And more generic "I've been sent this code, what is it
           | drawing?" ends with:
           | 
           | """ You will see three overlapping circles. The largest
           | circle is near the center, while two smaller circles are
           | positioned above and slightly to the left and right of the
           | center. The arrangement may resemble a basic representation
           | of a Mickey Mouse head shape, with the large circle as the
           | face and the two smaller circles as ears. """
        
             | CaptainFever wrote:
             | I'm impressed. How does it do that?
        
             | kevindamm wrote:
             | Were those all within the same conversation, or the
             | beginnings of separate conversations?
        
               | jaymzcampbell wrote:
               | Separate conversations. I wanted to avoid any prehistory,
               | though all were under my logged-in user, so I can't
               | exclude any possible inference that way.
        
           | westurner wrote:
           | If the LLM saves the SVG vector graphic to a raster image
           | like a PNG and prompts with that instead, it will have no
           | trouble labeling what's depicted in the SVG.
           | 
           | So, the task is "describe what an SVG depicts without saving
           | it to a raster image and prompting with that"?
        
           | ctoth wrote:
           | > However, I highly doubt
           | 
           | I never understand why people don't just test this?
           | 
           | Claude 3.5 sonnet, first test, prompt: Any idea what this SVG
           | is describing?:
           | 
           | This SVG describes a simple geometric composition consisting
           | of three circles: 1. The largest circle is centered at
           | coordinates (100, 100) with a radius of 50 units. 2. A
           | smaller circle is positioned at the top-left, centered at
           | (50, 50) with a radius of 40 units. 3. Another smaller circle
           | is positioned at the top-right, centered at (150, 50), also
           | with a radius of 40 units. The overall effect, if rendered,
           | would resemble a basic face-like structure or a simplified
           | Mickey Mouse silhouette, with two smaller circles
           | (potentially representing ears or eyes) above a larger
           | central circle (potentially representing a face or head). The
           | SVG uses a viewBox of "0 0 175 175", which defines the
           | coordinate system and aspect ratio for the image. Would you
           | like me to create a visual representation of this SVG for
           | you?
        
             | smallnamespace wrote:
             | > I never understand why people don't just test this?
             | 
             | One reason might be that you get the benefit of feeling
             | superior to LLMs without any cost or risk of finding
             | yourself mistaken.
        
         | jchw wrote:
         | > Also, there is no "program" here, in the same way that coding
         | HTML is not programming, as vector graphics are not supposed to
         | be Turing complete.
         | 
         | I think the reason why we don't view HTML as a programming
         | language is because it is explicitly designed to be a markup
         | language that _declares_ content rather than a series of
         | instructions that is interpreted as a program. A program
         | needn't demonstrate Turing completeness to be a "computer program",
         | it just needs to be a sequence of instructions that a computer
         | executes. To me, that suggests that there's a degree of
         | abstractness and subjectivity involved. For example, any SVG
         | document could also be rewritten 1:1 with no loss in fidelity
         | as a series of commands that has the same effect, as can pretty
         | much any declarative markup language; what is actually
         | happening during parsing is hard to distinguish from an
         | interpreter. Humans can "know it when they see it", but I doubt
         | there are exact criteria that can go along with the human
         | "feel" of what makes a program, a program.
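The claim that an SVG document can be rewritten 1:1 as a series of commands can be illustrated on the circle-only SVG from earlier in the thread. This is a sketch under narrow assumptions: only `<circle>` elements with `cx`, `cy` and `r` are handled, and the command names `set_viewbox` and `fill_circle` are hypothetical, not part of any SVG API.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2000/svg}"

SVG = """<svg viewBox="0 0 175 175" xmlns="http://www.w3.org/2000/svg">
  <circle cx="100" cy="100" r="50" />
  <circle cx="50" cy="50" r="40" />
  <circle cx="150" cy="50" r="40" />
</svg>"""


def svg_to_commands(svg_text):
    """Flatten a circle-only SVG into an ordered list of drawing commands."""
    root = ET.fromstring(svg_text)
    view_box = tuple(float(v) for v in root.get("viewBox").split())
    commands = [("set_viewbox",) + view_box]
    for el in root.iter(NS + "circle"):
        commands.append(
            ("fill_circle", float(el.get("cx")), float(el.get("cy")), float(el.get("r")))
        )
    return commands


def commands_to_svg(commands):
    """Run the command list through a tiny 'interpreter' that emits markup."""
    view_box = "0 0 0 0"
    body = []
    for op, *args in commands:
        if op == "set_viewbox":
            view_box = " ".join(f"{a:g}" for a in args)
        elif op == "fill_circle":
            cx, cy, r = args
            body.append(f'  <circle cx="{cx:g}" cy="{cy:g}" r="{r:g}" />')
        else:
            raise ValueError(f"unknown command: {op}")
    header = f'<svg viewBox="{view_box}" xmlns="http://www.w3.org/2000/svg">'
    return "\n".join([header, *body, "</svg>"])


if __name__ == "__main__":
    for command in svg_to_commands(SVG):
        print(command)
```

Converting to commands and back yields the same command list, which is the sense in which nothing is lost for this restricted subset.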
        
       | brap wrote:
       | Crazy that LLMs are able to do that when they're not even trained
       | on visual data.
        
       | skybrian wrote:
       | It's difficult to tell what they're proposing from the title or
       | abstract, but asking questions about an SVG image, given only the
       | source code, seems like an interesting challenge.
       | 
       | There are a lot of different ways to generate the same image,
       | and a slight change to the source code can result in a very
       | different image. This could be used to train a deeper
       | understanding of program-like text.
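Both halves of that observation can be shown with the three circles from earlier in the thread. The sketch below is an illustration only: it assumes every circle is opaque and shares one fill colour (so draw order cannot matter), and `covered_pixels` and `jaccard` are names invented here.

```python
def covered_pixels(circles, size=175):
    """Set of integer pixels whose centre lies inside at least one circle."""
    pixels = set()
    for cx, cy, r in circles:
        for y in range(max(0, int(cy - r)), min(size, int(cy + r) + 1)):
            for x in range(max(0, int(cx - r)), min(size, int(cx + r) + 1)):
                if (x + 0.5 - cx) ** 2 + (y + 0.5 - cy) ** 2 <= r * r:
                    pixels.add((x, y))
    return pixels


def jaccard(a, b):
    """Overlap between two pixel sets: 1.0 means identical images."""
    return len(a & b) / len(a | b)


ORIGINAL = [(100, 100, 50), (50, 50, 40), (150, 50, 40)]
# Different source, same picture: reordered, plus a small circle that is
# completely hidden inside the head.
REORDERED = [(150, 50, 40), (100, 100, 10), (100, 100, 50), (50, 50, 40)]
# One number changed: the last radius doubled, as tried elsewhere in the thread.
CHANGED = [(100, 100, 50), (50, 50, 40), (150, 50, 80)]

if __name__ == "__main__":
    base = covered_pixels(ORIGINAL)
    print("reordered vs original:", jaccard(base, covered_pixels(REORDERED)))
    print("changed vs original:  ", jaccard(base, covered_pixels(CHANGED)))
```

Pairs like these (textually different but visually identical, or textually close but visually different) are the kind of examples the comment suggests could be generated for training.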
        
       | slashdave wrote:
       | In the title and abstract there is a proposal that LLMs
       | "understand", "reason", and "imagine." It's off-putting to me,
       | since LLMs are not capable of any of these.
        
         | mcbuilder wrote:
         | Not in the traditional sense, but how are we to possibly gauge
         | an LLM's "intelligence" without framing questions like this?
        
         | Kuinox wrote:
         | Can you come up with a proof for your bold statements, since
         | currently the leading researchers on the subject have no idea
         | if that's the case?
        
           | slashdave wrote:
           | Hard to prove a negative. On the other hand, relying on a
           | poorly defined idea of "emergent behavior" when there is no
           | architectural rationale is rather poor form IMO. It's more
           | wordy but safer to qualify these terms appropriately. That
           | is, to test the "appearance" or "simulation" of human-like
           | abilities.
        
             | warkdarrior wrote:
             | That's all I'm hoping for, an ML model that appears to be
             | as good as any human in existence, for all human skills.
             | 
             | To be fair, this is what we do with humans as well, we test
             | their skills. We (rarely!) take humans apart, inspect their
             | brains, and declare that they have reasoning or
             | understanding capabilities.
        
       | abernard1 wrote:
       | Since people are not reading the paper, here were the
       | conclusions:
       | 
       | > Is it really easy to answer semantic reasoning questions over
       | symbolic graphics programs? We provide an intriguing experiment
       | to demonstrate that SVG programs can be quite difficult for LLMs
       | to understand such that even if the corresponding rendered images
       | are fairly easy for humans to recognize, all these powerful LLMs
       | still fail dramatically, only reaching a chance-level accuracy.
       | 
       | They then fine tune the model for this problem, and it achieves
       | better results.
        
       | magicalhippo wrote:
       | I was recently playing with some LLMs regarding a relay driver
       | circuit. ChatGPT generated a circuit diagram for me in ASCII art
       | style, but it contained some errors. For fun I tried to ask some
       | others if they could identify the errors.
       | 
       | Here's the diagram (formatting errors were from ChatGPT):
       | 
       |     +5V
       |     |
       |     |
       |     ( ) Relay Coil
       |     |
       |     +----->|----+
       |     |      Diode |
       |     |            |
       |     |
       |     |
       |     +------------+
       |     |
       |     |
       |     Drain (D) of MOSFET
       |     |
       |     Source (S)
       |     |
       |     -----
       |     GND
       |     |
       |     Gate (G)---[220Ω]--- GPIO (3.3V MCU)
       |     |
       |     [10kΩ]
       |     |
       |     GND
       | 
       | Here's what Llama 3.1 70B said:
       | 
       |  _After reviewing the diagram, I've found a few potential
       | issues:_
       | 
       |  _- The diode is not connected to the relay coil in a way that
       | would prevent back-EMF from damaging the MOSFET. Typically, the
       | diode is connected in parallel with the relay coil, with the
       | anode connected to the positive terminal of the coil and the
       | cathode connected to the negative terminal. In this diagram, the
       | diode appears to be connected in series with the relay coil,
       | which is not correct._
       | 
       |  _- The MOSFET is not properly biased. The gate of the MOSFET is
       | connected to the GPIO pin through a 220Ω resistor, but there is
       | no pull-down resistor to ensure the gate is at a known state when
       | the GPIO pin is not driving it. The 10kΩ resistor connected to
       | GND is likely intended to be a pull-down resistor, but it's not
       | clear if it's connected to the gate or not._
       | 
       |  _- The relay coil is not specified, but it's likely that it
       | requires a higher current than the MOSFET can handle. Typically,
       | a relay driver circuit would include a current-limiting resistor
       | or a transistor to handle the high current required by the relay
       | coil._
       | 
       | So it correctly identified one of the main issues, the flyback
       | diode is wrong, but failed to correctly identify the issue with
       | the MOSFET gate being directly tied to ground. I admit, it's not
       | a typical way to draw things so fair enough, after all it did
       | mention there was something fishy with the pull-down.
       | 
       | Claude 3 Haiku provided a very similar answer. It was a bit more
       | explicit and included more details. For instance it thought there
       | might be an issue driving the MOSFET from 3.3V, though failed to
       | suggest a logic-level MOSFET, instead suggesting a level-shifter.
       | It also suggested the 220 Ohm gate resistor might not be enough
       | to protect the microcontroller. Though it also tried to correct
       | the diagram but failed: it changed the diode to ->|<-, and didn't
       | put it in parallel with the coil.
       | 
       | All in all, somewhat impressed.
       | 
       | However they get easily confused. I tried to improve the diagram,
       | and asked again. Here's the updated diagram:
       | +5V                |                |                          |
       | +----->|--------+                |      Diode    |
       | |               |               ( ) Relay Coil   |
       | |               |                +---------------+
       | |                |         Drain (D) of MOSFET                |
       | |]- Gate (G)-----+-[220Ω]--- GPIO (3.3V MCU)                |
       | |                |             [10kΩ]            Source (S)
       | |                |                |
       | +----------------+                |              -----
       | GND
       | 
       | And Llama 3.1 now says:
       | 
       |  _After examining the diagram, I notice a few potential issues:_
       | 
       |  _- The diode is not connected to the relay coil in a way that
       | would prevent back-EMF damage to the MOSFET. Typically, a flyback
       | diode is connected in parallel with the relay coil, with the
       | anode connected to the coil's positive terminal and the cathode
       | connected to the coil's negative terminal. In this diagram, the
       | diode appears to be connected in series with the relay coil,
       | which is not correct._
       | 
       |  _- The MOSFET's source (S) is not connected to GND. In a typical
       | relay driver circuit, the source of the MOSFET is connected to
       | GND to provide a return path for the current._
       | 
       |  _- The 10kΩ resistor is not connected to anything. It's possible
       | that it's intended to be a pull-down resistor for the MOSFET's
       | gate, but it's not clear where it's supposed to be connected._
       | 
       | So it totally fails to "see" that the 10k is now correctly
       | connected to ground, and that it shares the ground point (in the
       | diagram) with the source of the MOSFET. At least it figures it's
       | meant to be a pull-down.
       | 
       | However it _did_ notice the diode is now correctly placed in
       | parallel with the coil, and did again correctly see the diode was
       | the wrong way around. So that's something.
       | 
       | edit: I changed the diode direction and asked if that fixed it.
       | Llama 3.1 did not "see" that and suggested a change.
       | 
       | Claude 3 did much better and noticed the direction change and was
       | happy.
       | 
       | As I said, was just fooling around with this and thought I'd
       | share. They're confident, but sometimes oh-so-wrong. Still, using
       | a service where you can ask the same question to multiple models
       | is nice. That way it can be easier to see when they're
       | confabulating.
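The flyback-diode check the models were asked to do by eye can also be stated over a netlist. The sketch below assumes the usual low-side switch layout, with the coil between +5V and the MOSFET drain, where the diode is normally placed across the coil with its cathode on the supply side so that it only conducts when the coil's field collapses. The `Part` class, the node names and the part names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    a: str  # first node; for a diode, the anode
    b: str  # second node; for a diode, the cathode


def flyback_ok(coil, diode, supply):
    """True if the diode is across the coil with its cathode on the supply node."""
    same_nodes = {coil.a, coil.b} == {diode.a, diode.b}
    cathode_on_supply = diode.b == supply
    return same_nodes and cathode_on_supply


# Hypothetical netlists for the cases discussed in the comment above.
COIL = Part("K1", "+5V", "drain")
GOOD = Part("D1", a="drain", b="+5V")          # parallel, cathode to +5V
REVERSED = Part("D1", a="+5V", b="drain")      # parallel, but the wrong way around
SERIES = Part("D1", a="coil_out", b="drain")   # in series with the coil

if __name__ == "__main__":
    for label, diode in [("good", GOOD), ("reversed", REVERSED), ("series", SERIES)]:
        print(label, flyback_ok(COIL, diode, "+5V"))
```

A structured description like this removes the ambiguity of ASCII art, which is where the models in the comment appear to have struggled most.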
        
       ___________________________________________________________________
       (page generated 2024-08-16 23:00 UTC)