https://vlmsareblind.github.io/

Vision language models are blind

Pooyan Rahmanzadehgervi^1,*, Logan Bolton^1,*,
Mohammad Reza Taesiri^2, Anh Totti Nguyen^1
^*Equal contribution
^1Auburn University, ^2University of Alberta,
Paper (ArXiv) | Code | Dataset

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o
and Gemini-1.5 Pro, are powering countless image-text processing
applications and scoring high on existing vision-understanding
benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are
absurdly easy for humans, such as identifying (a) whether two circles
overlap;
(b) whether two lines intersect; (c) which letter is being circled in
a word; and (d) counting the number of circles in an Olympic-like
logo. The shockingly poor performance of four state-of-the-art VLMs
suggests their vision is, at best, like that of a person with myopia
seeing fine details as blurry, and at worst, like an intelligent
person who is blind making educated guesses.

Tasks: (1) Line Intersections, (2) Two Circles, (3) Circled Letter,
(4) Overlapping Shapes, (5) Nested Squares, (6) Counting Grid,
(7) Subway Map

Task 1: Counting line intersections

Given the impressive accuracy of VLMs on answering questions on
diagrams and charts (e.g., Sonnet-3.5 scoring 94.7% on AI2D and 90.8%
on ChartQA) [1], a reasonable hypothesis is that VLMs must be able to
see whether two graphs intersect in a chart. Here, we test this
hypothesis by asking VLMs to count the number of intersections
between two 2-segment piece-wise linear functions.

Images

We create 150 images (see Figure 1) of 2D line plots drawn on a white
canvas. Each line plot consists of two line segments, defined by
three points whose x-coordinates are fixed and equally spaced. The
y-coordinates are randomly sampled to create two plots that intersect
at exactly 0, 1 or 2 points. See Appendix A for more details.
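
One way to realize the sampling constraint above is rejection
sampling. The sketch below is a hypothetical illustration, not the
procedure from Appendix A: the x-coordinates in XS, the y-range
[0, 1), and the helper names are all assumptions.

```python
import random

XS = (0.0, 0.5, 1.0)  # fixed, equally spaced x-coordinates (values assumed)


def count_intersections(ys_a, ys_b):
    """Count crossings of two piecewise-linear plots that share XS.

    On each segment the difference of the two plots is linear, so it has
    at most one zero there; a sign change between the segment's endpoints
    marks exactly one crossing.
    """
    diffs = [a - b for a, b in zip(ys_a, ys_b)]
    if any(d == 0 for d in diffs):
        raise ValueError("plots touch exactly at a shared x-coordinate")
    return sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)


def sample_pair(target, rng):
    """Rejection-sample y-coordinates until the plots cross `target` times."""
    while True:
        ys_a = [rng.random() for _ in XS]
        ys_b = [rng.random() for _ in XS]
        no_ties = all(a != b for a, b in zip(ys_a, ys_b))
        if no_ties and count_intersections(ys_a, ys_b) == target:
            return ys_a, ys_b
```

With three shared x-coordinates there are only two segments, which is
why the count can never exceed 2.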

Fig. 1: Examples of 2D line plots used in the task, showing different
numbers of intersections (panels: 0 intersections, 1 intersection,
2 intersections, 2 intersections).

Prompts

We ask each question using two different wordings:

 1. "How many times do the blue and red line plots cross each other?"
 2. "How many times do the blue and red lines intersect?"

Groundtruth

Answers are ∈ {0, 1, 2} (random-baseline accuracy: 33%).

Results

The following table shows the performance of the four models on the
task of counting line intersections.

Thickness GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
2         45.00  70.00          64.00    80.00
3         47.00  68.00          66.00    79.00
4         54.00  71.00          62.00    73.00
Average   48.67  69.67          64.00    77.33

Qualitative samples

How many times do the blue and red lines intersect?

           Graph 1 Graph 2 Graph 3 Graph 4 Graph 5 Graph 6
  GPT-4o   1      1      2      2      2      1
Gemini-1.5 1      1      1      1      1      1
 Sonnet-3  1      1      2      1      1      1
Sonnet-3.5 1      0      2      1      1      2

Row labels: GPT-4o, Gemini-1.5 (Gemini-1.5 Pro), Sonnet-3, Sonnet-3.5
Fig. 2: VLMs cannot reliably count the intersections.
---------------------------------------------------------------------

Task 2: Two circles

In contrast to Task 1 where we tested VLMs on thin lines, here we
evaluate their ability to perceive interactions between larger
objects - specifically, two same-sized filled circles. This task
assesses VLMs' capability to detect (1) small gaps between circles
and (2) overlapping circles.

Images

We generate 672 images of two circles on a white canvas. The circles
vary in size, distance, and orientation:

  * Circle diameters: 1/4, 1/5, 1/6, or 1/7 of the canvas size
  * Distances between circle perimeters: -0.15 to 0.5 times the
    diameter
  * Orientations: 90°, 0°, -45°, and 45° angles with the x-axis
  * Canvas sizes: 384, 769, and 1155 pixels
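
The parameter grid above can be enumerated as a rough sketch. The
0.05-diameter step between gap values is an assumption (the text only
gives the range); it is used here because it happens to reproduce the
stated 672 images. Placing the pair symmetrically around the canvas
center is also an assumption, and the function name is hypothetical.

```python
import math
from itertools import product

DIAMETER_FRACTIONS = (1 / 4, 1 / 5, 1 / 6, 1 / 7)  # of the canvas size
ORIENTATIONS_DEG = (90, 0, -45, 45)
CANVAS_SIZES_PX = (384, 769, 1155)
# Assumed: gaps from -0.15 to 0.50 diameters in steps of 0.05, stored as
# integer hundredths to avoid floating-point drift.
GAP_HUNDREDTHS = range(-15, 55, 5)

configs = list(product(DIAMETER_FRACTIONS, GAP_HUNDREDTHS,
                       ORIENTATIONS_DEG, CANVAS_SIZES_PX))
print(len(configs))  # 4 * 14 * 4 * 3 = 672


def circle_centers(canvas_px, diameter_fraction, gap_hundredths, angle_deg):
    """Return the two circle centers for one configuration.

    The center-to-center distance is the perimeter gap plus two radii,
    i.e. diameter * (1 + gap), laid out along the given orientation.
    """
    diameter = canvas_px * diameter_fraction
    center_distance = diameter * (1 + gap_hundredths / 100)
    dx = 0.5 * center_distance * math.cos(math.radians(angle_deg))
    dy = 0.5 * center_distance * math.sin(math.radians(angle_deg))
    mid = canvas_px / 2
    return (mid - dx, mid - dy), (mid + dx, mid + dy)
```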

Fig. 3: Examples of two-circle images used in the task, showing
different configurations (panels: overlapping and touching;
non-overlapping but touching; non-overlapping and non-touching;
different orientation).

Prompts

We ask each question using two different wordings:

 1. "Are the two circles touching each other? Answer with Yes/No."
 2. "Are the two circles overlapping? Answer with Yes/No."

Groundtruth

Answers are based on the distance d between circle perimeters:

  * d < 0: Overlapping and touching
  * d = 0: Non-overlapping but touching
  * d > 0: Non-overlapping and non-touching

Random-baseline accuracy: 50%.
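
The labeling rule above maps directly onto the two prompts. A minimal
sketch, where the function name and dictionary keys are hypothetical:

```python
def ground_truth(d):
    """Map the perimeter gap d to the expected Yes/No answer per prompt."""
    return {
        "overlapping": "Yes" if d < 0 else "No",   # only d < 0 overlaps
        "touching": "Yes" if d <= 0 else "No",     # d < 0 and d = 0 both touch
    }
```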

Results

The following table shows the performance of the four models on the
task of detecting whether two circles overlap or touch.

            GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
Overlapping 71.27  93.30          88.09    88.83
Touching    74.10  92.26          80.95    94.49
Average     72.69  92.78          84.52    91.66

Qualitative samples

Are the two circles overlapping? Answer with Yes/No.

           Circle 1 Circle 2 Circle 3 Circle 4 Circle 5 Circle 6
  GPT-4o   Yes     Yes     Yes     Yes     No      Yes
Gemini-1.5 No      Yes     Yes     No      No      No
 Sonnet-3  Yes     Yes     Yes     Yes     Yes     No
Sonnet-3.5 No      No      No      No      No      No

Fig. 4: VLMs consistently fail at smaller distances. However, at a
large gap, GPT-4o remains unreliable (rightmost). Sonnet-3.5 tends to
conservatively answer "No" regardless of the actual distance between
the two circles.
---------------------------------------------------------------------

Task 3: The circled letter

Consistent with prior reports [2][3][4], we find that VLMs can
identify a primitive shape (e.g., a red circle) [2] with 100%
accuracy and can perfectly read an English word (e.g.,
Subdermatoglyphic) alone. Here, we superimpose the red circle on
every letter in the word, one at a time, and ask VLMs to identify
which letter is being circled. While the task is easy for humans, our
hypothesis is that if a VLM's vision is "blurry", it might not be
able to identify the exact letter being circled since there is only
tiny spacing between adjacent letters.

Images

We choose three strings Acknowledgement, Subdermatoglyphic, and
tHyUiKaRbNqWeOpXcZvM because they contain characters of variable
widths and heights. Furthermore, all four tested VLMs can read out
all characters in these strings when they are input to the models as
an image. While Acknowledgement is a common English word, 
Subdermatoglyphic is the longest word without repetitive letters. We
also test VLMs on the random string tHyUiKaRbNqWeOpXcZvM to estimate
how much model accuracy is due to its familiarity with the word.

For each (string, circled-letter) pair, we render a 512x512 image by
choosing among 3 red oval line-thickness levels, 2 font sizes, and 4
random positions in the canvas for a total of 24 images. That is, we
generate 360, 408, and 480 images for Acknowledgement (15 letters), 
Subdermatoglyphic (17 letters), and tHyUiKaRbNqWeOpXcZvM (20
letters), respectively. We ensure each letter to be circled fits
completely inside the oval.
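
The image counts above follow from a simple enumeration. The sketch
below only indexes the variants (the actual thickness values, font
sizes, and positions are not given here), and the names are
hypothetical.

```python
from itertools import product

STRINGS = ("Acknowledgement", "Subdermatoglyphic", "tHyUiKaRbNqWeOpXcZvM")
N_THICKNESS, N_FONT_SIZES, N_POSITIONS = 3, 2, 4  # 3 * 2 * 4 = 24 variants


def image_specs(string):
    """List one (string, letter index, thickness, font, position) per image."""
    variants = list(product(range(N_THICKNESS), range(N_FONT_SIZES),
                            range(N_POSITIONS)))
    return [(string, i, t, f, p)
            for i in range(len(string))
            for (t, f, p) in variants]


counts = {s: len(image_specs(s)) for s in STRINGS}
print(counts)
# {'Acknowledgement': 360, 'Subdermatoglyphic': 408, 'tHyUiKaRbNqWeOpXcZvM': 480}
```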

Fig. 5: Examples of circled letter images used in the task, showing
different words and circled letters (panels: Acknowledgement with 'n'
circled; tHyUiKaRbNqWeOpXcZvM with 't' circled; tHyUiKaRbNqWeOpXcZvM
with 'X' circled; Subdermatoglyphic with 'u' circled).

Prompts

We ask each question using two different wordings:

 1. "Which letter is being circled?"
 2. "Which character is being highlighted with a red oval?"

Groundtruth

Letters need to match predicted letters exactly (case-insensitive).

Results

The following table shows the performance of the four models on the
task of identifying the circled letter.

        Word         GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
Acknowledgement      69.03  97.50          82.64    91.11
Subdermatoglyphic    63.60  91.05          71.45    94.49
tHyUiKaRbNqWeOpXcZvM 77.92  89.90          65.94    82.08
Average              70.18  92.81          73.34    89.22

Qualitative samples

Which letter is being circled?

            Circled   Circled   Circled   Circled   Circled   Circled
           Letter 1  Letter 2  Letter 3  Letter 4  Letter 5  Letter 6
  GPT-4o   o        e        t        o        o        z
Gemini-1.5 w        m        n        p        o        v
 Sonnet-3  o        e        e        y        a        t
Sonnet-3.5 l        e        t        h        t        m

Fig. 6: Identifying the letter being circled is non-trivial for VLMs
across both English words (Acknowledgement & Subdermatoglyphic) and a
random string (tHyUiKaRbNqWeOpXcZvM). When making mistakes, VLMs tend
to predict letters adjacent to the circled one.
---------------------------------------------------------------------

Task 4: Counting overlapping shapes

Aligned with prior research [4], we also find that VLMs are able to
count disjoint circles. Yet, here, we test VLMs on counting circles
that intersect like in the Olympic logo, a common cognitive
development exercise for preschoolers [5][6]. Our hypothesis is that
a "blurry" vision may not see the intersection between two circles
clearly and is therefore unable to trace circles and count them. To
check that our findings generalize, we repeat the experiment with
pentagons as well.

Images

In an image of size CxC, where C ∈ {384, 769, 1155}px, we draw N ∈
{5, 6, 7, 8, 9} overlapping, same-sized circles arranged in two rows
like the Olympic logo. A circle diameter φ ∈ {C/5, C/10}. We repeat
the images with two different line thicknesses for rendering circles.
This procedure renders 3 resolutions x 5 shape counts x 2 diameters x
2 line thicknesses = 60 images. We repeat for pentagons in addition
to circles, resulting in 60 x 2 shapes = 120 images in total. For
pentagons, their side length d ∈ {C/5, C/10}.
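
To make the two-row arrangement concrete, here is a hypothetical
layout sketch. The spacing (row neighbours 1.1 diameters apart, the
bottom row shifted by half a step and lowered by half a diameter) is
an assumption chosen to mimic the Olympic logo, not the paper's exact
geometry.

```python
import math


def olympic_like_centers(n, diameter):
    """Return n circle centers in two staggered rows.

    Assumed layout: the top row holds ceil(n / 2) circles that do not
    overlap each other; each bottom-row circle sits between two top-row
    positions and overlaps the top-row circles next to it.
    """
    step = 1.1 * diameter
    top = math.ceil(n / 2)
    centers = [(i * step, 0.0) for i in range(top)]
    centers += [((i + 0.5) * step, diameter / 2) for i in range(n - top)]
    return centers
```

Two circles of equal diameter overlap when their centers are closer
than one diameter; here a bottom-row circle lies about 0.74 diameters
from the top-row circle to its left, so their outlines intersect.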

Fig. 7: Examples of Olympic-like logo images used in the task,
showing different numbers of shapes, sizes, and colors (panels: 5
circles, small diameter; 6 circles, large diameter; 8 colored
circles; 9 colored pentagons).

Prompts

We ask each question using two different wordings:

 1. "How many {shapes} are in the image? Answer with only the number
    in numerical format."
 2. "Count the {shapes} in the image. Answer with a number in curly
    brackets e.g. {3}."

Where {shapes} is either "circles" or "pentagons" depending on the
image.
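
Both prompt formats can be scored with a small parser. This is a
sketch of one possible answer-extraction step, not the authors'
evaluation code; the function name and the choice to take the last
bracketed number are assumptions.

```python
import re


def parse_count(response):
    """Extract the predicted count from a model response.

    Prefers the last number in curly brackets (e.g. "{3}"); otherwise
    accepts a response that is only a bare integer. Returns None when
    neither form is present.
    """
    braces = re.findall(r"\{(\d+)\}", response)
    if braces:
        return int(braces[-1])
    bare = re.fullmatch(r"\s*(\d+)\s*\.?\s*", response)
    return int(bare.group(1)) if bare else None
```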

Groundtruth

Answers are ∈ {5, 6, 7, 8, 9} (random-baseline accuracy: 20%).

Results

The following table shows the performance of the four models on the
task of counting overlapping shapes.

          GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
Circles   42.50  20.83          31.66    44.16
Pentagons 19.16  9.16           11.66    75.83

Qualitative samples

How many circles are in the image? Answer with only the number in
numerical format.

           Circle 1 Circle 2 Circle 3 Circle 4 Circle 5 Circle 6
  GPT-4o   5       6       5       10      10      5
Gemini-1.5 5       5       5       5       5       5
 Sonnet-3  5       5       5       10      10      5
Sonnet-3.5 5       6       6       10      9       7

Fig. 8: Gemini-1.5 Pro often predicts "5" circles.
---------------------------------------------------------------------

Task 5: Counting the nested squares

Motivated by the findings that VLMs struggle in counting the
intersected circles (Task 4), here, we arrange the shapes differently
so that their edges do not intersect. That is, each shape is nested
entirely inside another. For completeness, we test squares in this
task.

Images

In a canvas of size CxC, we render N ∈ {2, 3, 4, 5} nested squares.
The outermost square is rendered first using a random edge length d
and a line thickness ∈ {2, 3, 4}px. The remaining N-1 squares are
drawn using a size reduction factor, 0.75 x d and placed at a random
coordinate that ensures they do not touch outer squares. For each
line thickness, we generate 10 images (where squares have different,
random locations) to create 3 x 10 = 30 images. Repeating the process
for all N values results in 4 x 30 = 120 images.
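
The placement procedure can be sketched as follows. This is an
illustration under stated assumptions: each square's side is taken to
be 0.75 times the side of the square enclosing it, the range for the
outer edge length d and the `margin` that keeps edges from touching
are invented for the example, and the function name is hypothetical.

```python
import random


def nested_squares(n, canvas, rng, shrink=0.75, margin=4):
    """Return n squares as (x, y, side), each strictly inside the previous."""
    side = rng.uniform(0.5, 0.9) * canvas   # assumed range for edge length d
    x = rng.uniform(0, canvas - side)
    y = rng.uniform(0, canvas - side)
    squares = [(x, y, side)]
    for _ in range(n - 1):
        inner = side * shrink
        # Free room for the inner square's corner once `margin` pixels
        # are kept clear of every edge of the enclosing square.
        slack = side - inner - 2 * margin
        if slack <= 0:
            raise ValueError("canvas too small for this many squares")
        x += margin + rng.uniform(0, slack)
        y += margin + rng.uniform(0, slack)
        side = inner
        squares.append((x, y, side))
    return squares
```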

Fig. 9: Examples of nested square images used in the task, showing
different numbers of squares (panels: 2, 3, 4, and 5 nested squares).

Prompts

We ask each question using the following wording:

 1. "Count the total number of squares in the image."

Groundtruth

Answers are ∈ {2, 3, 4, 5} (random-baseline accuracy: 25%).

Results

The following table shows the performance of the four models on the
task of counting nested squares.

        GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
Squares 48.33  80.00          55.00    87.50

Qualitative samples

Count total number of squares in the image.

            Nested    Nested    Nested    Nested    Nested    Nested
           Squares 1 Squares 2 Squares 3 Squares 4 Squares 5 Squares 6
  GPT-4o   5        5        5        5        6        6
Gemini-1.5 5        5        5        5        5        5
 Sonnet-3  5        5        5        5        4        4
Sonnet-3.5 4        4        4        4        4        4

Fig. 10: Only Sonnet-3.5 can count the squares in a majority of the
images.
---------------------------------------------------------------------

Task 6: Counting the rows and columns of a grid

The results from prior tasks show VLMs cannot always count shapes
that are overlapping (Task 4) or nested (Task 5). What about adjacent
shapes? Here, we tile up shapes (specifically, squares) into a grid
and challenge VLMs to count, a task that is supposedly simple for VLMs
given their remarkable performance (>= 90% accuracy) on DocVQA, which
includes many questions with tables. To simplify the task, we ask
models to count the number of rows and columns in a given table.

Images

A grid may have NxN, NxN', or N'xN cells, where N ∈ {3, 4, 5, 6, 7, 8,
9} and N' = N + 1. Each grid is rendered with two different
line-thicknesses on a canvas of size CxC where C ∈ {500, 1250, 2000}px.
Besides empty grids, we also replicate the procedure to make grids
contain text (which is more common in real-world tables) where each
cell contains a single random word. Two versions combined have 2x222
= 444 images.

Fig. 11: Examples of grid images used in the task, showing
text-filled and empty grids with various dimensions (panels: text
grid 3x3; text grid 3x4; empty grid 4x4; empty grid 4x5).

Prompts

We ask each question using two different wordings:

 1. "Count the number of rows and columns and answer with numbers in
    curly brackets. For example, rows={5} columns={6}"
 2. "How many rows and columns are in the table? Answer with only the
    numbers in a pair (row, column), e.g., (5,6)"

Groundtruth

Answers include both the number of rows and columns. An answer is
correct when both column and row counts are correctly predicted.
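
The both-counts-must-match rule can be sketched for the two answer
formats requested by the prompts. The regular expressions and
function names below are assumptions for illustration, not the
authors' evaluation code.

```python
import re

# One pattern per prompt format: rows={5} columns={6} and (5,6).
PATTERNS = (
    re.compile(r"rows\s*=\s*\{(\d+)\}\s*,?\s*columns\s*=\s*\{(\d+)\}",
               re.IGNORECASE),
    re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)"),
)


def parse_grid_answer(response):
    """Return (rows, columns) from a response, or None if unparseable."""
    for pattern in PATTERNS:
        match = pattern.search(response)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def is_correct(response, rows, columns):
    """An answer counts only when both rows and columns are right."""
    return parse_grid_answer(response) == (rows, columns)
```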

Results

The following table shows the performance of the four models on the
task of counting rows and columns in grids.

Grid type GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
Blank     26.13  25.75          25.00    59.84
Text      53.03  45.83          47.34    88.68
Average   39.58  35.79          36.17    74.26

Qualitative samples

Count the number of rows and columns and answer with numbers in curly
brackets. For example, rows={5} columns={6}

           Grid 1 Grid 2 Grid 3 Grid 4 Grid 5 Grid 6
  GPT-4o   4x4   6x6   7x7   6x6   6x6   6x6
Gemini-1.5 5x5   6x6   7x7   10x10 5x6   10x10
 Sonnet-3  5x5   7x8   6x6   9x9   6x6   9x12
Sonnet-3.5 4x5   6x7   7x7   8x7   5x6   8x8

Fig. 12: Examples from the benchmark show that models consistently
fail at counting rows and columns of blank grids.

How many rows and columns are in the table? Answer with only the
numbers in a pair (row, column), e.g., (5,6).

           Grid 1 Grid 2 Grid 3 Grid 4 Grid 5 Grid 6
  GPT-4o   4x4   4x5   5x4   5x6   6x8   7x8
Gemini-1.5 4x4   4x5   5x4   5x6   6x8   7x8
 Sonnet-3  4x4   5x5   5x4   6x6   7x7   8x7
Sonnet-3.5 4x4   4x5   5x4   5x6   6x7   7x7

Fig. 13: When text is included in the cells of the grid, the
performance of all VLMs improves, especially Sonnet-3.5.
---------------------------------------------------------------------

Task 7: Following single-colored paths

It is important for VLMs to be able to follow paths in order to read
maps or charts, interpret graphs, and understand user notations
(e.g., arrows) in input images. To assess path-following capability,
this task asks models to count the unique-color paths between two
given stations in a simplified subway map. This is another
easy-to-humans task that challenges VLMs significantly.

Images

We create each subway map on an image of size CxC, where C ∈ {512,
1024}px. We write 4 station names (A, B, C, D) at 4 fixed
coordinates. We divide the canvas into an invisible grid of 18x18
cells and initialize 3 path-starting points C/18px away from each
station. We draw a path, using the depth-first search algorithm
starting from a random station and a random starting point, where a
valid move is one cell in any direction: north, south, east, or west.
We repeat the process so that each station has exactly N ∈ {1, 2, 3}
outgoing paths, for a total of 180 maps.
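
A randomized depth-first search over the 18x18 grid can be sketched
as below. How the generator picks a destination and keeps paths from
colliding is not specified here, so the explicit `goal` cell and the
`blocked` set are assumptions, and the function name is hypothetical.

```python
import random

GRID = 18
MOVES = ((0, -1), (0, 1), (1, 0), (-1, 0))  # north, south, east, west


def dfs_path(start, goal, blocked, rng):
    """Find a path of grid cells from start to goal, or return None.

    Uses an explicit stack; a cell is marked visited when it is popped,
    and neighbours are tried in random order so each map differs.
    """
    stack = [(start, [start])]
    seen = set()
    while stack:
        cell, path = stack.pop()
        if cell in seen:
            continue
        seen.add(cell)
        if cell == goal:
            return path
        moves = list(MOVES)
        rng.shuffle(moves)
        for dx, dy in moves:
            nxt = (cell[0] + dx, cell[1] + dy)
            inside = 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID
            if inside and nxt not in seen and nxt not in blocked:
                stack.append((nxt, path + [nxt]))
    return None
```

Passing the cells of already-drawn paths as `blocked` is one simple
way to keep a new path from running over an existing one.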

Fig. 14: Examples of subway map images used in the task, showing
different numbers of paths and variations in path thickness (panels:
1 path, 10px width; 2 paths, 20px width; 2 paths, 20px width; 3
paths, 10px width).

Prompts

We ask each question using two different wordings:

 1. "How many single-colored paths go from A to C? Answer with a
    number in curly brackets, e.g., {3}"
 2. "Count the one-colored routes that go from A to C. Answer with a
    number in curly brackets, e.g., {3}."

Groundtruth

Answers are ∈ {0, 1, 2, 3} (random-baseline accuracy: 25%).

Results

The following table shows the performance of the four models on the
task of counting single-colored paths between stations.

 Paths  GPT-4o Gemini-1.5 Pro Sonnet-3 Sonnet-3.5
1       67.50  85.41          23.75    95.00
2       44.37  28.75          37.18    56.25
3       36.71  25.78          15.42    25.39
Average 45.89  40.01          23.78    50.18

Qualitative samples

How many single-color paths go from A to D? Answer with a number in
curly brackets e.g. {3}

            Subway    Subway    Subway    Subway    Subway    Subway
             Map 1     Map 2     Map 3     Map 4     Map 5     Map 6
  GPT-4o   1        1        2        3        2        1
Gemini-1.5 2        2        4        1        1        4
 Sonnet-3  2        1        3        2        4        4
Sonnet-3.5 1        1        3        3        2        3

Fig. 15: Some VLMs (Gemini-1.5, Sonnet-3) surprisingly fail in even
extremely easy cases (leftmost). As the number of paths exiting each
station increases, VLMs tend to perform worse.
