[HN Gopher] Should LLMs just treat text content as an image?
___________________________________________________________________
Should LLMs just treat text content as an image?
Author : ingve
Score : 131 points
Date : 2025-10-21 06:10 UTC (6 days ago)
(HTM) web link (www.seangoedecke.com)
(TXT) w3m dump (www.seangoedecke.com)
| onesandofgrain wrote:
| A picture is worth a thousand words
| hshdhdhehd wrote:
| And a picture of a thousand words is worth a thousand words.
| genghisjahn wrote:
| I type at .08 pictures per minute.
| favoboa wrote:
| A picture of a thousand words with some of them colored,
| bolded, underlined, etc is worth more than a thousand words
| Havoc wrote:
| Seems wildly counterintuitive to me frankly.
|
| Even if true, though, I'm not sure what we'd do with it. The
| bulk of knowledge available on the internet is text, aside from
| maybe YouTube, so I guess it could work for world-model-type
| things? Understanding physical interactions of objects, etc.
| hshdhdhehd wrote:
| Trivial to convert text to images to process. But counter-
| intuitive to me too.
| bilsbie wrote:
| All text is technically converted to images before we see it.
| thfuran wrote:
| Only if you see it instead of hearing it or touching it.
| ToJans wrote:
| A series of tokens is one-dimensional (a sequence). An image is
| 2-dimensional. What about 3D/4D/... representation (until we end
| up with an LLM-dimensional solution ofc).
| dvt wrote:
| This isn't exactly true, as tokens live in the embedding space,
| which is n-dimensional, like 256 or 512 or whatever (so you
| might see one word, but it's actually an array of a bunch of
| numbers). With that said, I think it's pretty intuitive that
| continuous tokens are more efficient than discrete ones, simply
| because the LLM itself is basically a continuous function (with
| coefficients/parameters ∈ R).
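| A rough sketch of that discrete-vs-continuous point (the sizes
| and the projection below are tiny illustrative stand-ins, not
| any real model's internals):
|
|   import numpy as np
|
|   vocab_size, d_model = 1_000, 64        # tiny stand-in sizes
|   rng = np.random.default_rng(0)
|   embedding_table = rng.standard_normal((vocab_size, d_model))
|
|   # A discrete text token is an integer id: it can only ever land
|   # on one of the fixed points stored in the lookup table.
|   token_id = 123
|   text_embedding = embedding_table[token_id]        # shape (64,)
|
|   # A "continuous" image token is a projected patch: it can land
|   # anywhere in the embedding space, varying smoothly with pixels.
|   patch = rng.random(16 * 16 * 3)                    # flattened 16x16 RGB patch
|   projection = rng.standard_normal((16 * 16 * 3, d_model))
|   image_embedding = patch @ projection               # shape (64,)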
| wongarsu wrote:
| We call an embedding-space n-dimensional, but in this context
| I would consider it 1-dimensional, as in it's a 1d vector of
| n values. The terminology just sucks. If we described images
| the same way we describe embeddings, a 2-megapixel image would
| have to be called 2-million-dimensional (or 8-million-
| dimensional if we consider RGBA to be four separate values).
|
| I would also argue tokens are outside the embedding space,
| and a large part of the magic of LLMs (and many other neural
| network types) is the ability to map sequences of rather
| crude inputs (tokens) into a more meaningful embedding space,
| and then map from a meaningful embedding space back to tokens
| we humans understand
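| To put numbers on the terminology point, a toy comparison
| (sizes are only for illustration):
|
|   import numpy as np
|
|   embedding = np.zeros(768)              # one "768-dimensional" embedding
|   image = np.zeros((1000, 2000, 4))      # ~2 megapixel RGBA image
|
|   # Described the way embeddings usually are, the image would be
|   # an "8-million-dimensional" vector:
|   print(embedding.shape, image.reshape(-1).shape)   # (768,) (8000000,)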
| LudwigNagasena wrote:
| Those are just dimensions of different things, and it's
| usually pretty clear from context what is meant. Color
| space has 3 dimensions; or 4 with transparency; an image
| pixel has 6 dimensions (xy+RGBA) if we take its color into
| account, but only 2 spatial dimensions; if you think of an
| image as a function that maps continuous xy coordinates
| into continuous rgba coordinates, then you have an
| infinitely dimensional function space; embeddings have
| their own dimensions, but none of them relate to their
| position in the text at hand, which is why text in this context
| is said to be 1D and an image is said to be 2D.
| bonsai_spool wrote:
| This doesn't cite the very significant example of DeepVariant
| (and as of 10/16/25 DeepSomatic) which convert genomic data to
| images in order to find DNA mutations. This has been done since
| the late 2010s
|
| https://google.github.io/deepvariant/posts/2020-02-20-lookin...
| TZubiri wrote:
| Using these definitions, mapping datapoints in a graph is also
| converting data into an image in order to analyze it.
|
| Tabulating data into tables similarly converts it into
| something visual so that mistakes or outliers can be spotted.
| bonsai_spool wrote:
| > Using these definitions
|
| There's a transformation of the data that is analogous to how
| a human would use the data to find a variant. It's closer to
| inspecting a stack of widgets to find a defective widget than
| it is listing numbers in a table
| LysPJ wrote:
| Andrej Karpathy made an interesting comment on the same paper:
| https://x.com/karpathy/status/1980397031542989305
| onionisafruit wrote:
| > It makes two characters that look identical to the eye look
| as two completely different tokens internally in the network. A
| smiling emoji looks like a weird token, not an... actual
| smiling face, pixels and all
|
| This goes against my limited understanding of how LLMs work --
| and computers generally for that matter. Isn't that rendering
| of a smiling emoji still just a series of bits that need to be
| interpreted as a smiley face? The similar looking characters
| point makes more sense to me though assuming it's something
| along the lines of recognizing that "S" and "$" are roughly the
| same thing except for the line down the middle. Still that
| seems like something that doesn't come up much and is probably
| covered by observations made in the training corpus.
|
| All that said, Karpathy knows way more than I will ever know on
| the subject, and I'm only posting my uninformed take here in
| hopes somebody will correct me in a way I understand.
| jncfhnb wrote:
| You're reading it backwards. He is not praising that
| behavior, he is complaining about it. He is saying that bots
| _should_ parse smiling face emojis as smiling face emojis, but
| they don't do that currently because as text they get passed as
| gross Unicode that has a lot of ambiguity and just happens to
| ultimately get rendered as a face to end users.
| ares623 wrote:
| Wouldn't the training or whatever make that unicode
| sequence effectively a smiley face?
| scotty79 wrote:
| Don't ask ChatGPT about seahorse emoji.
| tensor wrote:
| Don't ask humans either, apparently.
| jncfhnb wrote:
| Yes, but the same face gets represented by many unique
| strings. Strings which may or may not be tokenized into
| a single clean "smiley face" token.
| themoxon wrote:
| There's a new paper from ICCV which basically tries to render
| every modality as images:
| https://openaccess.thecvf.com/content/ICCV2025/papers/Hudson...
| pcwelder wrote:
| I can guarantee that the OSR can't read this sentense correstlu.
| geysersam wrote:
| Really? How so?
| moduspol wrote:
| Looks like he's using atypical "c" characters.
| syntaxing wrote:
| What's correct though? Even as a human, I read that
| "correctly". Using weird representations of C doesn't change
| the word?
| metalliqaz wrote:
| Yeah OCR would be much more likely to read that sentence the
| way a human would.
| LudwigNagasena wrote:
| I would even say that OCR can read the sentense correstlu,
| while a tokenizer can't.
| kgeist wrote:
| Qwen3 8b perfectly understood it after 14 seconds of
| thinking.
| bitdivision wrote:
| A lot of Cyrillic characters:
| https://apps.timwhitlock.info/unicode/inspect?s=I+%CF%B2%D0%...
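| The substitutions are easy to see with the standard library:
| characters that render almost identically have different code
| points and therefore tokenize differently (the examples below
| are common Cyrillic lookalikes, not necessarily the exact ones
| used above):
|
|   import unicodedata
|
|   for ch in ["c", "\u0441", "a", "\u0430", "o", "\u043e"]:
|       print(repr(ch), f"U+{ord(ch):04X}", unicodedata.name(ch))
|
|   # 'c' U+0063 LATIN SMALL LETTER C
|   # 'с' U+0441 CYRILLIC SMALL LETTER ES
|   # ... visually near-identical, but entirely different to a tokenizer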
| vindex10 wrote:
| reminds me of the difference between fastText and word2vec.
|
| fastText can handle words it hasn't seen before by composing
| them from character n-grams; word2vec can learn a better
| meaning for whole words, but then misses out on the "unknown
| words".
|
| image tokens are the "text2vec" here, while text tokens are a
| proxy for building a text embedding of even never-before-seen
| texts.
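| Roughly, the fastText trick looks like this (the boundary
| markers and the 3-6 n-gram range follow the usual convention;
| this is a sketch, not the library's code):
|
|   def char_ngrams(word, n_min=3, n_max=6):
|       """fastText-style subwords: character n-grams with boundary markers."""
|       padded = f"<{word}>"
|       return [padded[i:i + n]
|               for n in range(n_min, n_max + 1)
|               for i in range(len(padded) - n + 1)]
|
|   # An unseen word still decomposes into n-grams seen during training,
|   # so its vector can be built by summing the n-gram vectors.
|   print(char_ngrams("untokenizable")[:5])   # ['<un', 'unt', 'nto', 'tok', 'oke']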
| mohsen1 wrote:
| My understanding is that text tokens are too rigid. The way we
| read is not to process each character (or token, for an LLM)
| individually, but to see a word, or sometimes a collection of
| familiar words, and make sense of the writing. The concept we
| take away from written text is really what we read, not the
| exact letters or words. This is why we can easily read written
| text with typos: it is similar enough. By letting LLMs be less
| hung up on exact tokens and "skim" through text, we could make
| them more efficient, much like how humans read efficiently.
| bonsai_spool wrote:
| You miss that we already have 'context' when we begin reading
| something, and that probably enables our fast reading. Maybe
| there's a way to give that background setting information to an
| LLM, but then we could also just have it read the entire input
| stream.
| lupire wrote:
| Google AI mode:
|
| User: cna uto nfeir waht tihs si msieplnig?
|
| Google: Here are top web results for exploring this topic [few
| websites containing or discussing gibberish in general]
|
| User: what was my previous question?
|
| Google: Your previous question was "cna uto nfeir waht tihs si
| msieplnig?". I was able to understand this through a
| combination of contextual understanding and sub-word
| tokenization, which allowed me to infer the intended meaning
| despite the numerous typos. The corrected question is: "Can you
| infer what this is misspelling?".
| Miraltar wrote:
| I couldn't read it
| valdork59 wrote:
| GPT-5 thinking doesn't struggle with this. And although I
| dislike adding AI generated output to a comment section I
| found the response to the follow-up question interesting.
|
| q: c u i t m o t s?
| a: "Can you infer the meaning of this sentence?"
| nacozarina wrote:
| the enshittifiers simply haven't yet weighed image-processing
| fees against potential token charges; once they have, your cost
| advantage goes bye-bye
| ghoul2 wrote:
| But does this not miss the "context" that the embeddings of the
| text tokens carry? An LLM embedding of a text token has a
| compressed version of the entire preceding set of tokens that
| came before it in the context. While the image embeddings are
| just representations of pixel values.
|
| Sort of at the level of word2vec, where the representation of
| "flies" in "fruit flies like a banana" vs "time flies like an
| arrow" would be the same.
| wongarsu wrote:
| Look long enough at literature on any machine learning task, and
| someone invariably gets the idea to turn the data into an image
| and do machine learning on that. Sometimes it works out (turning
| binaries into images and doing malware detection with a CNN
| surprisingly works), usually it doesn't. Just like in this
| example the images usually end up as a kludge to fix some
| deficiency in the prevalent input encoding.
|
| I can certainly believe that images bring certain advantages over
| text for LLMs: the image representation does contain useful
| information that we as humans use (like better information
| hierarchies encoded in text size, boldness, color, saturation and
| position, not just n levels of markdown headings), letter shapes
| are already optimized for this kind of encoding, and continuous
| tokens seem to bring some advantages over discrete ones. But
| none of these advantages needs the roundtrip via images; they
| merely point to how crude the state of the art of text
| tokenization is.
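| (For the curious, the binaries-to-images trick mentioned above
| is usually as crude as reinterpreting the raw bytes as a
| grayscale bitmap; a minimal sketch, with a made-up input path:)
|
|   import numpy as np
|
|   def bytes_to_image(path, width=256):
|       """Reinterpret a file's raw bytes as a 2D grayscale array for a CNN."""
|       data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
|       height = len(data) // width
|       return data[:height * width].reshape(height, width)
|
|   # img = bytes_to_image("/usr/bin/ls")   # any binary works as input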
| pegasus wrote:
| Exactly. The example the article gives of reducing resolution as
| a form of compression highlights the limitations of the visual-
| only proposal. Blurring text is a poor form of compression,
| preserving at most information about paragraph sizes.
| Summarizing early paragraphs (as context compression does in
| coding agents) would be much more efficient.
| vanderZwan wrote:
| Doesn't this more or less boil down to OCR scans of books
| having more privileged information than a plaintext file? In
| which case a roundtrip won't add anything?
|
| [0]
| https://web.archive.org/web/20140402025221/http://m.nautil.u...
| adzm wrote:
| A great example of this is changing music into an image and
| using that to train and generate new images that get converted
| back into music. It was surprisingly successful. I think this
| approach is still used by the current music generators.
| bjourne wrote:
| You are talking about piano roll notation, I think. While
| it's 2d data, it's not quite the same as actual image data.
| E.g., 2d conv and pooling operations are useless for music.
| The patterns and dependencies are too subtle to be captured
| by spatial filters.
| yberreby wrote:
| I've seen this approach applied to spectrograms.
| Convolutions do make enough sense there.
| adzm wrote:
| I am talking about using spectrograms (Fourier transform
| into frequency domain then plotted over time) that results
| in a 2d image of the song, which is then used to train
| something like stable diffusion (and actually using stable
| diffusion by some) to be able to generate these, which is
| then converted back into audio. Riffusion used this
| approach.
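| A bare-bones version of that round trip, assuming scipy is
| available (real pipelines differ, and a generated spectrogram
| needs a phase estimate such as Griffin-Lim; here the original
| phase is just reused):
|
|   import numpy as np
|   from scipy import signal
|
|   sr = 22050
|   t = np.arange(sr * 2) / sr
|   audio = np.sin(2 * np.pi * 440 * t)      # stand-in for a music clip
|
|   # Forward: the STFT magnitude is the 2D "image" a diffusion model sees
|   f, frames, Z = signal.stft(audio, fs=sr, nperseg=1024)
|   spec_image = np.abs(Z)
|
|   # Backward: invert the (possibly generated) magnitude back to audio
|   _, reconstructed = signal.istft(spec_image * np.exp(1j * np.angle(Z)),
|                                   fs=sr, nperseg=1024)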
| arbot360 wrote:
| Another great example of this working is the genomic variant
| calling models from Deepmind "DeepVariant". They use the
| "alignment pile-up" images which are also used by humans to
| debug genomic alignments, with some additional channels to
| further feature engineer the CNN.
| rangestransform wrote:
| This reminds me of how trajectory prediction networks for
| autonomous driving used to use a CNN to encode scene context
| (from map and object detection rasters) until vectornet showed
| up
| aitchnyu wrote:
| The amount of video/imagery to make a million tokens vs the
| amount of text to do the same is a surprisingly low ratio. Did
| they have the same intuition?
| metalliqaz wrote:
| Future headline: "The unreasonable effectiveness of text
| encoding"
| leemcalilly wrote:
| and reading (aka "ocr") is the fastest way for the brain to
| process language.
| qiine wrote:
| or maybe 3D objects, since that's closer to what real life is
| and what the brain shaped itself around?
| mannykannot wrote:
| Language was spoken long before it was written (or so it seems).
| This article almost suggests that sound might be a superior input
| medium over either digital text or images.
| falcor84 wrote:
| I've never worked in that area, but recall reading about how
| images of spectrograms are often superior inputs to neural nets
| in comparison to the raw audio data.
| rhdunn wrote:
| Speech to text and text to speech typically operate on the
| audio spectrogram, specifically the Mel-scale spectrum. This
| is a filtered spectrogram that decreases the noise in the
| data. Thus, they are not working on the images of these
| spectra but the computed values -- each spectral slice will
| be a matrix row or column of values.
|
| The theory is that vowels and voiced consonants have a
| fundamental frequency and 5-6 frequencies above that. For
| vowels the first two frequencies are enough to identify the
| vowel. For rhotic vowels (r-sounding vowels like American
| stARt) the 3rd frequency is important.
|
| By converting the audio to the Mel-scale spectrum, it is
| easier to detect these features. Text to speech using the
| Mel-spectrum works by modelling and generating these values,
| which is often easier as the number of parameters is lower
| and the data is easier to work with [1].
|
| [1] There are other approaches to text to speech such as
| overlapping short audio segments.
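| For reference, a typical front end looks something like this,
| assuming librosa is installed ("speech.wav" and the parameter
| values are just common placeholders, not anything prescribed):
|
|   import numpy as np
|   import librosa
|
|   y, sr = librosa.load("speech.wav", sr=16000)   # placeholder input file
|
|   # 80 Mel bands per frame is a common choice for ASR/TTS front ends
|   mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
|                                        hop_length=256, n_mels=80)
|   log_mel = np.log(mel + 1e-6)      # each column is one spectral slice
|   print(log_mel.shape)              # (80, n_frames)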
| HarHarVeryFunny wrote:
| The Mel-scale spectrogram doesn't do anything specific to
| reduce noise compared to an FFT. It's just preferred for
| traditional speech recognition because it uses a non-linear
| frequency scale that better matches human perception.
|
| Speech recognition is based around recognizing the
| frequency correlates of speech generation/articulation,
| mainly the frequency bands that are attenuated by vocal
| tract resonances as articulation changes the shape of the
| vocal tract.
|
| The fundamental frequency, f0, of someone's voice is not
| important to speech recognition - that is just the
| frequency with which their vocal cords are opening and
| closing, corresponding to a high pitched voice (e.g.
| typical female or child) vs a low pitched one (male).
|
| What happens during speech production is that due to the
| complex waveform generated by the asymmetrically timed
| opening and closing of the vocal cords (slow open, fast
| close), not only is the fundamental frequency, f0,
| generated, but also harmonics of it - 2xf0, 3xf0, 4xf0,
| etc. The resonances of the vocal tract then attenuate
| certain frequency ranges within this spectrum of
| frequencies, and it's these changing attenuated frequency
| ranges, aka formants, that effectively carry the
| articulation/speech information.
|
| The frequency ranges of the formants also vary according to
| the length of the vocal tract, which varies between
| individuals, so it's not specific frequencies such as f0 or
| its harmonics that carry the speech information, but rather
| changing patterns of attenuation (formants).
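| A toy source-filter sketch of that description (the f0 and the
| two formant centre frequencies below are assumed, roughly
| vowel-like values):
|
|   import numpy as np
|
|   sr = 16000
|   f0 = 120.0                        # pitch; the speech information isn't here
|   t = np.arange(sr) / sr
|
|   # Source: f0 plus its harmonics 2*f0, 3*f0, ... from the glottal pulses
|   source = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 40))
|
|   # Filter: the vocal-tract response shapes the harmonic spectrum; the
|   # resulting spectral envelope around the formants carries articulation
|   spectrum = np.fft.rfft(source)
|   freqs = np.fft.rfftfreq(len(source), 1 / sr)
|   envelope = (np.exp(-((freqs - 700) / 130) ** 2) +    # around F1 (assumed)
|               np.exp(-((freqs - 1200) / 180) ** 2))    # around F2 (assumed)
|   vowel_like = np.fft.irfft(spectrum * envelope)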
| scotty79 wrote:
| Raw audio data is unnatural. The ear doesn't capture pressure
| samples thousands of times per second; it captures frequencies
| and the sonic energy they carry. The result of running a
| spectrogram on the raw data is what actually comes out, raw, of
| our biological sensor.
| DonHopkins wrote:
| Wouldn't it be ironic if Comic Sans turned out to be the most
| efficient font for LLM OCR understanding?
| skywhopper wrote:
| There's some poor logic in this writeup. Yes, images can contain
| more information than words, but the extra information an image
| of a word conveys is usually not relevant to the intent of the
| communication, at least not for the purposes assumed in this
| writeup. I.e., pre-converting the text you would have typed into
| ChatGPT and uploading that as an image instead will not better
| convey the meaning and intent behind your words.
|
| If it gives better results (something for which no evidence is
| presented), that'd be interesting, but it wouldn't be because
| of the larger data size of the uploaded image vs the text.
| sojuz151 wrote:
| This means that current tokenisers are bad, and something better
| is needed if text rendering + image input is a better tokeniser.
| rebeccaskinner wrote:
| Although this isn't directly related to the idea in the article,
| I'm reminded that one of the most effective hacks I've found for
| working with ChatGPT has been to attach screen shots of files
| rather than the files themselves. I've noticed the model will
| almost always pay attention to an image and pull relevant data
| out of it, but it requires a lot of detailed prompting to get it
| to reliably pay attention to text and pdf attachments instead of
| just hallucinating their contents.
| amluto wrote:
| Hmm. Yesterday I stuck a >100 page PDF into a Claude Project
| and asked Claude to reference a table in the middle of it (I
| gave page numbers) to generate machine readable text. I watched
| with some bafflement as Claude convinced itself that the PDF
| wasn't a PDF, but then it managed to recover all on its own and
| generated 100% correct output. (Well, 100% correct in terms of
| reading the PDF - it did get a bit confused a few times
| following my instructions.)
| sxndmxn wrote:
| PDF is a trash format
| oceanplexian wrote:
| This is probably because your provider is generating embeddings
| over the document to save money, and then simply running a
| vector search across it instead of fitting it all in context.
| fathermarz wrote:
| I have been seeing this a lot. Does this add new meaning to "a
| picture is worth 1000 words"?
|
| I think I saw a benchmark on HN for formats of text files given
| to an LLM to see which would give better results. I wish that
| study had included images in the comparison.
| TZubiri wrote:
| Maybe, but it wouldn't be a language model.
| Bolwin wrote:
| Does anyone know if VLMs suffer more from quantization? One
| thing I've noticed is that practically every improvement in
| LLMs is already half there in quantization.
| mingtianzhang wrote:
| We actually don't need OCR: https://pageindex.ai/blog/do-we-need-
| ocr
| cubefox wrote:
| I don't understand this paragraph:
|
| > The first explanation is that text tokens are discrete while
| image tokens are continuous. Each model has a finite number of
| text tokens - say, around 50,000. Each of those tokens
| corresponds to an embedding of, say, 1000 floating-point numbers.
| Text tokens thus only occupy a scattering of single points in the
| space of all possible embeddings. By contrast, the embedding of
| an image token can be any sequence of those 1000 numbers. So an
| image
| token can be far more expressive than a series of text tokens.
|
| Does someone understand the difference he is pointing at?
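| If I try to make it concrete, the claim seems to be something
| like the following (vocabulary size from the quoted paragraph;
| the dimensions are scaled down and the projection is a stand-in),
| though I'm not sure that's what he means:
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   vocab_size, d = 50_000, 64      # d scaled down from ~1000 to keep the toy cheap
|   table = rng.standard_normal((vocab_size, d))   # the only points a text token can hit
|
|   def embed_text(token_id):
|       return table[token_id]       # one of 50,000 fixed vectors
|
|   W = rng.standard_normal((256, d))  # stand-in vision projection
|   def embed_patch(patch):
|       return patch @ W              # can land anywhere in the d-dim space
|
|   # A text token can't move "a little bit"; an image patch can.
|   p = rng.random(256)
|   print(np.linalg.norm(embed_patch(p + 1e-3) - embed_patch(p)))   # small but nonzero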
| unglaublich wrote:
| Generalize it to a video at that point.
| HarHarVeryFunny wrote:
| Vision tokens would only be a viable alternative to text if/when
| the LLM had learnt to read, and was able to control the page
| scanning - how to segment the page into sections of text and non-
| text, segment the text sections into lines, scan the lines in
| language-specific direction (left to right, or right to left),
| segment into words, etc - basically everything that an OCR
| program needs to do prior to the actual OCR bit.
|
| Even having learnt to do all of this, or perhaps with a page-of-
| text sequence-of-word extractor pre-processor, the LLM would then
| need to learn to generalize over different font faces and sizes,
| and imperfect speckled and/or distorted scans.
|
| Finally, but surely not least, if the goal is to reduce
| (inference?) computational load by representing multiple words as
| a single image token, then it seems that more training epochs may
| be needed, with variations in word grouping, since the same
| sequence of words would not always be grouped together, so the
| LLM would have to learn that an image token representing "the cat
| sat" may also have been split up as "today the cat" and "sat on
| the".
|
| A better way to reduce the number of tokens being processed might
| be
| to have the LLM learn how to combine multiple adjacent tokens
| into one, perhaps starting with individual letters at the input,
| although this would of course require a fairly major change to
| the Transformer architecture.
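| The crudest form of that merging would be something like pooling
| pairs of adjacent token embeddings before the transformer blocks
| (a sketch of the idea only, not a claim about how it should
| actually be trained):
|
|   import numpy as np
|
|   def merge_adjacent(token_embeddings, group=2):
|       """Shrink sequence length by mean-pooling groups of adjacent embeddings."""
|       n, d = token_embeddings.shape
|       n_trim = (n // group) * group
|       return token_embeddings[:n_trim].reshape(-1, group, d).mean(axis=1)
|
|   seq = np.random.randn(10, 512)       # 10 tokens, 512-dim embeddings (assumed)
|   print(merge_adjacent(seq).shape)     # (5, 512)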
| reissbaker wrote:
| Multimodal LLMs already learn to generalize over text inside
| images. In my experience most multimodal LLMs are significantly
| better than traditional OCR, especially if there's any unusual
| formatting going on.
| HarHarVeryFunny wrote:
| This thread is considering image input as an alternative to
| text input for text, not as an alternative to other types of
| OCR, so the accuracy bar is 100%.
|
| I've had mixed results with LLMs for OCR... sometimes
| excellent (zero errors on a photo of my credit card bill),
| but poor if the source wasn't a printed page - sometimes
| "reusing" the same image section for multiple extracted
| words!
|
| FWIW, I highly doubt that LLMs have just learnt to scan pages
| from (page image, page text) training pairs - more likely
| text-heavy image input is triggering special OCR handling.
| dang wrote:
| Recent and related:
|
| _Karpathy on DeepSeek-OCR paper: Are pixels better inputs to
| LLMs than text?_ - https://news.ycombinator.com/item?id=45658928
| - Oct 2025 (172 comments)
| LarsDu88 wrote:
| This reminds me of a trick from the world of "competitive
| memorization" illustrated in the wonderful book "Moonwalking with
| Einstein".
|
| To improve your memory recall on any task, like say memorizing
| the order of an entire deck of cards... convert the number
| sequences to a set of visual images!
| BrandiATMuhkuh wrote:
| I'm using this approach quite often. I don't know of any
| documents created by humans for humans that have no formatting.
| The formatting, position etc. are usually an important part of
| the document.
|
| Since the first multimodal LLMs came out, I've been using this
| approach
| when I deal with documents. It makes the code much simpler
| because everything is an image and it's surprisingly robust.
|
| Works also for embeddings (cohere embed v4)
| pmarreck wrote:
| Our own minds do, so...
| pmarreck wrote:
| Wasn't there an AI project that made a brief radar blip on the
| news a few months ago where someone used autocompleted images to
| autocomplete text?
| qingcharles wrote:
| OCR is fine for books which are just swathes of text, but for
| things like magazines it breaks down heavily. You have columns
| breaking in weird places, going up, down, left, right, fonts
| changing in the middle of a paragraph. And then the pages are
| heavy on images which the text is often referencing either
| explicitly or implicitly. Without the images, the meaning of the
| text is often changed or redundant.
|
| Anyone have an LLM that can take a 300 page PDF magazine (with no
| OCR) and summarize it? :)
| northlondoner wrote:
| This is interesting. There might be an information-theoretic
| reason -- perhaps 'spatial tokenization' is more informative than
| 'word tokenization'.
___________________________________________________________________
(page generated 2025-10-27 23:00 UTC)