[HN Gopher] Should LLMs just treat text content as an image?
       ___________________________________________________________________
        
       Should LLMs just treat text content as an image?
        
       Author : ingve
       Score  : 131 points
       Date   : 2025-10-21 06:10 UTC (6 days ago)
        
 (HTM) web link (www.seangoedecke.com)
 (TXT) w3m dump (www.seangoedecke.com)
        
       | onesandofgrain wrote:
       | A picture is worth a thousand words
        
         | hshdhdhehd wrote:
         | And a picture of a thousand words is worth a thousand words.
        
           | genghisjahn wrote:
           | I type at .08 pictures per minute.
        
           | favoboa wrote:
           | A picture of a thousand words with some of them colored,
           | bolded, underlined, etc is worth more than a thousand words
        
       | Havoc wrote:
       | Seems wildly counterintuitive to me frankly.
       | 
        | Even if true, though, I'm not sure what we'd do with it. The
        | bulk of knowledge available on the internet is text. Aside from
        | maybe YouTube, so I guess it could work for world-model-type
        | things? Understanding physical interactions of objects, etc.
        
         | hshdhdhehd wrote:
         | Trivial to convert text to images to process. But counter-
         | intuitive to me too.
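        A minimal sketch of that conversion with Pillow (the font path
        here is an assumption; any TrueType font on the system works):

            from PIL import Image, ImageDraw, ImageFont

            def render_text(text, font_path="DejaVuSansMono.ttf", size=16):
                # Measure the text, then draw it black-on-white onto a
                # fresh grayscale canvas.
                font = ImageFont.truetype(font_path, size)
                left, top, right, bottom = font.getbbox(text)
                img = Image.new("L", (right + 8, bottom + 8), color=255)
                ImageDraw.Draw(img).text((4, 4), text, font=font, fill=0)
                return img

            render_text("the cat sat on the mat").save("patch.png")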
        
         | bilsbie wrote:
         | All text is technically converted to images before we see it.
        
           | thfuran wrote:
           | Only if you see it instead of hearing it or touching it.
        
       | ToJans wrote:
       | A series of tokens is one-dimensional (a sequence). An image is
       | 2-dimensional. What about 3D/4D/... representation (until we end
       | up with an LLM-dimensional solution ofc).
        
         | dvt wrote:
         | This isn't exactly true, as tokens live in the embedding space,
         | which is n-dimensional, like 256 or 512 or whatever (so you
         | might see one word, but it's actually an array of a bunch of
         | numbers). With that said, I think it's pretty intuitive that
         | continuous tokens are more efficient than discrete ones, simply
         | due to the fact that the LLM itself is basically a continuous
          | function (with coefficients/parameters ∈ R).
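        A minimal numpy sketch of the discrete/continuous distinction
        described above (the table and projection are random stand-ins
        for learned weights):

            import numpy as np

            d, vocab = 512, 50_000
            rng = np.random.default_rng(0)

            # Discrete: a text token is one of exactly 50k fixed rows.
            embedding_table = rng.standard_normal((vocab, d))
            text_token = embedding_table[1234]

            # Continuous: an image patch is projected linearly, so the
            # result can land anywhere in R^d, not just on 50k points.
            patch = rng.standard_normal(16 * 16)    # flattened 16x16 patch
            projection = rng.standard_normal((16 * 16, d))
            image_token = patch @ projection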
        
           | wongarsu wrote:
           | We call an embedding-space n-dimensional, but in this context
           | I would consider it 1-dimensional, as in it's a 1d vector of
           | n values. The terminology just sucks. If we described images
           | the same way we describe embeddings a 2 megapixel image would
           | have to be called 2-million-dimensional (or 8-million-
           | dimensional if we consider rgba to be four separate values)
           | 
           | I would also argue tokens are outside the embedding space,
           | and a large part of the magic of LLMs (and many other neural
           | network types) is the ability to map sequences of rather
           | crude inputs (tokens) into a more meaningful embedding space,
           | and then map from a meaningful embedding space back to tokens
           | we humans understand
        
             | LudwigNagasena wrote:
             | Those are just dimensions of different things, and it's
             | usually pretty clear from context what is meant. Color
             | space has 3 dimensions; or 4 with transparency; an image
             | pixel has 6 dimensions (xy+RGBA) if we take its color into
             | account, but only 2 spatial dimensions; if you think of an
             | image as a function that maps continuous xy coordinates
             | into continuous rgba coordinates, then you have an
             | infinitely dimensional function space; embeddings have
             | their own dimensions, but none of them relate to their
              | position in the text at hand, which is why text in this
              | context is said to be 1D and an image is said to be 2D.
        
       | bonsai_spool wrote:
       | This doesn't cite the very significant example of DeepVariant
       | (and as of 10/16/25 DeepSomatic) which convert genomic data to
       | images in order to find DNA mutations. This has been done since
        | the late 2010s.
       | 
       | https://google.github.io/deepvariant/posts/2020-02-20-lookin...
        
         | TZubiri wrote:
         | Using these definitions, mapping datapoints in a graph is also
         | converting data into an image in order to analyze it.
         | 
          | Tabulating data similarly presents it visually so that
          | mistakes or outliers can be spotted.
        
           | bonsai_spool wrote:
           | > Using these definitions
           | 
           | There's a transformation of the data that is analogous to how
           | a human would use the data to find a variant. It's closer to
            | inspecting a stack of widgets to find a defective widget than
            | to listing numbers in a table.
        
       | LysPJ wrote:
       | Andrej Karpathy made an interesting comment on the same paper:
       | https://x.com/karpathy/status/1980397031542989305
        
         | onionisafruit wrote:
         | > It makes two characters that look identical to the eye look
         | as two completely different tokens internally in the network. A
         | smiling emoji looks like a weird token, not an... actual
         | smiling face, pixels and all
         | 
         | This goes against my limited understanding of how LLMs work --
         | and computers generally for that matter. Isn't that rendering
         | of a smiling emoji still just a series of bits that need to be
         | interpreted as a smiley face? The similar looking characters
         | point makes more sense to me though assuming it's something
         | along the lines of recognizing that "S" and "$" are roughly the
         | same thing except for the line down the middle. Still that
         | seems like something that doesn't come up much and is probably
         | covered by observations made in the training corpus.
         | 
         | All that said, Karpathy knows way more than I will ever know on
         | the subject, and I'm only posting my uninformed take here in
         | hopes somebody will correct me in a way I understand.
        
           | jncfhnb wrote:
           | You're reading it backwards. He is not praising that
           | behavior, he is complaining about it. He is saying that bots
            | _should_ parse smiling face emojis as smiling face emojis,
           | but they don't do that currently because as text they get
           | passed as gross unicode that has a lot of ambiguity and just
           | happens to ultimately get rendered as a face to end users.
        
             | ares623 wrote:
             | Wouldn't the training or whatever make that unicode
             | sequence effectively a smiley face?
        
               | scotty79 wrote:
               | Don't ask ChatGPT about seahorse emoji.
        
               | tensor wrote:
               | Don't ask humans either, apparently.
        
               | jncfhnb wrote:
               | Yes, but the same face gets represented by many unique
                | strings. Strings which may or may not be tokenized into
               | a single clean "smiley face" token.
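        A small illustration of the many-strings problem: the plain
        smiley codepoint, the same codepoint with an emoji variation
        selector, and the dedicated emoji codepoint all render as a
        smiling face but reach the tokenizer as different byte strings:

            # Three "smileys" with very different underlying bytes.
            for s in ["\u263a", "\u263a\ufe0f", "\U0001f60a"]:
                print([hex(b) for b in s.encode("utf-8")])

            # ['0xe2', '0x98', '0xba']
            # ['0xe2', '0x98', '0xba', '0xef', '0xb8', '0x8f']
            # ['0xf0', '0x9f', '0x98', '0x8a']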
        
       | themoxon wrote:
       | There's a new paper from ICCV which basically tries to render
       | every modality as images:
       | https://openaccess.thecvf.com/content/ICCV2025/papers/Hudson...
        
       | pcwelder wrote:
       | I can guarantee that the OSR can't read this sentense correstlu.
        
         | geysersam wrote:
         | Really? How so?
        
           | moduspol wrote:
           | Looks like he's using atypical "c" characters.
        
         | syntaxing wrote:
         | What's correct though? Even as a human, I read that
         | "correctly". Using weird representations of C doesn't change
         | the word?
        
           | metalliqaz wrote:
           | Yeah OCR would be much more likely to read that sentence the
           | way a human would.
        
           | LudwigNagasena wrote:
           | I would even say that OCR can read the sentense correstlu,
           | while a tokenizer can't.
        
             | kgeist wrote:
             | Qwen3 8b perfectly understood it after 14 seconds of
             | thinking.
        
         | bitdivision wrote:
         | A lot of Cyrillic characters:
         | https://apps.timwhitlock.info/unicode/inspect?s=I+%CF%B2%D0%...
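        A quick way to expose such homoglyphs; the two lookalikes below
        (Greek lunate sigma and Cyrillic es) match the %CF%B2 and
        %D0%... escapes in that URL:

            import unicodedata

            # Latin "c" versus two visually near-identical characters.
            for ch in "c\u03f2\u0441":
                print(hex(ord(ch)), unicodedata.name(ch))

            # 0x63  LATIN SMALL LETTER C
            # 0x3f2 GREEK LUNATE SIGMA SYMBOL
            # 0x441 CYRILLIC SMALL LETTER ES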
        
       | vindex10 wrote:
        | Reminds me of the difference between fastText and word2vec.
        | 
        | fastText can handle words it hasn't seen before by combining
        | character n-grams; word2vec can learn a better meaning for whole
        | words, but misses out on unknown words.
        | 
        | Image tokens are the "text2vec" here, while text tokens are a
        | proxy for building an embedding of even previously unseen text.
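        A minimal sketch of the fastText subword trick (real fastText
        hashes several n-gram sizes and sums their vectors, which is
        how unseen words still get a representation):

            def char_ngrams(word, n=3):
                # Pad with boundary markers, then slide a window.
                padded = f"<{word}>"
                return [padded[i:i + n]
                        for i in range(len(padded) - n + 1)]

            print(char_ngrams("sentense"))
            # ['<se', 'sen', 'ent', 'nte', 'ten', 'ens', 'nse', 'se>']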
        
       | mohsen1 wrote:
        | My understanding is that text tokens are too rigid. The way we
        | read is not to process each character (for LLMs, each token)
        | individually but to see a word, or sometimes a collection of
        | familiar words, and make sense of the writing. The concept we
        | take from written text is really what we read, not the letters
        | or words exactly. This is why we can easily read written text
        | with typos: it is just similar enough. By letting LLMs not be
        | too hung up on exact tokens and "skim" through text, we can
        | make them more efficient, just like how humans read
        | efficiently.
        
         | bonsai_spool wrote:
         | You miss that we already have 'context' when we begin reading
         | something, and that probably enables our fast reading. Maybe
          | there's a way to give that background information to an LLM,
          | but then we could also just have it read the entire input
          | stream.
        
         | lupire wrote:
         | Google AI mode:
         | 
         | User: cna uto nfeir waht tihs si msieplnig?
         | 
         | Google: Here are top web results for exploring this topic [few
         | websites containing or discussing gibberish in general]
         | 
         | User: what was my previous question?
         | 
         | Google: Your previous question was "cna uto nfeir waht tihs si
         | msieplnig?". I was able to understand this through a
         | combination of contextual understanding and sub-word
         | tokenization, which allowed me to infer the intended meaning
         | despite the numerous typos. The corrected question is: "Can you
         | infer what this is misspelling?".
        
           | Miraltar wrote:
           | I couldn't read it
        
           | valdork59 wrote:
           | GPT-5 thinking doesn't struggle with this. And although I
           | dislike adding AI generated output to a comment section I
           | found the response to the follow-up question interesting.
           | 
            | q: c u i t m o t s?
            | a: "Can you infer the meaning of this sentence?"
        
       | nacozarina wrote:
        | the enshittifiers simply haven't yet weighed image-processing
        | fees against potential token charges; once they have, your cost
        | advantage goes bye-bye
        
       | ghoul2 wrote:
       | But does this not miss the "context" that the embeddings of the
       | text tokens carry? An LLM embedding of a text token has a
       | compressed version of the entire preceding set of tokens that
       | came before it in the context. While the image embeddings are
       | just representations of pixel values.
       | 
       | Sort of at the level of word2vec, where the representation of
       | "flies" in "fruit flies like a banana" vs "time flies like an
       | arrow" would be the same.
        
       | wongarsu wrote:
       | Look long enough at literature on any machine learning task, and
       | someone invariably gets the idea to turn the data into an image
       | and do machine learning on that. Sometimes it works out (turning
       | binaries into images and doing malware detection with a CNN
       | surprisingly works), usually it doesn't. Just like in this
       | example the images usually end up as a kludge to fix some
       | deficiency in the prevalent input encoding.
       | 
       | I can certainly believe that images bring certain advantages over
       | text for LLMs: the image representation does contain useful
       | information that we as humans use (like better information
       | hierarchies encoded in text size, boldness, color, saturation and
       | position, not just n levels of markdown headings), letter shapes
       | are already optimized for this kind of encoding, and continuous
       | tokens seem to bring some advantages over discrete ones. But none
       | of these advantages need the roundtrip via images, they merely
        | point to how crude the state of the art of text tokenization is.
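        A minimal sketch of that byte-to-pixel encoding (the input path
        is just an example):

            import numpy as np
            from PIL import Image

            def binary_to_image(path, width=256):
                # View each byte of the file as one grayscale pixel,
                # padding the tail so the bytes fill a rectangle.
                data = np.frombuffer(open(path, "rb").read(),
                                     dtype=np.uint8)
                rows = -(-len(data) // width)    # ceiling division
                padded = np.pad(data, (0, rows * width - len(data)))
                return Image.fromarray(padded.reshape(rows, width))

            binary_to_image("/bin/ls").save("binary.png")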
        
         | pegasus wrote:
          | Exactly. The example the article gives of reducing resolution as
         | a form of compression highlights the limitations of the visual-
         | only proposal. Blurring text is a poor form of compression,
         | preserving at most information about paragraph sizes.
         | Summarizing early paragraphs (as context compression does in
         | coding agents) would be much more efficient.
        
         | vanderZwan wrote:
         | Doesn't this more or less boil down to OCR scans of books
         | having more privileged information than a plaintext file? In
         | which case a roundtrip won't add anything?
         | 
         | [0]
         | https://web.archive.org/web/20140402025221/http://m.nautil.u...
        
         | adzm wrote:
         | A great example of this is changing music into an image and
         | using that to train and generate new images that get converted
         | back into music. It was surprisingly successful. I think this
         | approach is still used by the current music generators.
        
           | bjourne wrote:
           | You are talking about piano roll notation, I think. While
           | it's 2d data, it's not quite the same as actual image data.
           | E.g., 2d conv and pooling operations are useless for music.
           | The patterns and dependencies are too subtle to be captured
           | by spatial filters.
        
             | yberreby wrote:
             | I've seen this approach applied to spectrograms.
             | Convolutions do make enough sense there.
        
             | adzm wrote:
              | I am talking about using spectrograms (a Fourier transform
              | into the frequency domain, plotted over time), which
              | yields a 2D image of the song. That image is then used to
              | train something like Stable Diffusion (some actually used
              | Stable Diffusion itself) to generate new spectrograms,
              | which are then converted back into audio. Riffusion used
              | this approach.
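        A minimal sketch of that roundtrip with librosa (lossy, since
        the mel spectrogram discards phase and the inverse falls back
        on Griffin-Lim; "song.wav" is a placeholder):

            import librosa
            import soundfile as sf

            y, sr = librosa.load("song.wav", sr=22050)
            # `mel` is the 2D "image" a diffusion model would train on.
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
            y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
            sf.write("roundtrip.wav", y_hat, sr)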
        
         | arbot360 wrote:
         | Another great example of this working is the genomic variant
         | calling models from Deepmind "DeepVariant". They use the
         | "alignment pile-up" images which are also used by humans to
          | debug genomic alignments, with some additional channels as
          | extra engineered features for the CNN.
        
         | rangestransform wrote:
         | This reminds me of how trajectory prediction networks for
         | autonomous driving used to use a CNN to encode scene context
         | (from map and object detection rasters) until vectornet showed
         | up
        
       | aitchnyu wrote:
       | The amount of video/imagery to make a million tokens vs the
       | amount of text to do the same is a surprisingly low ratio. Did
       | they have the same intuition?
        
       | metalliqaz wrote:
       | Future headline: "The unreasonable effectiveness of text
       | encoding"
        
       | leemcalilly wrote:
       | and reading (aka "ocr") is the fastest way for the brain to
       | process language.
        
       | qiine wrote:
        | or maybe 3D objects, since that's closer to what real life is
        | and what the brain shaped itself around?
        
       | mannykannot wrote:
        | Language was spoken long before it was written (or so it
        | seems). This article almost suggests that sound might be a
        | superior input medium to either digital text or images.
        
         | falcor84 wrote:
         | I've never worked in that area, but recall reading about how
         | images of spectrograms are often superior inputs to neural nets
         | in comparison to the raw audio data.
        
           | rhdunn wrote:
           | Speech to text and text to speech typically operate on the
           | audio spectrogram, specifically the Mel-scale spectrum. This
           | is a filtered spectrogram that decreases the noise in the
           | data. Thus, they are not working on the images of these
           | spectra but the computed values -- each spectral slice will
           | be a matrix row or column of values.
           | 
           | The theory is that vowels and voiced consonants have a
           | fundamental frequency and 5-6 frequencies above that. For
           | vowels the first two frequencies are enough to identify the
           | vowel. For rhotic vowels (r-sounding vowels like American
           | stARt) the 3rd frequency is important.
           | 
           | By converting the audio to the Mel-scale spectrum, it is
           | easier to detect these features. Text to speech using the
           | Mel-spectrum works by modelling and generating these values,
           | which is often easier as the number of parameters is lower
           | and the data is easier to work with [1].
           | 
           | [1] There are other approaches to text to speech such as
           | overlapping short audio segments.
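        For reference, a sketch of the HTK-style Hz-to-mel mapping such
        filterbanks are built on (roughly linear below 1 kHz and
        logarithmic above, matching perceived pitch distance):

            import numpy as np

            def hz_to_mel(f):
                return 2595.0 * np.log10(1.0 + f / 700.0)

            for f in [250, 1000, 4000]:
                print(f, round(hz_to_mel(f)))
            # 250 -> 344, 1000 -> 1000, 4000 -> 2146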
        
             | HarHarVeryFunny wrote:
             | The Mel-scale spectrogram doesn't do anything specific to
             | reduce noise compared to an FFT. It's just preferred for
             | traditional speech recognition because it uses a non-linear
             | frequency scale that better matches human perception.
             | 
             | Speech recognition is based around recognizing the
             | frequency correlates of speech generation/articulation,
             | mainly the frequency bands that are attenuated by vocal
             | tract resonances as articulation changes the shape of the
             | vocal tract.
             | 
             | The fundamental frequency, f0, of someone's voice is not
             | important to speech recognition - that is just the
             | frequency with which their vocal chords are opening and
             | closing, corresponding to a high pitched voice (e.g.
             | typical female or child) vs a low pitched one (male).
             | 
             | What happens during speech production is that due to the
             | complex waveform generated by the asymmetrically timed
             | opening and closing of the vocal chords (slow open, fast
             | close), not only is the fundamental frequency, f0,
             | generated, but also harmonics of it - 2xf0, 3xf0, 4xf0,
             | etc. The resonances of the vocal tract then attenuate
             | certain frequency ranges within this spectrum of
             | frequencies, and it's these changing attenuated frequency
             | ranges, aka formants, that effectively carry the
             | articulation/speech information.
             | 
              | The frequency ranges of the formants also vary according to
             | the length of the vocal tract, which varies between
             | individuals, so it's not specific frequencies such as f0 or
             | its harmonics that carry the speech information, but rather
             | changing patterns of attenuation (formants).
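        A toy numpy sketch of that source-filter picture: a harmonic
        series at multiples of f0, with energy boosted near two formant
        frequencies (the values are roughly /a/-like and chosen purely
        for illustration):

            import numpy as np

            sr, f0, dur = 16000, 120.0, 0.5
            t = np.linspace(0, dur, int(sr * dur), endpoint=False)
            harmonics = np.arange(1, 40) * f0   # f0, 2*f0, 3*f0, ...

            # Gain per harmonic: bumps around the two formants. The
            # recognizable vowel quality lives in this gain pattern,
            # not in the particular f0.
            formants = [700.0, 1200.0]
            gains = np.array([sum(np.exp(-((h - fm) / 150.0) ** 2)
                                  for fm in formants)
                              for h in harmonics])

            vowel = sum(g * np.sin(2 * np.pi * h * t) / k
                        for k, (h, g)
                        in enumerate(zip(harmonics, gains), start=1))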
        
           | scotty79 wrote:
            | Raw audio data is unnatural. The ear doesn't capture
            | pressure samples thousands of times per second; it captures
            | frequencies and the sonic energy they carry. The result of
            | computing a spectrogram on the raw data is much closer to
            | what comes out of our biological sensor.
        
       | DonHopkins wrote:
       | Wouldn't it be ironic if Comic Sans turned out to be the most
        | efficient font for LLM OCR understanding?
        
       | skywhopper wrote:
       | There's some poor logic in this writeup. Yes, images can contain
       | more information than words, but the extra information an image
       | of a word conveys is usually not relevant to the intent of the
       | communication, at least not for the purposes assumed in this
        | writeup. I.e., pre-converting the text you would have typed into
       | ChatGPT and uploading that as an image instead will not better
       | convey the meaning and intent behind your words.
       | 
        | If it gives better results (something for which no evidence is
        | presented), that'd be interesting, but it wouldn't be because
       | of the larger data size of the uploaded image vs the text.
        
       | sojuz151 wrote:
        | If text rendering + image input works as a better tokeniser,
        | that means current tokenisers are bad and something better is
        | needed.
        
       | rebeccaskinner wrote:
        | Although this isn't directly related to the idea in the article,
       | I'm reminded that one of the most effective hacks I've found for
        | working with ChatGPT has been to attach screenshots of files
       | rather than the files themselves. I've noticed the model will
       | almost always pay attention to an image and pull relevant data
       | out of it, but it requires a lot of detailed prompting to get it
       | to reliably pay attention to text and pdf attachments instead of
       | just hallucinating their contents.
        
         | amluto wrote:
         | Hmm. Yesterday I stuck a >100 page PDF into a Claude Project
         | and asked Claude to reference a table in the middle of it (I
         | gave page numbers) to generate machine readable text. I watched
         | with some bafflement as Claude convinced itself that the PDF
         | wasn't a PDF, but then it managed to recover all on its own and
         | generated 100% correct output. (Well, 100% correct in terms of
         | reading the PDF - it did get a bit confused a few times
         | following my instructions.)
        
           | sxndmxn wrote:
           | PDF is a trash format
        
         | oceanplexian wrote:
         | This is probably because your provider is generating embeddings
         | over the document to save money, and then simply running a
         | vector search across it instead of fitting it all in context.
        
       | fathermarz wrote:
        | I have been seeing this a lot. Does this add new meaning to "a
        | picture is worth 1000 words"?
       | 
       | I think I saw a benchmark on HN for formats of text files given
       | to an LLM to see which would give better results. I wish that
        | study had included images in the comparison.
        
       | TZubiri wrote:
       | Maybe, but it wouldn't be a language model.
        
       | Bolwin wrote:
       | Does anyone know if vlms suffer more from quantization? One thing
       | I've noticed is practically every improvement in llms is already
       | half there in quantization.
        
       | mingtianzhang wrote:
        | We actually don't need OCR:
        | https://pageindex.ai/blog/do-we-need-ocr
        
       | cubefox wrote:
       | I don't understand this paragraph:
       | 
       | > The first explanation is that text tokens are discrete while
       | image tokens are continuous. Each model has a finite number of
       | text tokens - say, around 50,000. Each of those tokens
       | corresponds to an embedding of, say, 1000 floating-point numbers.
       | Text tokens thus only occupy a scattering of single points in the
        | space of all possible embeddings. By contrast, the embedding of
        | an image token can be any sequence of those 1000 numbers. So an
        | image token can be far more expressive than a series of text
        | tokens.
       | 
       | Does someone understand the difference he is pointing at?
        
       | unglaublich wrote:
       | Generalize it to a video at that point.
        
       | HarHarVeryFunny wrote:
       | Vision tokens would only be a viable alternative to text if/when
       | the LLM had learnt to read, and was able to control the page
       | scanning - how to segment the page into sections of text and non-
       | text, segment the text sections into lines, scan the lines in
       | language-specific direction (left to right, or right to left),
       | segment into words, etc - basically everything that an OCR
       | program needs to do prior to the actual OCR bit.
       | 
       | Even having learnt to do all of this, or perhaps with a page-of-
       | text sequence-of-word extractor pre-processor, the LLM would then
       | need to learn to generalize over different font faces and sizes,
       | and imperfect speckled and/or distorted scans.
       | 
       | Finally, but surely not least, if the goal is to reduce
       | (inference?) computational load by representing multiple words as
       | a single image token, then it seems that more training epochs may
       | be needed, with variations in word grouping, since the same
       | sequence of words would not always be grouped together, so the
       | LLM would have to learn that an image token representing "the cat
       | sat" may also have been split up as "today the cat" and "sat on
       | the".
       | 
       | A better way to reduce number of tokens being processed might be
       | to have the LLM learn how to combine multiple adjacent tokens
       | into one, perhaps starting with individual letters at the input,
       | although this would of course require a fairly major change to
       | the Transformer architecture.
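        A minimal sketch of that adjacent-token merging idea (plain
        mean pooling here; the proposal above would make the
        combination learned):

            import numpy as np

            def merge_adjacent(token_embs, group=3):
                # Average each run of `group` adjacent embeddings into
                # one, cutting sequence length (and attention cost) by
                # roughly a factor of `group`.
                n, d = token_embs.shape
                pad = (-n) % group
                padded = np.pad(token_embs, ((0, pad), (0, 0)))
                return padded.reshape(-1, group, d).mean(axis=1)

            embs = np.random.default_rng(0).standard_normal((10, 512))
            print(merge_adjacent(embs).shape)   # (4, 512)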
        
         | reissbaker wrote:
         | Multimodal LLMs already learn to generalize over text inside
         | images. In my experience most multimodal LLMs are significantly
         | better than traditional OCR, especially if there's any unusual
         | formatting going on.
        
           | HarHarVeryFunny wrote:
           | This thread is considering image input as an alternative to
           | text input for text, not as an alternative to other types of
           | OCR, so the accuracy bar is 100%.
           | 
            | I've had mixed results with LLMs for OCR... sometimes
           | excellent (zero errors on a photo of my credit card bill),
           | but poor if the source wasn't a printed page - sometimes
           | "reusing" the same image section for multiple extracted
           | words!
           | 
           | FWIW, I highly doubt that LLMs have just learnt to scan pages
           | from (page image, page text) training pairs - more likely
           | text-heavy image input is triggering special OCR handling.
        
       | dang wrote:
       | Recent and related:
       | 
       |  _Karpathy on DeepSeek-OCR paper: Are pixels better inputs to
       | LLMs than text?_ - https://news.ycombinator.com/item?id=45658928
       | - Oct 2025 (172 comments)
        
       | LarsDu88 wrote:
       | This reminds me of a trick from the world of "competitive
       | memorization" illustrated in the wonderful book "Moonwalking with
       | Einstein".
       | 
       | To improve your memory recall on any task, like say memorizing
       | the order of an entire deck of cards... convert the number
       | sequences to a set of visual images!
        
       | BrandiATMuhkuh wrote:
       | I'm using this approach quite often. I don't know of any
       | documents created by humans for humans that have no formatting.
       | The formatting, position etc. are usually an important part of
       | the document.
       | 
        | Since the first multimodal LLMs came out, I've been using this
        | approach when dealing with documents. It makes the code much
        | simpler because everything is an image, and it's surprisingly
        | robust.
       | 
       | Works also for embeddings (cohere embed v4)
        
       | pmarreck wrote:
       | Our own minds do, so...
        
       | pmarreck wrote:
       | Wasn't there an AI project that made a brief radar blip on the
       | news a few months ago where someone used autocompleted images to
       | autocomplete text?
        
       | qingcharles wrote:
       | OCR is fine for books which are just swathes of text, but for
       | things like magazines it breaks down heavily. You have columns
       | breaking in weird places, going up, down, left, right, fonts
       | changing in the middle of a paragraph. And then the pages are
       | heavy on images which the text is often referencing either
       | explicitly or implicitly. Without the images, the meaning of the
       | text is often changed or redundant.
       | 
       | Anyone have an LLM that can take a 300 page PDF magazine (with no
       | OCR) and summarize it? :)
        
       | northlondoner wrote:
       | This is interesting. There might be an information-theoretic
       | reason -- perhaps 'spatial tokenization' is more informative than
        | 'word tokenization'.
        
       ___________________________________________________________________
       (page generated 2025-10-27 23:00 UTC)