[HN Gopher] How Does GPT-4o Encode Images?
       ___________________________________________________________________
        
       How Does GPT-4o Encode Images?
        
       Author : olooney
       Score  : 267 points
       Date   : 2024-06-07 12:54 UTC (10 hours ago)
        
 (HTM) web link (www.oranlooney.com)
 (TXT) w3m dump (www.oranlooney.com)
        
       | simonw wrote:
       | The way this tests GPT-4o performance by feeding in a 7x7 grid of
       | colored shapes and requesting them back as JSON (about half way
       | down the page) is really clever.
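        | 
        | For anyone who wants to try the same kind of probe, here's a
        | minimal sketch (assuming PIL and the openai Python client; the
        | prompt, model name, and grid size are illustrative, and the reply
        | may need its code fences stripped before parsing):
        | 
        |   import base64, io, json, random
        |   from PIL import Image, ImageDraw
        |   from openai import OpenAI
        | 
        |   # Draw a 7x7 grid of randomly colored circles on a 512x512 canvas.
        |   colors = ["red", "green", "blue", "yellow", "purple", "orange"]
        |   grid = [[random.choice(colors) for _ in range(7)] for _ in range(7)]
        |   img = Image.new("RGB", (512, 512), "white")
        |   draw = ImageDraw.Draw(img)
        |   cell = 512 // 7
        |   for r, row in enumerate(grid):
        |       for c, color in enumerate(row):
        |           x, y = c * cell, r * cell
        |           draw.ellipse([x + 8, y + 8, x + cell - 8, y + cell - 8], fill=color)
        | 
        |   buf = io.BytesIO()
        |   img.save(buf, format="PNG")
        |   b64 = base64.b64encode(buf.getvalue()).decode()
        | 
        |   # Ask for the grid back as JSON and score it against the ground truth.
        |   client = OpenAI()
        |   resp = client.chat.completions.create(
        |       model="gpt-4o",
        |       messages=[{"role": "user", "content": [
        |           {"type": "text", "text": "Return the colors of the circles as a "
        |                                    "7x7 JSON array of color names, row by "
        |                                    "row. JSON only, no prose."},
        |           {"type": "image_url", "image_url": {
        |               "url": f"data:image/png;base64,{b64}"}},
        |       ]}],
        |   )
        |   answer = json.loads(resp.choices[0].message.content)
        |   correct = sum(answer[r][c] == grid[r][c] for r in range(7) for c in range(7))
        |   print(f"{correct}/49 cells recovered")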
        
         | blixt wrote:
         | I did something similar when GPT-4V came out, partially with
         | the goal to figure out the input format (I did not get anywhere
         | other than "magic vectors"), but also to roughly estimate the
         | amount of data you can get back out of a 512x512 (the low
         | quality option) image.
         | 
          | What I found is that you can sometimes get more text out of an
          | 85-token image than you can out of 85 tokens of text! That
         | said, I think there will be plenty of edge cases where it
         | actually loses some information, and maybe you could argue that
         | if you remove every other word in the text, it could still
         | restore the text.
         | 
         | I never went deeper on this, but I believe there's something
         | clever to be done in the context window with the fact that
         | images are relatively cheap tokens-wise.
        
           | _dark_matter_ wrote:
            | The author mentions this in the article: more than 170
           | tokens of text can be pulled from an image.
        
             | blixt wrote:
             | Ah, you're right! My bad!
        
       | GaggiX wrote:
        | An important aspect that is not considered in the article is that
        | GPT-4o can generate images by itself (even though the feature is
        | not enabled for the public), meaning that it's very likely trained
        | on sequential image tokens and the images are quantized using a
        | VQGAN. My guess is that the VQGAN takes 512x512 images and
        | outputs 13x13 tokens (169 image tokens + a special token). The
        | VQGAN can be a convolutional network like the one shown in the
        | article; for a transformer-based VQGAN I cannot think of a
        | configuration with overlapping patches where it would output
        | 13x13 tokens on a 512x512 image (unless they just added a padding
        | of 4 to the entire image and the patches are not overlapping).
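        | 
        | For what it's worth, the standard patch/convolution output-size
        | arithmetic does allow a few configurations that land on 13x13 for
        | a 512x512 input (a quick check, nothing GPT-4o specific):
        | 
        |   def out_size(w, k, s, p):
        |       """floor((w + 2p - k) / s) + 1 -- standard conv/patch output size."""
        |       return (w + 2 * p - k) // s + 1
        | 
        |   # Non-overlapping 40x40 patches with 4px padding:
        |   # (512 + 8 - 40)/40 + 1 = 13
        |   print(out_size(512, k=40, s=40, p=4))   # 13
        | 
        |   # An overlapping alternative: 56x56 patches with stride 38, no padding
        |   print(out_size(512, k=56, s=38, p=0))   # 13
        | 
        |   # 13 * 13 = 169 patch tokens, plus one special token = 170
        |   print(13 * 13 + 1)                      # 170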
        
         | edude03 wrote:
          | How do we know it generates the images itself and isn't passing
          | the text to DALL-E? That's supposedly how the current GPT-4
          | model does listen mode (with Whisper, but the same idea).
        
           | GaggiX wrote:
           | Go to the "Explorations of capabilities" and explore all the
           | capabilities: https://openai.com/index/hello-gpt-4o/
           | 
           | You cannot have this level of control by prompting Dalle,
           | also GPT-4o isn't using Whisper (older GPT-4s yes).
        
           | ec109685 wrote:
            | At least ChatGPT-4o still looks like it is using DALL-E.
           | 
           | https://x.com/krishnanrohit/status/1755123169353236848?s=46
        
       | simonw wrote:
        | Something I don't get is why OpenAI don't provide clear,
        | comprehensive documentation as to how this actually works.
       | 
       | I get that there's competition from other providers now so they
       | have an instinct to keep implementation details secret, but as
       | someone building on their APIs this lack of documentation really
       | holds me back.
       | 
       | To make good judgements about how to use this stuff I need to
       | know how it works!
       | 
       | I had a hilarious bug a few weeks ago where I loaded in a single
       | image representing multiple pages of a PDF and GPT-4 vision
       | effectively hallucinated the contents of the document when asked
       | to OCR it, presumably because the image was too big and was first
       | resized to a point where the text was illegible:
       | https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
       | 
       | If OpenAI had clear documentation about how their image handling
       | works I could avoid those kinds of problems much more
       | effectively.
        
         | Onawa wrote:
         | I was trying to figure out this exact same issue. OCR on a PDF
         | worked great, up until a certain point when it just started
         | hallucinating like crazy. I was working on a whole pipeline to
         | just feed in a PDF one page at a time to try and get around
         | this issue. Otherwise, the OCR works absolutely fantastic
            | compared to all the other tools I've been trying lately.
         | These include OCRmyPDF (Tesseract), SuryaOCR, and some of the
         | models on the Visual LLM Leaderboard.
         | 
         | I've also seen some people recommend Paddle OCR, but I find
         | their documentation to be lacking and I haven't got that one
         | working yet to evaluate.
        
           | infecto wrote:
           | For document text/table extraction, nothing beats the quality
           | from the cloud providers. It can get costly but the accuracy
           | is much higher than what you will find using an openai API.
        
           | raybb wrote:
              | Simon Willison recently had a thread going through some of the
           | options here https://x.com/simonw/status/1797526667797442773
        
             | Onawa wrote:
             | Funny enough, Simon Willison is the op of this comment
             | thread lol.
        
         | infecto wrote:
         | But they do document that the images are resized and give you
         | some rough guidelines on how you should be sizing your images.
          | Low resolution is 1024 x 1024 with no tiling, and high
          | resolution starts at 2048 x 2048, which then gets tiled. It
          | could use further documentation, but it is enough to know that
          | more than one page per image should never be used via the API.
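          | 
          | For estimating cost, the tiling rules I've seen documented work
          | out to 85 base tokens plus 170 per 512px tile. A rough
          | calculator (a sketch; double-check the exact resize behavior
          | against the current docs):
          | 
          |   import math
          | 
          |   def gpt4o_image_tokens(width, height, detail="high"):
          |       """85 base tokens, plus 170 per 512px tile in high-detail mode."""
          |       if detail == "low":
          |           return 85
          |       # High detail: fit within 2048x2048, scale the shortest side
          |       # down to 768, then split into 512x512 tiles.
          |       scale = min(1.0, 2048 / max(width, height))
          |       w, h = width * scale, height * scale
          |       scale = min(1.0, 768 / min(w, h))
          |       w, h = w * scale, h * scale
          |       tiles = math.ceil(w / 512) * math.ceil(h / 512)
          |       return 85 + 170 * tiles
          | 
          |   print(gpt4o_image_tokens(512, 512))    # 255: one tile plus base cost
          |   print(gpt4o_image_tokens(2048, 4096))  # 1105: six tiles plus base cost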
        
           | alach11 wrote:
           | Right. But I still have a lot of questions. How does the
           | model handle when something important overlaps multiple tiles
           | in high-resolution mode? Am I better off doing the tiling
           | myself with some overlap?
        
         | nolok wrote:
          | The fact that it's so eager to hallucinate random things that
          | sound plausible enough if you're not paying attention, without
          | warning you or giving any error, should make people reconsider
          | using it for "data journalism" or similar.
         | 
          | If you make your system and it "works", then how will you see
          | the one time out of X where it confidently provides you false
          | information that you happily use because it usually works?
        
           | TeMPOraL wrote:
            | > _how will you see the one time out of X where it
            | confidently provides you false information that you happily
            | use because it usually works?_
           | 
           | You don't. You treat it like you would a human worker: set
           | your process to detect or tolerate wrong output. If you
           | can't, don't apply this tool to your work.
        
             | IanCal wrote:
              | This is true but misses a key fact: typical LLM errors are
              | different from human errors. Not that they're worse or
              | better, just that you need to understand where and when
              | they're more likely to make mistakes and how to manage
              | that.
        
           | simonw wrote:
           | Right, that's why I've been recommending dedicated OCR tools
           | (Textract etc) over vision LLMs:
            | https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
        
         | ilaksh wrote:
         | There is an effectively infinite number of possibilities of
         | things people could throw at it and they can't know ahead of
         | time whether your use case will work or not. Even if they told
         | you exactly how it worked, you wouldn't know for sure until you
         | tried it. And giving a vague explanation wouldn't help you
         | either.
        
         | resters wrote:
         | Is there documentation (is it possible?) on how to upload a PDF
         | to gpt-4o using the API?
        
           | simonw wrote:
           | I think you have to split it into a page per image and then
           | upload each page separately. That's how I've been doing it.
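            | 
            | Something like this, assuming the pdf2image package (which
            | needs poppler) and the openai client; the prompt is just an
            | example:
            | 
            |   import base64, io
            |   from pdf2image import convert_from_path
            |   from openai import OpenAI
            | 
            |   client = OpenAI()
            |   pages = convert_from_path("document.pdf", dpi=150)  # one PIL image per page
            | 
            |   for i, page in enumerate(pages):
            |       buf = io.BytesIO()
            |       page.save(buf, format="PNG")
            |       b64 = base64.b64encode(buf.getvalue()).decode()
            |       resp = client.chat.completions.create(
            |           model="gpt-4o",
            |           messages=[{"role": "user", "content": [
            |               {"type": "text", "text": "OCR this page and return the text verbatim."},
            |               {"type": "image_url", "image_url": {
            |                   "url": f"data:image/png;base64,{b64}"}},
            |           ]}],
            |       )
            |       print(f"--- page {i + 1} ---")
            |       print(resp.choices[0].message.content)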
        
       | tantalor wrote:
       | > CLIP embeds the entire image as a single vector, not 170 of
       | them.
       | 
       | Single token?
       | 
       | > GPT-4o must be using a different, more advanced strategy
       | internally
       | 
       | Why
        
         | freediver wrote:
          | The embeddings do not offer the level of fidelity needed to
          | recognize fine details in an image, handwriting for example.
        
       | blixt wrote:
       | I went through a similar journey back when GPT-4V came out.
        | Here's an additional puzzle for you: GPT-4V knows the _exact_
        | pixel dimensions of the image (post-resize, since there is a max
        | size for images in the pipeline, besides 512x512), but I'm 99%
        | sure it's not provided as text tokens. How am I so sure? It's
        | easy to get GPT to divulge everything from system prompt to tool
        | details, etc., but I've tried every trick in the book and then
        | some, multiple times over, and there is no way to get it to quote
        | the dimensions as text. The only way to get it to give you the
        | dimensions is to tell it to output a structure that contains
        | width and height and to just pick something reasonable, and the
        | values will "randomly" be correct:
       | 
       | https://x.com/blixt/status/1722298733470024076
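        | 
        | The probe itself is nothing fancy; roughly this (openai client,
        | with an illustrative prompt):
        | 
        |   import base64
        |   from openai import OpenAI
        | 
        |   client = OpenAI()
        |   with open("photo.jpg", "rb") as f:
        |       b64 = base64.b64encode(f.read()).decode()
        | 
        |   resp = client.chat.completions.create(
        |       model="gpt-4o",
        |       messages=[{"role": "user", "content": [
        |           {"type": "text", "text": 'Fill in this JSON about the image, '
        |                                    'guessing any field you are unsure of: '
        |                                    '{"width": 0, "height": 0, "caption": ""}'},
        |           {"type": "image_url", "image_url": {
        |               "url": f"data:image/jpeg;base64,{b64}"}},
        |       ]}],
        |   )
        |   # The "guessed" width/height tend to match the post-resize dimensions.
        |   print(resp.choices[0].message.content)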
        
         | dannyw wrote:
         | Perhaps images aren't tokens at all... and 170 tokens is just
         | an approximation of the compute cost.
        
           | qarl wrote:
           | They address this question in the article.
        
           | blixt wrote:
           | I think that would have pretty serious implications for the
           | transformer architecture though. If they're not embedded like
           | text tokens, how would attention, etc work? And a
           | conversation with multiple images back and forth? Not to
           | mention with GPT-4o now having audio support as well. I would
           | assume it does become tokens.
        
         | llm_trw wrote:
         | > It's easy to get GPT to divulge everything from system prompt
         | to tool details,
         | 
         | It's easy enough to get it to hallucinate those things. It
         | doesn't actually tell them to you.
        
           | blixt wrote:
           | I'm well aware of that, but there are plenty of ways to
           | induce verbatim quoting from "hidden" information, and mostly
           | verify it (through sampling a large number of times in
           | separate runs).
           | 
            | Models are improving at truly hiding or ignoring information
            | these days, though. As the author of the article states,
            | you'll have a hard time tricking GPT-4o into reading text in
            | images as instructions, most likely thanks to this research:
            | https://openai.com/index/the-instruction-hierarchy/
            | 
            | I do feel pretty confident that when the model happily spits
            | out its system prompt and all metadata around the image, but
            | not its pixel dimensions, those dimensions probably were not
            | provided in any system/assistant/tool message. So maybe part
            | of the image embeddings also encodes the pixel dimensions
            | somehow (it would also help the model not think of the image
            | as a squished square for non-1:1 images that have been
            | resized to 512x512).
        
       | alach11 wrote:
       | I really hope we see improvements to the resolutions large
       | multimodal models can handle. Right now this patchwork approach
        | leads to lots of unwieldy workarounds in applications.
        
       | eminence32 wrote:
       | I'm assuming that the tokens used to encode an image are entirely
       | distinct from the tokens used to encode text. Does anyone know if
       | this is actually the case?
        
         | tempusalaria wrote:
         | It's probable that there is a separate vision encoder which
         | projects the image tiles into the distribution space of the
          | text tokenizer, a la CLIP/LLaVA.
        
         | blixt wrote:
         | I would assume it has a "mode" token where it switches between
         | text/image (and now audio), or you'd have to try to maximize
         | the number of reserved tokens between multiple modes. GPT-4o
         | did go from 100K to 200K vocabulary, but as far as I understand
         | all of that vocabulary is in use for text (reducing the token
         | cost for non-English).
        
       | rvnx wrote:
        | The author claims that the most likely explanation is that
        | Tesseract is running behind GPT-4V/4o.
        | 
        | There is no way that this is Tesseract.
        | 
        | -> Tesseract accuracy is very low; it can barely do OCR on
        | printed documents.
        
         | jerrygenser wrote:
          | Even if Tesseract accuracy is low, if the Tesseract result is
          | passed to the LLM along with the image, it can result in a much
          | more accurate OCR.
          | 
          | For example, GPT-4 with some vision capability would be able to
          | fill in the incorrect OCR using its additional understanding of
          | word co-occurrence.
          | 
          | I've tested this approach with a purely text LLM to correct OCR
          | mistakes and it works quite well.
          | 
          | Also note that some newer OCR pipelines that don't involve LLMs
          | have a vision component followed by a text-correcting model
          | that is in some ways similar to some forms of spell check,
          | which can further improve results.
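          | 
          | The text-only correction pass is easy to try, assuming
          | pytesseract and the openai client (the system prompt here is
          | just an example):
          | 
          |   import pytesseract
          |   from PIL import Image
          |   from openai import OpenAI
          | 
          |   # Raw (possibly noisy) OCR text from Tesseract.
          |   raw = pytesseract.image_to_string(Image.open("scan.png"))
          | 
          |   client = OpenAI()
          |   resp = client.chat.completions.create(
          |       model="gpt-4o",
          |       messages=[
          |           {"role": "system", "content": "You clean up OCR output. Fix "
          |            "obviously misrecognized characters using context; do not "
          |            "add or drop content."},
          |           {"role": "user", "content": raw},
          |       ],
          |   )
          |   print(resp.choices[0].message.content)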
        
           | lyu07282 wrote:
            | You can tell that the OCR fails more in cases without natural
            | language, like code or random characters. OAI seems to claim
            | 4o is a fully end-to-end multimodal model, but we will never
            | know for sure; we can't trust a single word OpenAI is saying.
        
         | kherud wrote:
         | Shouldn't this theory be testable? The response time for an
         | image of the same size should remain constant (assuming a
         | generated response of constant size). You could then try to put
         | an increasing amount of text inside of the image. If this text
         | is fed to the LLM using OCR, the total amount of tokens grows.
         | You should then be able to observe an increase in response
         | time.
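          | 
          | A crude version of that experiment (openai client plus PIL;
          | wall-clock time is noisy, so you'd want many repetitions per
          | word count):
          | 
          |   import base64, io, time
          |   from PIL import Image, ImageDraw
          |   from openai import OpenAI
          | 
          |   client = OpenAI()
          | 
          |   def probe(n_words):
          |       # Same 512x512 canvas every time, just more embedded text.
          |       img = Image.new("RGB", (512, 512), "white")
          |       text = " ".join(f"word{i}" for i in range(n_words))
          |       wrapped = "\n".join(text[i:i + 80] for i in range(0, len(text), 80))
          |       ImageDraw.Draw(img).multiline_text((4, 4), wrapped, fill="black")
          |       buf = io.BytesIO()
          |       img.save(buf, format="PNG")
          |       b64 = base64.b64encode(buf.getvalue()).decode()
          |       start = time.time()
          |       client.chat.completions.create(
          |           model="gpt-4o",
          |           max_tokens=1,  # hold the output length constant
          |           messages=[{"role": "user", "content": [
          |               {"type": "text", "text": "Reply with the single word OK."},
          |               {"type": "image_url", "image_url": {
          |                   "url": f"data:image/png;base64,{b64}", "detail": "low"}},
          |           ]}],
          |       )
          |       return time.time() - start
          | 
          |   for n in (0, 100, 200, 400):
          |       print(n, "words:", round(probe(n), 2), "s")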
        
         | RicoElectrico wrote:
         | Yeah, Tesseract is barely production quality.
        
           | lyu07282 wrote:
           | yeah it was SOTA in 2006, 18 years ago
        
             | jascha_eng wrote:
             | Other than proprietary models, what is better than it
             | today? Just asking in case I ever need OCR and don't want
             | to pay the cloud providers for it :D
        
               | lyu07282 wrote:
                | Check out https://github.com/mindee/doctr or
                | https://github.com/VikParuchuri/surya for something
                | practical.
                | 
                | A multimodal LLM would of course blow it all out of the
                | water, so some llama3-like model is probably SOTA in
                | terms of what you can run yourself. Something like
                | https://huggingface.co/blog/idefics2
        
         | freedmand wrote:
          | Agreed. Tesseract is not able to handle handwriting or
          | distorted text well (e.g. colored text over an image
          | background), to the point that it would hurt any downstream
          | LLM trying to make sense of the contents. It won't even pick
          | out bounding boxes.
         | 
         | I doubt they are running an OCR model, but if they actually
         | were it would likely be an in-house one trained with more
         | modern techniques.
        
         | llm_trw wrote:
         | Because no one knows how to prep the images. With the right
         | file type and resolution I get under a single character error
         | per 10 pages and it's been that good since the late 00s.
        
           | yorwba wrote:
           | How do _you_ prep the images?
        
             | llm_trw wrote:
              | My hourly rate starts at $300. If you'd like to hire me
             | you're more than welcome to. I've done this work for a
             | number of companies in the past.
        
           | alach11 wrote:
           | With handwriting? With mixed fonts? Tesseract requires
           | _heavy_ customization and extension to perform reasonably on
           | these workloads. The off-the-shelf options from major cloud
           | providers blow it out of the water.
        
             | llm_trw wrote:
              | Never had to use it with handwriting; mixed fonts and text
              | where location carries semantic information: absolutely.
        
       | surfingdino wrote:
        | OCR is hard:
        | https://www.vice.com/en/article/gvy4gb/one-mans-david-and-go...
        
       | joelburget wrote:
       | Vision transformers should be our default guess as to how GPT-4o
       | works, yet this article never mentions them.
        
       | valine wrote:
        | LLaVA 1.6, InternVL and CogVLM2 can all do OCR with nothing but tiled
       | image embeddings and an LLM. Feeding in OCR results from
       | tesseract improves the reliability of the transcript, especially
       | for long strings of random characters, but it's not strictly
       | necessary for the model to read the text out of the image.
       | 
       | Clip embeddings can absolutely "read" text if the text is large
       | enough. Tiling enables the model to read small text.
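        | 
        | The tiling step itself is trivial; roughly what these pipelines do
        | (a sketch with PIL, tile size assumed to be 512):
        | 
        |   from PIL import Image
        | 
        |   def tile(path, tile_size=512):
        |       """Return a downscaled overview plus full-resolution 512x512 crops."""
        |       img = Image.open(path)
        |       overview = img.copy()
        |       overview.thumbnail((tile_size, tile_size))
        |       tiles = []
        |       for top in range(0, img.height, tile_size):
        |           for left in range(0, img.width, tile_size):
        |               tiles.append(img.crop((left, top,
        |                                      min(left + tile_size, img.width),
        |                                      min(top + tile_size, img.height))))
        |       return overview, tiles
        | 
        |   overview, tiles = tile("page.png")
        |   print(len(tiles), "tiles plus 1 overview image")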
        
         | Onawa wrote:
         | Do you know of any guides or tutorials to doing this? I tried
         | using the MiniCPM model for this task, but it just OCRed a tiny
         | bit of information then told me that it couldn't extract the
         | rest.
        
           | pwillia7 wrote:
           | I bet you could get this working in
           | https://github.com/comfyanonymous/ComfyUI
           | 
           | I have done some other LLava stuff in it
        
             | 3abiton wrote:
             | I thought ComfyUI was mainly for SD. I should get into the
             | game again.
        
               | lagniappe wrote:
               | You can build just about anything with it
        
             | pests wrote:
              | Thanks, I've been trying to remember the name of this
              | project for weeks now.
        
         | tictacttoe wrote:
          | I found LLaVA to be disappointing, but Claude Haiku is quite
          | good.
        
         | cpursley wrote:
         | How well does this work on complex data tables?
        
         | qeternity wrote:
          | They can do it. They cannot do it particularly well compared
         | to SoTA OCR systems.
        
       | iknownothow wrote:
        | I'm probably wrong, but the author may have misunderstood input
        | embeddings. Input embeddings are just dictionary lookup
       | tables. The tokenizer generates tokens and for each token you
       | find its embedding from the lookup.
       | 
       | The author is speculating about an embedding model but in reality
       | they're speculating about the image-tokenizer.
       | 
       | If I'm not wrong the text tokenizer Tiktoken has a dictionary
       | size of 50k. The image tokenizer could have a very large
       | dictionary size or a very small dictionary size. The 170 tokens
       | this image tokenizer generates might actually have repeating
       | tokens!
       | 
       | EDIT: PS. What I meant to say was that input embeddings do not
       | come from another trained model. Tokens come from other trained
       | models. The input embedding matrix undergoes back propagation
       | (learning). This is very important. This allows the model to move
       | the embeddings of the tokens together or apart as it sees fit. If
       | you use embeddings from another model as input embeddings, you're
       | basically adding noise.
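        | 
        | In PyTorch terms, the distinction I'm drawing looks something like
        | this (a toy sketch; the 170-vector image path is purely
        | hypothetical):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   d_model, vocab_size = 512, 1000              # toy sizes
        |   tok_emb = nn.Embedding(vocab_size, d_model)  # learned with the LLM
        | 
        |   text_ids = torch.tensor([[3, 14, 159]])      # ids from a tokenizer
        |   text_vecs = tok_emb(text_ids)                # (1, 3, d_model) lookup
        | 
        |   # Hypothetical image path: some encoder yields 170 feature vectors,
        |   # and a trained projection maps them into the same d_model space.
        |   image_feats = torch.randn(1, 170, 1024)      # stand-in encoder output
        |   proj = nn.Linear(1024, d_model)              # learned jointly, not frozen
        |   image_vecs = proj(image_feats)               # (1, 170, d_model)
        | 
        |   sequence = torch.cat([image_vecs, text_vecs], dim=1)
        |   print(sequence.shape)                        # torch.Size([1, 173, 512])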
        
         | iknownothow wrote:
          | I've pondered it a bit more and I was the one who was mistaken.
          | I think the author made great observations. It's just that I
          | don't _want_ to go back to non-token thinking. I don't want
          | there to be a 13x13xE final output from the CNN. I really
          | _want_ there to be a visual vocabulary from which tokens are
          | chosen. And I want this visual vocabulary to be
          | fixed/untrainable/deterministic. That'd be very cool.
          | 
          | But why only choose 13x13 + 1? :(
          | 
          | I'm willing to bet that the author's conclusion of embeddings
          | coming from CNNs is wrong. However, I cannot get the 13x13 + 1
          | observation out of my head. They've definitely hit on something
          | there. I'm with them that there is very likely a CNN involved,
          | and I'm going to bet that the final filters and kernels are the
          | visual vocabulary.
          | 
          | And how do you go from 50k convolutional kernels (think tokens)
          | to always 170 chosen tokens for any image? I don't know...
        
         | kolinko wrote:
          | Input embeddings are taken from a dictionary in the case of
          | text tokens, but they don't need to be - they can be any
          | vector, really.
        
           | iknownothow wrote:
            | But don't input embeddings need to undergo backprop during
            | training? Won't the external model's embeddings just be
            | noise, since they don't share an embedding space with the
            | model that is being trained?
            | 
            | If the external model also undergoes training along with the
            | main model, then I think that might work.
        
       | riemannzeta wrote:
       | Love this curious and open-minded exploration of how this stuff
       | works.
       | 
       | The pyramid strategy loosely tracks with renormalization group
       | theory, which has been formally studied for years as a method of
       | interpreting machine learning models:
       | 
       | https://arxiv.org/abs/1410.3831
       | 
       | I love the convergence we're seeing in the use of models from
       | different fields to understand machine learning, fundamental
       | physics, and human consciousness. What a time to be alive.
        
       | yorwba wrote:
       | It would be interesting to see what happens when you slightly
       | shift the grid of objects until they're split across multiple
       | tiles, and how that affects accuracy.
        
       | rafaelero wrote:
       | They are very likely using VQVAE to create a dictionary of tokens
       | and then just converting images into them with an encoder.
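        | 
        | The quantization step of a VQ-VAE/VQGAN is simple to sketch (toy
        | sizes; the codebook and grid shapes here are made up):
        | 
        |   import torch
        | 
        |   codebook = torch.randn(8192, 256)        # vocabulary of 8192 visual tokens
        |   encoder_out = torch.randn(13, 13, 256)   # e.g. a 13x13 grid of latents
        | 
        |   # Snap each latent to its nearest codebook entry; the index is the token.
        |   flat = encoder_out.reshape(-1, 256)                   # (169, 256)
        |   dists = torch.cdist(flat, codebook)                   # (169, 8192)
        |   token_ids = dists.argmin(dim=1)                       # 169 image tokens
        |   quantized = codebook[token_ids].reshape(13, 13, 256)  # decoder input
        |   print(token_ids.shape)                                # torch.Size([169])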
        
         | lisperforlife wrote:
          | Why is this not the top comment? FAIR published their CM3leon
          | paper about decoder-only autoregressive models that work with
          | both text and image tokens. I believe GPT-4o's vocabulary has
          | room for both image and audio tokens. For audio tokens, they
          | probably trained an RVQ-VAE model like EnCodec or SoundStream.
        
         | HarHarVeryFunny wrote:
         | Wouldn't that be more applicable to image generation, or at
         | least wanting to encode the image as a whole?
         | 
         | If you need to be able to reason about multiple objects in the
         | image and their relative positions, then don't you need to use
         | a tiled approach?
        
           | rafaelero wrote:
           | VQVAE is trained to reconstruct the image, so in theory it
           | should contain all the information (both content and
           | location) inside its embeddings.
        
       | cs702 wrote:
       | One possibility is that mapping images to a token embedding
       | consumes ~170x more compute+space than mapping a token id.
       | 
       | Another possibility is that OpenAI is mapping each image to ~170
       | vectors in an embedding space that is shared with token IDs. If
       | that's the case, the architecture of the image-to-fixed-number-
       | of-tokens model has not been disclosed. It could be a standard
       | CNN, a ViT-like model, an autoencoder, a model that routes a
       | variable number of vectors with RGB data to a fixed number of
        | vectors, or something else that has not yet been published. The
        | whole thing is likely trained end-to-end.
        
         | CuriouslyC wrote:
          | At some point we're going to go from tokens to embeddings for
          | everything. I saw some research on variable-length embeddings;
          | I wouldn't be surprised if someone generated a huge embedding
          | space, did some form of PCA on the generated embeddings, threw
          | away the low-eigenvalue directions, then trained a distilled
          | model that generated variable-length embeddings directly from
          | that.
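          | 
          | Something in that spirit, with NumPy and stand-in data (the
          | sample embeddings here are random, so the retained-variance
          | number is meaningless):
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   embeddings = rng.normal(size=(5_000, 1536))   # stand-in for real ones
          | 
          |   centered = embeddings - embeddings.mean(axis=0)
          |   # Principal directions via SVD; squared singular values are the
          |   # eigenvalues of the covariance (up to a constant factor).
          |   U, S, Vt = np.linalg.svd(centered, full_matrices=False)
          | 
          |   k = 256                                       # keep the top-k directions
          |   reduced = centered @ Vt[:k].T                 # (5_000, 256)
          |   explained = (S[:k] ** 2).sum() / (S ** 2).sum()
          |   print(reduced.shape, f"{explained:.1%} of variance retained")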
        
           | cs702 wrote:
           | _> At some point we 're going to go from tokens to embeddings
           | for everything._
           | 
           | Yes, I agree.
           | 
           | Further down the road, I imagine we will end up finding
           | interesting connections to the symbolic approaches of GOFAI,
           | given that the embedding of a token, object, concept, or
           | other entity in some vector space is basically a _kind of
           | symbol_ that represents that token, object, concept, or
           | entity in that vector space.
           | 
           | Interestingly, old terms like "representation" and "capsule,"
           | which didn't become as widely adopted as "embedding," tried
           | more explicitly to convey this idea of using vectors/matrices
           | of feature activations to stand in for objects, concepts, and
           | other entities.
           | 
            | For example, see Figure 1 in this paper from 2009-2012:
            | http://www.cs.princeton.edu/courses/archive/spring13/cos598C...
            | -- it's basically what we're talking about!
        
       | sva_ wrote:
       | Great article. Perhaps some part of this magic number simply
       | factors in the amount of compute necessary to run the image
       | through the CNN (proportional to compute use per token in the
       | LM).
        
       | jmount wrote:
        | Scanning images is quite the problem in the presence of
        | compression (and now interpolation):
        | https://www.bbc.com/news/technology-23588202
        
       | geor9e wrote:
       | Nit: the implied premise that this isn't a beautiful and skilled
        | painting:
        | https://www.oranlooney.com/post/gpt-cnn_files/malicious_dogs...
        
         | olooney wrote:
          | The painting is _Charlie and Sheba_, from the Museum of Bad
         | Art:
         | 
         | https://museumofbadart.org/zoo/
         | 
         | I found it while Googling for a test image for the malicious
         | prompt test, where it was used as the lead photo for this blog
         | post:
         | 
         | https://www.artsy.net/article/artsy-editorial-bad-art-good
         | 
         | There's definitely something eye-catching about it that really
         | makes it stand out from the crowd.
        
       | HarHarVeryFunny wrote:
       | I don't think a 13x13 tiling (of N channels/features) can be
       | ruled out just because it can't recognize a grid of 13x13
       | objects. There is presumably a lot of overlap between the
       | receptive fields of the tiles (due to kernel step sizes).
       | 
       | A pyramid of overlapped tiling resolutions is of course possible
       | too.
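        | 
        | The overlap is easy to see with the usual receptive-field
        | recurrence r_out = r_in + (k - 1) * jump; with a made-up stack of
        | strided convolutions, each output cell sees far more than its
        | ~39px share of a 512px image:
        | 
        |   # Hypothetical stack of (kernel, stride) conv layers, just to
        |   # illustrate the effect.
        |   layers = [(7, 2), (3, 2), (3, 2), (3, 2), (3, 2)]
        | 
        |   rf, jump = 1, 1
        |   for k, s in layers:
        |       rf += (k - 1) * jump   # receptive field grows by (k-1) * current jump
        |       jump *= s              # pixel spacing between adjacent output cells
        |   print("receptive field per output cell:", rf, "px")      # 67px here
        |   print("spacing between neighbouring cells:", jump, "px") # 32px -> overlap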
        
       | ComputerGuru wrote:
       | We desperately need a modern open source replacement for
       | tesseract built on current SoTA ML tech. It is insane that we are
       | resorting to using LLMs -- which aside from being the wrong tool
       | and far too overpowered for the job also are prone to
       | hallucinations, have insanely expensive training and inference
       | costs, etc -- for this purpose because the "best" non-LLM
        | solution is so bad it can't even correctly OCR monospaced hi-res
        | scans of ASCII text with sufficient accuracy.
        
         | asadm wrote:
          | Hmm, I haven't tried, but does Apple's OCR API do better here?
          | I.e., is it possible to do it?
        
           | rgovostes wrote:
            | The API:
            | https://developer.apple.com/documentation/vision/recognizing...
           | 
           | In my experience it works remarkably well for features like
           | scanning documents in Notes and in copying or translating
           | text embedded in images in Safari.
           | 
           | It is not open source, but free to use locally. Someone has
           | written a Python wrapper (apple-ocr) around it if you want to
           | use it in other workflows. The model files might be in
           | /System/Library/PrivateFrameworks/TextRecognition.framework
           | if you wanted to port them to other platforms.
        
             | nexuist wrote:
             | I also wrote a Swift CLI that wraps over the Vision
             | framework: https://github.com/nexuist/seev
             | 
             | Text extraction is included (including the ability to
             | specify custom words not found in the dictionary) but there
             | are also utilities for face detection, classification, etc.
        
         | rvnx wrote:
         | One good self-hosted OCR is PaddleOCR,
         | https://github.com/PaddlePaddle/PaddleOCR
         | 
         | Beats everything else, truly international and multi-lingual,
         | including Chinese (as it is made in China)
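          | 
          | Basic usage is pleasantly short (from memory of the project
          | README, so double-check against the repo; models download on
          | first run):
          | 
          |   from paddleocr import PaddleOCR
          | 
          |   ocr = PaddleOCR(use_angle_cls=True, lang="en")  # "ch", "fr", ... too
          |   result = ocr.ocr("scan.png", cls=True)
          |   for box, (text, confidence) in result[0]:
          |       print(f"{confidence:.2f}  {text}")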
        
           | paul-tharun wrote:
            | It is insanely fast compared to alternatives and has really high
           | accuracy even on new tasks without any training.
           | 
           | Their PaddleLayout models are also miles ahead compared to
           | LayoutParser or TableTransformers in both inference speed and
           | output quality
        
           | ComputerGuru wrote:
           | Why is it "self-hosted" and not "library + desktop/cli app"?
           | "Self-hosted" implies it need a full web stack and rdbms
           | backend?
        
             | rvnx wrote:
              | It was just to show that you can run it locally, as opposed
              | to the "cloud APIs" referred to in the thread, but you are
              | right, the more correct term is "local".
        
           | jakderrida wrote:
            | I think that's Baidu. I remember
            | https://github.com/PaddlePaddle/ from when ERNIE 3.0 was
            | released, back when text encoder models weren't yet forgotten
            | amid the progress of decoder-only ones.
        
         | rfoo wrote:
          | There are certainly smaller and even better models for OCR.
          | 
          | But the whole "point" of LLMs (forget it, it's not AGI) is that
          | you don't need to build many specialized models and cursed
          | pipelines anymore to solve a definitely-in-reach-without-LLM
          | problem your farmer neighbor wants to pay $500 for.
          | 
          | Before LLMs it wasn't going to be done, as it would take more
          | than $500 of engineering hours. Now we just brute-force it.
          | Sure, more compute, but we get it done!
         | 
         | I guess your OCR dream is covered by this.
        
         | orbital-decay wrote:
         | A good open source model for handwriting recognition is sorely
         | missing as well.
        
           | nine_k wrote:
           | Often in humans, too, depending on the badness of the
           | particular handwritten word.
        
           | ComputerGuru wrote:
           | The United States Postal Service probably has the best in the
           | world, though its training probably restricts it to a subset
           | of possible inputs. I wonder if it would be possible to get a
           | senator or congressman to push for open sourcing it.
        
         | AndrewKemendo wrote:
         | Fully agree
         | 
          | Improving OCR would require innovation within CV, separate
          | from transformer architectures, and frankly I don't expect
          | much new work to happen here.
        
         | daemonologist wrote:
         | Has anyone tried Kosmos [0] ? I came across it the other day
         | and it looked shiny and interesting, but I haven't had a chance
         | to put it to the test much yet.
         | 
         | [0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5
        
       | imranhou wrote:
        | Not to be nit-picky, but double-checking myself: isn't a token
        | just 0.75 words, so 170 tokens would be 127 words and not 227?
        
       | enjoylife wrote:
       | > Interestingly enough, it's actually more efficient to send text
       | as images: A 512x512 image with a small but readable font can
       | easily fit 400-500 tokens worth of text, yet you're only charged
       | for 170 input tokens plus the 85 for the 'master thumbnail' for a
       | grand total of 255 tokens--far less than the number of words on
       | the image.
       | 
       | Sounds like an arbitrage opportunity for all those gpt wrappers.
       | Price your cost per token the same, send over the prompt via
       | image, pocket the difference?
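        | 
        | A back-of-the-envelope check of the arbitrage, using tiktoken and
        | PIL (assumes tiktoken knows the gpt-4o encoding, which recent
        | versions do, and that the rendered text is actually legible at
        | the default font size):
        | 
        |   import tiktoken
        |   from PIL import Image, ImageDraw
        | 
        |   text = open("prompt.txt").read()
        |   enc = tiktoken.encoding_for_model("gpt-4o")
        |   text_tokens = len(enc.encode(text))
        | 
        |   # Render the same text onto a single 512x512 tile.
        |   img = Image.new("RGB", (512, 512), "white")
        |   wrapped = "\n".join(text[i:i + 90] for i in range(0, len(text), 90))
        |   ImageDraw.Draw(img).multiline_text((4, 4), wrapped, fill="black")
        |   img.save("prompt.png")
        | 
        |   image_tokens = 85 + 170   # master thumbnail + one high-detail tile
        |   print(f"{text_tokens} tokens as text vs {image_tokens} as an image")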
        
       | SubiculumCode wrote:
        | I'm not sure how GPT-4o routes information. If a picture that
        | contains text is submitted, does the text then get resubmitted to
        | GPT-4o as a textual query, or do the model weights themselves
        | essentially transform the textual images into textual tokens? I
        | do wonder whether a response to text in images is similar to a
        | response to text queries, i.e. processed by the same weights.
        
       | sashank_1509 wrote:
        | I would be disappointed if OpenAI had a separate model for OCR,
        | though I guess that is believable. Much cooler if the LLM just
        | understands language straight from the image.
        
       | comboy wrote:
        | I love how well this is written. Definitely "look how interesting
        | this is" rather than "look how much I know". And it dives as deep
        | as it needs to, while being accessible to almost everyone. One
        | really needs to master a topic to be able to describe it simply.
        | Great job.
        
       | jamesy0ung wrote:
        | I've always wondered how text-to-image models like Stable
        | Diffusion work. Do they just encode RGB values into a matrix and
        | then have a helper tool convert that data into a JPG?
        
       ___________________________________________________________________
       (page generated 2024-06-07 23:01 UTC)