[HN Gopher] How Does GPT-4o Encode Images?
___________________________________________________________________
How Does GPT-4o Encode Images?
Author : olooney
Score : 267 points
Date : 2024-06-07 12:54 UTC (10 hours ago)
(HTM) web link (www.oranlooney.com)
(TXT) w3m dump (www.oranlooney.com)
| simonw wrote:
| The way this article tests GPT-4o's performance by feeding in a
| 7x7 grid of colored shapes and requesting them back as JSON
| (about halfway down the page) is really clever.
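| A minimal sketch of how such a test image could be generated
| (PIL assumed; the grid size, colors, and shapes here are
| illustrative, not the article's exact script):
|
|     import random
|     from PIL import Image, ImageDraw
|
|     COLORS = ["red", "green", "blue", "orange", "purple"]
|     SHAPES = ["circle", "square", "triangle"]
|
|     def make_grid(n=7, cell=64):
|         img = Image.new("RGB", (n * cell, n * cell), "white")
|         draw = ImageDraw.Draw(img)
|         for row in range(n):
|             for col in range(n):
|                 x, y = col * cell, row * cell
|                 box = [x + 8, y + 8, x + cell - 8, y + cell - 8]
|                 color = random.choice(COLORS)
|                 shape = random.choice(SHAPES)
|                 if shape == "circle":
|                     draw.ellipse(box, fill=color)
|                 elif shape == "square":
|                     draw.rectangle(box, fill=color)
|                 else:  # triangle
|                     draw.polygon(
|                         [(x + cell // 2, y + 8),
|                          (x + 8, y + cell - 8),
|                          (x + cell - 8, y + cell - 8)],
|                         fill=color)
|         return img
|
|     make_grid().save("grid.png")  # then ask for JSON back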
| blixt wrote:
| I did something similar when GPT-4V came out, partly with the
| goal of figuring out the input format (I did not get anywhere
| beyond "magic vectors"), but also to roughly estimate the
| amount of data you can get back out of a 512x512 (the low
| quality option) image.
|
| What I found is that you can sometimes get more text out of an
| 85-token image than you can out of 85 tokens of text! That
| said, I think there will be plenty of edge cases where it
| actually loses some information, and maybe you could argue
| that even if you removed every other word from the text, the
| model could still restore it.
|
| I never went deeper on this, but I believe there's something
| clever to be done in the context window with the fact that
| images are relatively cheap tokens-wise.
| _dark_matter_ wrote:
| The author mentions this in the article: more than 170 tokens'
| worth of text can be pulled from an image.
| blixt wrote:
| Ah, you're right! My bad!
| GaggiX wrote:
| An important aspect that is not considered in the article is
| that GPT-4o can generate images by itself (even though the
| feature is not enabled for the public), meaning it's very
| likely trained on sequential image tokens, with images
| quantized using a VQGAN. My guess is that the VQGAN takes
| 512x512 images and outputs 13x13 tokens (169 image tokens plus
| a special token). The VQGAN could be a convolutional network
| as shown in the article; for a transformer-based VQGAN I
| cannot think of a configuration with overlapping patches that
| would output 13x13 tokens for a 512x512 image (unless they
| simply pad the entire image by 4 and the patches do not
| overlap).
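| The arithmetic behind that guess, for what it's worth (the
| 13x13 grid is conjecture, not a published figure):
|
|     grid = 13                   # conjectured latent grid size
|     image_tokens = grid * grid  # 169
|     total = image_tokens + 1    # plus one special token
|     print(total)                # 170, the per-tile token cost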
| edude03 wrote:
| How do we know it generates the images itself and isn't
| passing the text to DALL-E? That's supposedly how the current
| GPT-4 model does listen mode (with Whisper, but same idea).
| GaggiX wrote:
| Go to the "Explorations of capabilities" section and go
| through all the examples: https://openai.com/index/hello-gpt-4o/
|
| You cannot get this level of control by prompting DALL-E.
| Also, GPT-4o isn't using Whisper (older GPT-4s did).
| ec109685 wrote:
| At least ChatGPT-4o still looks like it is using DALL-E.
|
| https://x.com/krishnanrohit/status/1755123169353236848?s=46
| simonw wrote:
| Something I don't get is why OpenAI don't provide clear,
| comprehensive documentation as to how this actually works.
|
| I get that there's competition from other providers now so they
| have an instinct to keep implementation details secret, but as
| someone building on their APIs this lack of documentation really
| holds me back.
|
| To make good judgements about how to use this stuff I need to
| know how it works!
|
| I had a hilarious bug a few weeks ago where I loaded in a single
| image representing multiple pages of a PDF and GPT-4 vision
| effectively hallucinated the contents of the document when asked
| to OCR it, presumably because the image was too big and was first
| resized to a point where the text was illegible:
| https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
|
| If OpenAI had clear documentation about how their image handling
| works I could avoid those kinds of problems much more
| effectively.
| Onawa wrote:
| I was trying to figure out this exact same issue. OCR on a PDF
| worked great, up until a certain point when it just started
| hallucinating like crazy. I was working on a whole pipeline to
| just feed in a PDF one page at a time to try and get around
| this issue. Otherwise, the OCR works fantastically well
| compared to all the other tools I've been trying lately. These
| include OCRmyPDF (Tesseract), SuryaOCR, and some of the models
| on the Visual LLM Leaderboard.
|
| I've also seen some people recommend Paddle OCR, but I find
| their documentation to be lacking and I haven't got that one
| working yet to evaluate.
| infecto wrote:
| For document text/table extraction, nothing beats the quality
| from the cloud providers. It can get costly but the accuracy
| is much higher than what you will find using the OpenAI API.
| raybb wrote:
| Simon Willison recently had a thread going through some of the
| options here: https://x.com/simonw/status/1797526667797442773
| Onawa wrote:
| Funny enough, Simon Willison is the OP of this comment
| thread, lol.
| infecto wrote:
| But they do document that the images are resized, and they
| give you some rough guidelines on how you should size your
| images. Low resolution is 1024 x 1024 with no tiling, and high
| resolution starts at 2048 x 2048, which then gets tiled. It
| could use further documentation, but it is enough to know that
| more than one page should never be sent as a single image via
| the API.
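| A rough cost calculator based on the resize-then-tile rules as
| I understand them (fit within 2048x2048, scale the shortest
| side down to 768, then 170 tokens per 512px tile plus an
| 85-token base; the exact numbers may change as the docs do):
|
|     import math
|
|     def image_tokens(width, height, detail="high"):
|         if detail == "low":
|             return 85
|         # Step 1: fit the image within a 2048x2048 square.
|         scale = min(1.0, 2048 / max(width, height))
|         w, h = width * scale, height * scale
|         # Step 2: scale the shortest side down to 768px.
|         if min(w, h) > 768:
|             scale = 768 / min(w, h)
|             w, h = w * scale, h * scale
|         # Step 3: count 512x512 tiles, 170 tokens each + base.
|         tiles = math.ceil(w / 512) * math.ceil(h / 512)
|         return 170 * tiles + 85
|
|     print(image_tokens(1024, 1024))  # 765 = 4 tiles + base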
| alach11 wrote:
| Right. But I still have a lot of questions. How does the
| model handle when something important overlaps multiple tiles
| in high-resolution mode? Am I better off doing the tiling
| myself with some overlap?
| nolok wrote:
| The fact that it's so eager to hallucinate random things that
| sound plausible enough if you're not paying attention, without
| warning you or giving any error, should make people reconsider
| using it for "data journalism" or similar.
|
| If you make your system and it "works", then how will you see
| the one time out of X where it confidently provides you false
| information that you happily use because it usually works?
| TeMPOraL wrote:
| > _how will you see the one time out of X where it
| confidently provides you false information that you happily
| use because it usually works?_
|
| You don't. You treat it like you would a human worker: set
| your process to detect or tolerate wrong output. If you
| can't, don't apply this tool to your work.
| IanCal wrote:
| This is true but misses a key fact: typical LLM errors are
| different from human errors. Not that they're worse or better,
| just that you need to understand where and when they're more
| likely to make mistakes and how to manage that.
| simonw wrote:
| Right, that's why I've been recommending dedicated OCR tools
| (Textract etc) over vision LLMs:
| https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
| ilaksh wrote:
| There is an effectively infinite number of possibilities of
| things people could throw at it and they can't know ahead of
| time whether your use case will work or not. Even if they told
| you exactly how it worked, you wouldn't know for sure until you
| tried it. And giving a vague explanation wouldn't help you
| either.
| resters wrote:
| Is there documentation (is it possible?) on how to upload a PDF
| to gpt-4o using the API?
| simonw wrote:
| I think you have to split it into one image per page and then
| upload each page separately. That's how I've been doing it.
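| A minimal sketch of that workflow, assuming pdf2image (a
| wrapper around poppler) is installed; the file names are
| placeholders:
|
|     from pdf2image import convert_from_path
|
|     pages = convert_from_path("report.pdf", dpi=200)
|     for i, page in enumerate(pages, start=1):
|         page.save(f"report-page-{i:03d}.png")
|     # Each PNG then goes into its own image input on the API
|     # call, rather than one giant stitched-together image.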
| tantalor wrote:
| > CLIP embeds the entire image as a single vector, not 170 of
| them.
|
| Single token?
|
| > GPT-4o must be using a different, more advanced strategy
| internally
|
| Why?
| freediver wrote:
| The embeddings do not offer the level of fidelity needed to
| recognize fine details in an image, handwriting for example.
| blixt wrote:
| I went through a similar journey back when GPT-4V came out.
| Here's an additional puzzle for you: GPT-4V knows the _exact_
| pixel dimensions of the image (post-resize since there is a max
| size for images in the pipeline, besides 512x512), but I'm 99%
| sure it's not provided as text tokens. How am I so sure? It's
| easy to get GPT to divulge everything from system prompt to tool
| details, etc. but I've tried every trick in the book and then
| some, multiple times over, and there is no way to get it to quote
| the dimensions as text. The only way to get it to give you the
| dimensions is to tell it to output a structure that contains
| width and height and just pick something reasonable, and they
| will "randomly" be the correct values:
|
| https://x.com/blixt/status/1722298733470024076
| dannyw wrote:
| Perhaps images aren't tokens at all... and 170 tokens is just
| an approximation of the compute cost.
| qarl wrote:
| They address this question in the article.
| blixt wrote:
| I think that would have pretty serious implications for the
| transformer architecture, though. If they're not embedded like
| text tokens, how would attention, etc., work? And a
| conversation with multiple images back and forth? Not to
| mention GPT-4o now having audio support as well. I would
| assume images do become tokens.
| llm_trw wrote:
| > It's easy to get GPT to divulge everything from system prompt
| to tool details,
|
| It's easy enough to get it to hallucinate those things. It
| doesn't actually tell them to you.
| blixt wrote:
| I'm well aware of that, but there are plenty of ways to
| induce verbatim quoting from "hidden" information, and mostly
| verify it (through sampling a large number of times in
| separate runs).
|
| Models are improving in truly hiding or ignoring information
| these days though. As the author of the article states,
| you'll have a hard time tricking GPT-4o into reading text in
| images as instructions, most likely thanks to this research:
| https://openai.com/index/the-instruction-hierarchy/
|
| I do feel pretty confident that when the model happily spits
| out its system prompt and all the metadata around the image,
| but not its pixel dimensions, those dimensions were probably
| not provided in any system/assistant/tool message. So maybe
| part of the image embeddings also encodes the pixel dimensions
| somehow (it would also help the model not think of the image
| as a squished square for non-1:1 images that have been resized
| to 512x512).
| alach11 wrote:
| I really hope we see improvements to the resolutions large
| multimodal models can handle. Right now this patchwork approach
| leads to lots of unwieldy workarounds in applications.
| eminence32 wrote:
| I'm assuming that the tokens used to encode an image are entirely
| distinct from the tokens used to encode text. Does anyone know if
| this is actually the case?
| tempusalaria wrote:
| It's probable that there is a separate vision encoder which
| projects the image tiles into the embedding space of the text
| tokens, à la CLIP/LLaVA.
| blixt wrote:
| I would assume it has a "mode" token where it switches between
| text/image (and now audio), or you'd have to try to maximize
| the number of reserved tokens between multiple modes. GPT-4o
| did go from 100K to 200K vocabulary, but as far as I understand
| all of that vocabulary is in use for text (reducing the token
| cost for non-English).
| rvnx wrote:
| The author claims that the most likely explanation is that
| there is Tesseract running behind ChatGPT-4V/o.
|
| There is no way that this is Tesseract.
|
| Tesseract's accuracy is very low; it can barely do OCR on
| printed documents.
| jerrygenser wrote:
| Even if Tesseract's accuracy is low, passing the Tesseract
| result to the LLM in addition to the image can result in much
| more accurate OCR.
|
| For example, GPT-4 with some vision capability would be able
| to fix the incorrect OCR using its additional understanding of
| word co-occurrence.
|
| I've tested this approach with a purely text LLM to correct
| OCR mistakes and it works quite well.
|
| Also note that in some newer OCR pipelines that don't involve
| LLMs, there is a vision component and then a text correcting
| model that is in some ways similar to some forms of spell
| check, which can further improve results.
| lyu07282 wrote:
| You can tell that the OCR fails more in cases without natural
| language, like code or random characters. OpenAI seems to
| claim 4o is a fully end-to-end multimodal model, but we will
| never know for sure; we can't trust a single word OpenAI says.
| kherud wrote:
| Shouldn't this theory be testable? The response time for an
| image of the same size should remain constant (assuming a
| generated response of constant size). You could then try to put
| an increasing amount of text inside the image. If this text is
| fed to the LLM using OCR, the total number of tokens grows.
| You should then be able to observe an increase in response
| time.
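| A rough sketch of that experiment, assuming the openai Python
| client (the request shape below is my best guess at the
| vision API and may need adjusting):
|
|     import base64, time
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def time_image_query(path):
|         b64 = base64.b64encode(open(path, "rb").read()).decode()
|         start = time.perf_counter()
|         client.chat.completions.create(
|             model="gpt-4o",
|             max_tokens=1,  # hold the output length constant
|             messages=[{
|                 "role": "user",
|                 "content": [
|                     {"type": "text",
|                      "text": "Describe this image."},
|                     {"type": "image_url",
|                      "image_url": {"url":
|                          f"data:image/png;base64,{b64}"}},
|                 ],
|             }],
|         )
|         return time.perf_counter() - start
|
|     # Compare timings for images holding more and more text;
|     # if OCR text were injected as extra tokens, prompt
|     # processing time should grow with it.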
| RicoElectrico wrote:
| Yeah, Tesseract is barely production quality.
| lyu07282 wrote:
| yeah it was SOTA in 2006, 18 years ago
| jascha_eng wrote:
| Other than proprietary models, what is better than it
| today? Just asking in case I ever need OCR and don't want
| to pay the cloud providers for it :D
| lyu07282 wrote:
| Check out https://github.com/mindee/doctr or
| https://github.com/VikParuchuri/surya for something practical.
|
| A multimodal LLM would of course blow it all out of the water,
| so some Llama-3-like model is probably SOTA in terms of what
| you can run yourself. Something like
| https://huggingface.co/blog/idefics2
| freedmand wrote:
| Agreed. Tesseract is not able to handle handwriting or text
| that is distorted well, e.g. colored text over an image
| background -- to the point that it would hurt any downstream
| LLM trying to make sense of the contents. It won't even pick
| out bounding boxes.
|
| I doubt they are running an OCR model, but if they actually
| were it would likely be an in-house one trained with more
| modern techniques.
| llm_trw wrote:
| Because no one knows how to prep the images. With the right
| file type and resolution I get under one character error per
| 10 pages, and it's been that good since the late 00s.
| yorwba wrote:
| How do _you_ prep the images?
| llm_trw wrote:
| My hourly rate starts at $300. If you'd like to hire me
| you're more than welcome to. I've done this work for a
| number of companies in the past.
| alach11 wrote:
| With handwriting? With mixed fonts? Tesseract requires
| _heavy_ customization and extension to perform reasonably on
| these workloads. The off-the-shelf options from major cloud
| providers blow it out of the water.
| llm_trw wrote:
| Never had to use it with handwriting; mixed fonts and text
| where location carries semantic information, absolutely.
| surfingdino wrote:
| OCR is hard:
| https://www.vice.com/en/article/gvy4gb/one-mans-david-and-go...
| joelburget wrote:
| Vision transformers should be our default guess as to how GPT-4o
| works, yet this article never mentions them.
| valine wrote:
| LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing
| but tiled image embeddings and an LLM. Feeding in OCR results
| from Tesseract improves the reliability of the transcript,
| especially for long strings of random characters, but it's not
| strictly necessary for the model to read the text out of the
| image.
|
| CLIP embeddings can absolutely "read" text if the text is
| large enough. Tiling enables the model to read small text.
| Onawa wrote:
| Do you know of any guides or tutorials to doing this? I tried
| using the MiniCPM model for this task, but it just OCRed a tiny
| bit of information then told me that it couldn't extract the
| rest.
| pwillia7 wrote:
| I bet you could get this working in
| https://github.com/comfyanonymous/ComfyUI
|
| I have done some other LLaVA stuff in it.
| 3abiton wrote:
| I thought ComfyUI was mainly for SD. I should get into the
| game again.
| lagniappe wrote:
| You can build just about anything with it
| pests wrote:
| Thanks, I've been trying to remember the name of this project
| for weeks now.
| tictacttoe wrote:
| I found LLaVA to be disappointing, but Claude Haiku is quite
| good.
| cpursley wrote:
| How well does this work on complex data tables?
| qeternity wrote:
| They can do it. They can not do it particularly well compared
| to SoTA OCR systems.
| iknownothow wrote:
| I'm probably wrong, but the author may have misunderstood
| input embeddings. Input embeddings are just dictionary lookup
| tables. The tokenizer generates tokens and for each token you
| find its embedding from the lookup.
|
| The author is speculating about an embedding model but in reality
| they're speculating about the image-tokenizer.
|
| If I'm not wrong the text tokenizer Tiktoken has a dictionary
| size of 50k. The image tokenizer could have a very large
| dictionary size or a very small dictionary size. The 170 tokens
| this image tokenizer generates might actually have repeating
| tokens!
|
| EDIT: PS. What I meant to say was that input embeddings do not
| come from another trained model. Tokens come from other trained
| models. The input embedding matrix undergoes back propagation
| (learning). This is very important. This allows the model to move
| the embeddings of the tokens together or apart as it sees fit. If
| you use embeddings from another model as input embeddings, you're
| basically adding noise.
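| For contrast, a toy sketch of the two ideas being debated here
| (a token-id lookup table vs. projecting continuous image
| features into the same space), assuming PyTorch; all the
| dimensions are made up:
|
|     import torch
|     import torch.nn as nn
|
|     d_model, vocab = 4096, 200_000
|
|     # Text path: token ids index a learned lookup table.
|     tok_emb = nn.Embedding(vocab, d_model)
|     text_ids = torch.tensor([[318, 262, 3797]])
|     text_vecs = tok_emb(text_ids)          # (1, 3, d_model)
|
|     # Image path (one hypothesis): a small encoder projects
|     # patch features straight into the same space, with no
|     # discrete ids at all.
|     img_proj = nn.Linear(1024, d_model)
|     patch_feats = torch.randn(1, 170, 1024)
|     img_vecs = img_proj(patch_feats)       # (1, 170, d_model)
|
|     # Either way, the transformer just sees one sequence.
|     x = torch.cat([img_vecs, text_vecs], dim=1)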
| iknownothow wrote:
| I've pondered it a bit more and I was the one who was
| mistaken. I think the author made great observations. It's
| just that I don't _want_ to go back to non-token thinking. I
| don't want there to be a 13x13xE final output from the CNN. I
| really _want_ there to be a visual vocabulary from which
| tokens are chosen. And I want this visual vocabulary to be
| fixed/untrainable/deterministic. That'd be very cool.
|
| But why only choose 13x13 + 1? :(
|
| I'm willing to bet that the author's conclusion of embeddings
| coming from CNNs is wrong. However, I cannot get the 13x13 + 1
| observation out of my head. The author has definitely hit on
| something there. I'm with them that there is very likely a CNN
| involved, and I'm going to bet that the final filters and
| kernels are the visual vocabulary.
|
| And how do you go from 50k convolutional kernels (think
| tokens) to always 170 chosen tokens for any image? I don't
| know...
| kolinko wrote:
| Input embeddings are taken from a dictionary in the case of
| text tokens, but they don't need to be - they can be any
| vector, really.
| iknownothow wrote:
| But don't input embeddings need to undergo backprop during
| training? Won't the external model's embeddings just be noise,
| since they don't share an embedding space with the model that
| is being trained?
|
| If the external model also undergoes training along with the
| main model, then I think that might work.
| riemannzeta wrote:
| Love this curious and open-minded exploration of how this stuff
| works.
|
| The pyramid strategy loosely tracks with renormalization group
| theory, which has been formally studied for years as a method of
| interpreting machine learning models:
|
| https://arxiv.org/abs/1410.3831
|
| I love the convergence we're seeing in the use of models from
| different fields to understand machine learning, fundamental
| physics, and human consciousness. What a time to be alive.
| yorwba wrote:
| It would be interesting to see what happens when you slightly
| shift the grid of objects until they're split across multiple
| tiles, and how that affects accuracy.
| rafaelero wrote:
| They are very likely using a VQ-VAE to create a dictionary of
| tokens and then just converting images into those tokens with
| an encoder.
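| A toy sketch of the vector-quantization step being described
| (nearest-codebook lookup), assuming PyTorch; the codebook and
| grid sizes are illustrative:
|
|     import torch
|
|     codebook = torch.randn(8192, 256)   # image "vocabulary"
|     latents = torch.randn(13 * 13, 256) # encoder output
|
|     # Replace each latent with its nearest codebook entry.
|     dists = torch.cdist(latents, codebook)  # (169, 8192)
|     token_ids = dists.argmin(dim=1)         # 169 discrete ids
|     quantized = codebook[token_ids]         # decoder/LLM input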
| lisperforlife wrote:
| Why is this not the top comment? FAIR published their CM3leon
| paper about decoder-only autoregressive models that work with
| both text and image tokens. I believe GPT-4o's vocabulary has
| room for both image and audio tokens. For audio tokens, they
| probably trained an RVQ-VAE model like EnCodec or SoundStream.
| HarHarVeryFunny wrote:
| Wouldn't that be more applicable to image generation, or at
| least wanting to encode the image as a whole?
|
| If you need to be able to reason about multiple objects in the
| image and their relative positions, then don't you need to use
| a tiled approach?
| rafaelero wrote:
| VQVAE is trained to reconstruct the image, so in theory it
| should contain all the information (both content and
| location) inside its embeddings.
| cs702 wrote:
| One possibility is that mapping images to a token embedding
| consumes ~170x more compute+space than mapping a token id.
|
| Another possibility is that OpenAI is mapping each image to ~170
| vectors in an embedding space that is shared with token IDs. If
| that's the case, the architecture of the image-to-fixed-number-
| of-tokens model has not been disclosed. It could be a standard
| CNN, a ViT-like model, an autoencoder, a model that routes a
| variable number of vectors with RGB data to a fixed number of
| vectors, or something else that has not yet been ublished. The
| whole thing is likely trained end-to-end.
| cs702 wrote:
| *that has not yet been published
| CuriouslyC wrote:
| At some point we're going to go from tokens to embeddings for
| everything. I saw some research on variable-length embeddings;
| I wouldn't be surprised if someone generated a huge embedding
| space, did some form of PCA on the generated embeddings, threw
| away low-eigenvalue vectors, and then trained a distilled
| model that generated variable-length embeddings directly from
| that.
| cs702 wrote:
| _> At some point we 're going to go from tokens to embeddings
| for everything._
|
| Yes, I agree.
|
| Further down the road, I imagine we will end up finding
| interesting connections to the symbolic approaches of GOFAI,
| given that the embedding of a token, object, concept, or
| other entity in some vector space is basically a _kind of
| symbol_ that represents that token, object, concept, or
| entity in that vector space.
|
| Interestingly, old terms like "representation" and "capsule,"
| which didn't become as widely adopted as "embedding," tried
| more explicitly to convey this idea of using vectors/matrices
| of feature activations to stand in for objects, concepts, and
| other entities.
|
| For example, see Figure 1 in this paper from 2009-2012:
| http://www.cs.princeton.edu/courses/archive/spring13/cos598C...
| -- it's basically what we're talking about!
| sva_ wrote:
| Great article. Perhaps some part of this magic number simply
| factors in the amount of compute necessary to run the image
| through the CNN (proportional to compute use per token in the
| LM).
| jmount wrote:
| Scanning images is quite the problem in the presence of
| compression (and now interpolation)
| https://www.bbc.com/news/technology-23588202 .
| geor9e wrote:
| Nit: the implied premise that this isn't a beautiful and
| skilled painting:
| https://www.oranlooney.com/post/gpt-cnn_files/malicious_dogs...
| olooney wrote:
| The painting is _Charlie and Sheba_, from the Museum of Bad
| Art:
|
| https://museumofbadart.org/zoo/
|
| I found it while Googling for a test image for the malicious
| prompt test, where it was used as the lead photo for this blog
| post:
|
| https://www.artsy.net/article/artsy-editorial-bad-art-good
|
| There's definitely something eye-catching about it that really
| makes it stand out from the crowd.
| HarHarVeryFunny wrote:
| I don't think a 13x13 tiling (of N channels/features) can be
| ruled out just because it can't recognize a grid of 13x13
| objects. There is presumably a lot of overlap between the
| receptive fields of the tiles (due to kernel step sizes).
|
| A pyramid of overlapped tiling resolutions is of course possible
| too.
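| For illustration, one way a convolution can tile a 512x512
| input into a 13x13 grid with heavily overlapping receptive
| fields (the kernel/stride/padding below are made-up numbers,
| not a claim about GPT-4o's actual encoder), assuming PyTorch:
|
|     import torch
|     import torch.nn as nn
|
|     # Kernel (64) larger than stride (36) -> each output cell
|     # sees a patch that overlaps its neighbors' patches.
|     conv = nn.Conv2d(3, 256, kernel_size=64, stride=36,
|                      padding=7)
|     out = conv(torch.randn(1, 3, 512, 512))
|     print(out.shape)  # torch.Size([1, 256, 13, 13])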
| ComputerGuru wrote:
| We desperately need a modern open source replacement for
| Tesseract built on current SoTA ML tech. It is insane that we
| are resorting to LLMs for this purpose -- which, aside from
| being the wrong tool and far too overpowered for the job, are
| also prone to hallucinations, have insanely expensive training
| and inference costs, etc. -- because the "best" non-LLM
| solution is so bad it can't even correctly OCR monospaced
| hi-res scans of ASCII text with sufficient accuracy.
| asadm wrote:
| Hmm, I haven't tried, but does Apple's OCR API do better
| here? I.e., is it possible to do it?
| rgovostes wrote:
| The API:
| https://developer.apple.com/documentation/vision/recognizing...
|
| In my experience it works remarkably well for features like
| scanning documents in Notes and in copying or translating
| text embedded in images in Safari.
|
| It is not open source, but free to use locally. Someone has
| written a Python wrapper (apple-ocr) around it if you want to
| use it in other workflows. The model files might be in
| /System/Library/PrivateFrameworks/TextRecognition.framework
| if you wanted to port them to other platforms.
| nexuist wrote:
| I also wrote a Swift CLI that wraps over the Vision
| framework: https://github.com/nexuist/seev
|
| Text extraction is included (including the ability to
| specify custom words not found in the dictionary) but there
| are also utilities for face detection, classification, etc.
| rvnx wrote:
| One good self-hosted OCR is PaddleOCR:
| https://github.com/PaddlePaddle/PaddleOCR
|
| It beats everything else; truly international and
| multilingual, including Chinese (as it is made in China).
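| A minimal usage sketch, assuming a recent paddleocr release
| (the exact flags vary a bit between versions):
|
|     from paddleocr import PaddleOCR
|
|     ocr = PaddleOCR(lang="en")  # downloads models on first run
|     result = ocr.ocr("invoice.png", cls=True)
|     for box, (text, confidence) in result[0]:
|         print(f"{confidence:.2f}  {text}")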
| paul-tharun wrote:
| It is insanely fast compared to alternatives and has really
| high accuracy even on new tasks without any training.
|
| Their PaddleLayout models are also miles ahead of LayoutParser
| or TableTransformers in both inference speed and output
| quality.
| ComputerGuru wrote:
| Why is it "self-hosted" and not "library + desktop/CLI app"?
| "Self-hosted" implies it needs a full web stack and RDBMS
| backend?
| rvnx wrote:
| It was just to show that you can run it locally, as opposed
| to the "cloud APIs" referred to in the thread, but you are
| right, the more correct term is "local".
| jakderrida wrote:
| I think that's Baidu. I remember
| https://github.com/PaddlePaddle/ from when Ernie 3.0 was
| released, back when text encoder models weren't yet forgotten
| amid the progress of decoder-only ones.
| rfoo wrote:
| There are certainly smaller and even better models for OCR.
|
| But the whole "point" of LLMs (forget it, it's not AGI) is
| that you don't need to build many specialized models and
| cursed pipelines anymore to solve a
| definitely-in-reach-without-LLMs problem your farmer neighbor
| wants to pay $500 for.
|
| Before LLMs it wasn't going to get done, because it takes more
| than $500 of engineering hours. Now we just brute-force it.
| Sure, more compute, but we get it done!
|
| I guess your OCR dream is covered by this.
| orbital-decay wrote:
| A good open source model for handwriting recognition is sorely
| missing as well.
| nine_k wrote:
| Often in humans, too, depending on the badness of the
| particular handwritten word.
| ComputerGuru wrote:
| The United States Postal Service probably has the best in the
| world, though its training probably restricts it to a subset
| of possible inputs. I wonder if it would be possible to get a
| senator or congressman to push for open sourcing it.
| AndrewKemendo wrote:
| Fully agree
|
| Improving OCR would require innovation within CV, separate
| from transformer architectures, and frankly I don't expect
| much new work to happen here.
| daemonologist wrote:
| Has anyone tried Kosmos [0]? I came across it the other day
| and it looked shiny and interesting, but I haven't had a chance
| to put it to the test much yet.
|
| [0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5
| imranhou wrote:
| Not to be nit-picky, but double-checking myself: isn't a token
| just 0.75 words, so 170 tokens would be about 127 words, not
| 227?
| enjoylife wrote:
| > Interestingly enough, it's actually more efficient to send text
| as images: A 512x512 image with a small but readable font can
| easily fit 400-500 tokens worth of text, yet you're only charged
| for 170 input tokens plus the 85 for the 'master thumbnail' for a
| grand total of 255 tokens--far less than the number of words on
| the image.
|
| Sounds like an arbitrage opportunity for all those GPT
| wrappers. Price your cost per token the same, send over the
| prompt via image, pocket the difference?
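| The back-of-the-envelope version (numbers from the article;
| real savings depend on how much text stays legible in a tile):
|
|     text_tokens = 450      # text rendered into one 512x512 tile
|     image_cost = 170 + 85  # one tile + the base thumbnail
|     print(text_tokens / image_cost)  # ~1.76x per billed token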
| SubiculumCode wrote:
| I'm not sure how ChatGPT-4o routes information. If a picture
| that contains text is submitted, does the text then get
| resubmitted to ChatGPT-4o as a textual query, or do the model
| weights themselves essentially transform the textual images to
| textual tokens? I do wonder if a response to the textual
| images is similar to a response to text queries... i.e.
| processed by the same weights.
| sashank_1509 wrote:
| I would be disappointed if OpenAI had a separate model for OCR,
| though I guess that is believable. Much cooler if the LLM just
| understands language from text
| comboy wrote:
| I love how well this is written. Definitely "look how
| interesting this is" rather than "look how much I know". And
| it dives as deep as it needs to, while being accessible to
| almost everyone. You really need to master a topic to be able
| to describe it simply. Great job.
| jamesy0ung wrote:
| I've always wondered how text-to-image models like Stable
| Diffusion work. Do they just encode RGB values into a matrix
| and then have a helper tool convert that data into a JPG?
___________________________________________________________________
(page generated 2024-06-07 23:01 UTC)