[HN Gopher] GPT-4 Turbo with Vision Generally Available
       ___________________________________________________________________
        
       GPT-4 Turbo with Vision Generally Available
        
       Author : davidbarker
       Score  : 157 points
       Date   : 2024-04-09 18:53 UTC (4 hours ago)
        
 (HTM) web link (platform.openai.com)
 (TXT) w3m dump (platform.openai.com)
        
       | tucnak wrote:
       | This is just responding to Anthropic, right? Funny how it took
       | competition for them to make Vision-class models available.
        
         | chimney wrote:
         | Probably the Gemini 1.5 Pro GA that was announced hours
         | earlier.
        
         | choilive wrote:
          | That's typically how competition works, right? Your
          | competitors push you to do better.
        
         | maciejgryka wrote:
         | Competition is great! But in this case, I don't know, adding
          | JSON mode and function calling was a pretty obvious next step
          | for the vision model - I bet it'd have happened anyway.
        
         | minimaxir wrote:
          | GPT-4-Vision has been around for a while in beta; it's just
          | GA now.
          | 
          | It's expensive, though: Anthropic's Claude Haiku can process
          | images significantly more cheaply.
        
       | nuz wrote:
        | People like to say AI is moving blazingly fast these days, but
        | this has been like a year in the waiting queue. Guessing Sora
        | will take equally long, if not way longer, before the general
        | audience gets to touch it.
        
         | brandon272 wrote:
         | I could be misjudging the situation entirely, but Sora seems
         | like it is on a much longer "general availability" timeline.
        
           | freedomben wrote:
            | Yeah, OpenAI's head of something (CTO?) said late fall, if
            | it's safe enough. Gonna be a while until us normal people
            | get our hands on it.
        
           | CuriouslyC wrote:
            | I think with Sora, "general availability" will be a much
            | more expensive, higher-tiered sub with a limited number of
            | gens per day. I have my doubts that you'll just be able to
            | sign up for this sub through the web; I wouldn't be
            | surprised if it's an invite-only partners thing.
        
         | belter wrote:
         | Sounds like a complaint about a chair in the sky...
         | https://youtu.be/8r1CZTLk-Gk
        
           | arcanemachiner wrote:
           | I think about this bit a lot, mostly while I'm being mildly
           | inconvenienced by something.
        
           | throwup238 wrote:
           | The chair in the sky keeps on turnin'... and I don't know if
           | I'll have access tomorrow.
        
           | dbbk wrote:
           | Haven't seen this in years, love this video
        
       | dvfjsdhgfv wrote:
        | Could someone explain why on their documentation page
        | (https://help.openai.com/en/articles/8555496-gpt-4-vision-api)
        | they link to
        | https://web.archive.org/web/20240324122632/https://platform....
        | instead of https://platform.openai.com/docs/guides/vision? As a
        | person who pays OpenAI a lot of money each month, I see this as
        | a bit parasitic (unless they donate substantially to the
        | Internet Archive).
        
         | blowski wrote:
         | I would assume a mistake, albeit one that raises questions
         | about their QA processes. A static HTML page is hardly going to
         | break their bank.
        
         | SushiHippie wrote:
          | Seems to have been changed to the correct link now.
        
       | simonw wrote:
       | They also added both JSON and function support to the vision
       | model - previously it didn't have those.
       | 
       | This means you can now use gpt-4-turbo vision to extract
       | structured data from an image!
       | 
       | I was previously using a nasty hack where I'd run the image
       | through the vision model to extract just the text, then run that
       | text through regular gpt-4-turbo to extract structured data. I
       | ditched that hack just now:
       | https://github.com/datasette/datasette-extract/issues/19
        
         | maciejgryka wrote:
          | Heh, cool to hear you're doing something like this too! We
          | ended up somewhere similar, but we also need good spatial
          | relationships, which GPT-4V isn't great at, so we're using
          | another OCR system and adding its result to the context.
        
         | MuffinFlavored wrote:
         | > This means you can now use gpt-4-turbo vision to extract
         | structured data from an image!
         | 
         | How consistent and reliable is the extracted structure?
         | 
          | Did they add any kind of mechanism to "unit test" each token
          | as it's generated / make sure the output passes some sort of
          | rules?
        
           | maciejgryka wrote:
           | My experience is that it's pretty good at reading text and
           | pretty bad at understanding layouts. So e.g. asking it to
           | work with tables is asking for trouble.
        
             | dontupvoteme wrote:
             | Yeah it's absolutely horrible at layouts.
             | 
             | I'm not 100% sure it's related but if you ask it to draw
             | bounding boxes around things it's always off by quite a
             | bit.
        
           | a_wild_dandan wrote:
           | You restrict the model's next prediction to valid JSON
           | tokens. (If you mean format reliability.)
        
             | rockwotj wrote:
              | I'm waiting to be able to restrict output to a specific
              | JSON Schema.
        
           | jgalt212 wrote:
           | > How consistent and reliable is the extracted structure?
           | 
           | That's the $100B question.
        
             | szundi wrote:
             | Or $T these days
        
           | simonw wrote:
           | It's pretty good, but it's not reliable enough to exclude the
           | need to check everything it does.
           | 
           | Same story as basically everything relating to LLMs to be
           | honest.
        
           | joshstrange wrote:
            | In my testing I was better off running the image through
            | AWS Textract and then feeding the output to OpenAI. It was
            | also much cheaper. Of course, if all you are looking for is
            | extraction, then maybe you don't need OpenAI at all. I used
            | it to clean up the OCR'd data and reformat it.
        
           | Lucasoato wrote:
            | It's very consistent. Check out this guy; he was able to
            | structure LLM output using Pydantic in an elegant solution:
           | 
           | https://www.youtube.com/watch?v=yj-wSRJwrrc
        
         | geepytee wrote:
          | One of the OpenAI PMs was also saying the model got
         | substantially better at math:
         | https://x.com/owencm/status/1777770827985150022
         | 
          | I'm trying it for coding and have added it to my VS Code
          | Copilot extension. Overall I'd say it's better at coding than
          | the previous GPT-4 Turbo. https://double.bot if anyone wants
          | to try it :)
        
           | g9yuayon wrote:
            | Is being good at math that important to ChatGPT users,
            | though? ChatGPT's ability to do math is so limited that I'm
            | not sure what math problems we would want to ask it to
            | solve.
        
             | DiggyJohnson wrote:
             | It means you don't have to be as sketched out if you're
             | looking for something that requires basic math. Imagine
             | generating the correct result of a unit test or something.
             | I wouldn't trust it either way, but I think this is a
             | believable example.
        
             | pama wrote:
              | Math is not just arithmetic, and yes, better math does
              | help at least some GPT-4 users.
        
         | abrichr wrote:
          | We've been using `gpt-4-1106-vision-preview` and simply
          | prompting the model to return JSON, with excellent results:
         | https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in
         | progress).
        
       | philip1209 wrote:
       | `Moderation` doesn't support images yet, I believe. Does anybody
       | have a good image moderation API they are using?
        
       | maciejgryka wrote:
        | It's cool to see this update, and JSON mode and function
        | calling will both be useful with vision. I wonder, though,
        | whether there were any other specific changes to the models
        | since the `preview` versions besides that.
        
       | jessenaser wrote:
        | Why does the production model merge vision with the base turbo
        | model such that the maximum output remains 2048 tokens instead
        | of 4096?
        | 
        | Even if we are using just text, why is the output size still
        | reduced?
        
         | tedsanders wrote:
         | That's our mistake - it's still 4096. Assuming you saw this on
         | the Playground, we'll fix it shortly. If you saw it somewhere
         | else, please let me know.
         | 
         | https://platform.openai.com/playground/chat?model=gpt-4-turb...
        
           | jessenaser wrote:
           | Yes I saw it in the playground. I will keep checking for
           | updates for 4096. Thank you for the clarification.
           | 
           | Edit (4:57 ET): "gpt-4-turbo" shows the updated 4096 in
           | playground. "gpt-4-turbo-2024-04-09" remains 2048 in
           | playground.
        
       | greatpostman wrote:
        | Apparently Gemini by Google has a 20% LLM market share.
        
         | ceejayoz wrote:
         | According to whom? How would you assess it?
        
           | chimney wrote:
            | Probably saw this tweet from Nat Friedman
           | https://twitter.com/natfriedman/status/1777739863678386268
        
             | ceejayoz wrote:
             | This methodology would ignore every API-driven use of the
             | models that doesn't go through the first-party web
             | interfaces.
        
       | brcmthrowaway wrote:
       | KPIs going down
        
       | ilaksh wrote:
        | For all of these posts that say function calling is now
        | available, I feel like it's actually more of an optimization
        | than a new capability.
       | 
       | All of the leading edge models will output JSON in any format you
       | ask for, such as {"fn_name": {"arg1":10}}. I think this is about
       | making it more accurate and having a standard input/output
       | format.
        
         | minimaxir wrote:
         | You have to specify the schema for that regardless.
         | 
          | Function calling has been available for text inputs for a
          | while; now it's also available for image inputs. OpenAI's
          | function calling/structured data mode is much more strict
          | and reliable at following the schema than just putting
          | "return the output in JSON" in a system prompt.
        
         | hansonw wrote:
         | Yes. But also note that the new function calling is actually
         | "tool calling" where the model is also fine-tuned to expect and
         | react to the _output_ of the function (and there are various
         | other nuances like being able to call multiple functions in
         | parallel and matching up the outputs to function calls
         | precisely).
         | 
         | When used in multi-turn "call/response" mode it actually does
         | start to unlock some new capabilities.
        
       | yieldcrv wrote:
        | I've used LLaVA a little bit through LM Studio, but it's really
        | subpar, mostly due to the GUI I think. Is there a model and GUI
        | that's better than LM Studio for vision adapters?
        | 
        | GPT-4V in OpenAI's chat interface is so seamless: LLM, text,
        | speech input, speech output with tones and emotion, image
        | generation, vision input, and soon it's going to be outputting
        | video with Sora...
        | 
        | It's kind of amusing that GPT-4 from early 2023 is still the
        | benchmark for competition among both closed-source and open-
        | source models while the lead just keeps expanding.
        
         | robterrell wrote:
         | What do you dislike about it?
        
           | yieldcrv wrote:
            | The interface, needing to load multiple adapters, and not
            | having a language model available at the same time. LM
            | Studio is way better for language-only models at the
            | moment.
        
       | htrp wrote:
        | Is gpt-4-turbo-2024-04-09 basically an updated version of
        | gpt-4-1106-vision-preview?
        
       | IshKebab wrote:
        | Slightly OT question - can I use GPT-4 vision to drive a web
        | browser, e.g. to automate tasks like "sign up for this website
        | using this email and password; don't subscribe to promotional
        | emails"?
        
         | wewtyflakes wrote:
         | I believe tying it together might be a challenge. For example,
         | if you were to use the model to get the text of buttons, you
         | would still have to write code to find the HTML elements for
         | those buttons and drive the click/fill actions.
        
         | peterleiser wrote:
         | I've used GPT-4 to help with selenium and it gets the answer
         | eventually, but almost never on the first try. So automating
         | this without human intervention sounds tricky.
        
       | hubraumhugo wrote:
       | Their naming and versioning choices have definitely created some
        | confusion. Apparently GPT-4 Turbo doesn't just come with the
       | new vision capabilities, but also with improved non-vision
       | capabilities:
       | 
       | > "I'm hoping we can get evals out shortly to help quantify this.
       | Until then - it's a new model with various data and training
       | improvements, resulting in better reasoning."[0]
       | 
       | > "- major improvements across the board in our evals (especially
       | math)"[1]
       | 
       | [0] https://twitter.com/owencm/status/1777784000712761430
       | 
       | [1] https://twitter.com/stevenheidel/status/1777789577438318625
        
         | MyFirstSass wrote:
          | As a regular user of ChatGPT-4, this press release makes
          | little sense to me.
          | 
          | What does Vision mean? I've already been able to upload
          | docs, create images, etc. for a while now.
          | 
          | What are the "other improvements"? What is Turbo, what is
          | 4.5, and what is this new one called?
          | 
          | How do I even see what version of the model I'm using in
          | their interface when it just says "4"?
        
           | tedsanders wrote:
           | Worth noting that GPT can be accessed both through ChatGPT
           | and via the OpenAI API. The link in this thread is pointing
           | to documentation for the OpenAI API.
           | 
           | Vision means the model can see image inputs.
           | 
           | In the API, GPT with vision was previously available in a
           | limited capacity (only some people, and didn't work with all
           | features, like JSON mode and function calling). Now this
           | model is available to everyone and it works with JSON mode
           | and function calling. It also should be smarter at some
           | tasks.
           | 
           | This model is now available in the API, and will roll out to
           | users in ChatGPT. In the API, it's named
           | `gpt-4-turbo-2024-04-09`. In ChatGPT, it will be under the
           | umbrella of GPT-4.
        
       | andrewstuart wrote:
       | "a fix for a bug" <laugh emoji>
       | 
       | gpt-3.5-turbo-0125 New
       | 
        | Updated GPT-3.5 Turbo: The latest GPT-3.5 Turbo model with
        | higher accuracy at responding in requested formats and a fix
        | for a bug which caused a text encoding issue for non-English
        | language function calls. Returns a maximum of 4,096 output
        | tokens. Learn more.
        
       | ugh123 wrote:
        | Over Easter I asked GPT-4 to count a mess of colored eggs on the
       | floor that I was preparing for an egg hunt. They were mostly
       | evenly separated and clearly visible (there were just 36).
       | 
       | I gave it two tries to respond and it wasn't even close to the
       | correct answer.
       | 
        | Was it confused by colored eggs vs. the "natural" eggs it might
        | have been expecting? Should it have understood what I meant?
        
         | ravenstine wrote:
          | I imagine it would be better at describing an image in a
          | general sense, but it probably isn't processing it in a way
          | that actually counts individual features. I could be wrong
          | about that, but it seems like a combination of traditional CV
          | and an LLM might be what's needed for more precise feature
          | identification.
        
           | 0x008 wrote:
           | The use case of "counting objects" is basically completely
           | solved already by yolov8. There is no need to use an LLM for
           | that.
        
         | Sharlin wrote:
          | I'm pretty sure LLM vision capabilities are right now limited
          | to something similar to subitizing in humans at best, i.e.
          | being able to perceive the number of items when there are
          | fewer than ~7 without counting. Expecting it to be able to
          | actually _count_ objects is a bit too much.
        
         | fzzzy wrote:
         | LLMs are very bad at counting.
        
         | tkgally wrote:
         | I've been disappointed with both GPT-4's and Gemini 1.5's image
         | recognition abilities in general, not just counting. When I
         | have asked them to describe a photo containing multiple objects
         | --a street scene, a room--they identify some of the objects
         | correctly but invariably hallucinate others, naming things that
         | are not present. Usually the hallucinations are of things that
         | might appear in similar photos but definitely are not in the
         | photo I gave them.
        
       | bilsbie wrote:
       | Is it worth still paying the $20?
        
         | ndr_ wrote:
         | This news piece is, first and foremost, about the Model, not
         | the ChatGPT System. (More about the difference between "Model"
         | and "System": https://ndurner.github.io/antropic-claude-amazon-
         | bedrock). Not sure what their upgrade policy/process for
         | ChatGPT is like, though.
        
           | pama wrote:
            | As per Steven Heidel's tweet, this version will be released
            | to ChatGPT soon.
        
       | ChrisLTD wrote:
       | Is it possible to upload an image to the OpenAI Chat Playground
       | to try it out?
        
         | ndr_ wrote:
         | Apparently not. But here:
         | https://huggingface.co/spaces/ndurner/oai_chat
        
       | anonymousDan wrote:
        | Can I ask, how are people affording to play around with GPT-4?
        | Are you all doing it at work? Or is there some way I am unaware
        | of to keep the costs down enough to experiment with it? It's so
        | expensive!
        
         | ShamelessC wrote:
          | How much are you using it? I have been accessing gpt-4-turbo
          | via the API in a small Discord, with a few friends using it
          | as well, and have never gone above $5/month in usage.
        
       | alexpogosyan wrote:
        | What software do people use to interact with these models via
        | chat?
        
         | ndr_ wrote:
         | https://huggingface.co/spaces/ndurner/oai_chat (bring your own
         | API key)
        
       | jiggawatts wrote:
       | Rubs crystal ball: Widespread availability in Azure will take
       | four months. No wait, it's just a software change, I'm being
       | silly... six months.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:01 UTC)