[HN Gopher] GPT-4 Turbo with Vision Generally Available
___________________________________________________________________
GPT-4 Turbo with Vision Generally Available
Author : davidbarker
Score : 157 points
Date : 2024-04-09 18:53 UTC (4 hours ago)
(HTM) web link (platform.openai.com)
(TXT) w3m dump (platform.openai.com)
| tucnak wrote:
| This is just responding to Anthropic, right? Funny how it took
| competition for them to make Vision-class models available.
| chimney wrote:
| Probably the Gemini 1.5 Pro GA that was announced hours
| earlier.
| choilive wrote:
| That's typically how competition works, right? Your competitors
| push you to do better.
| maciejgryka wrote:
| Competition is great! But in this case, I don't know, adding
| JSON mode and function calling were pretty obvious next steps
| for the vision model - I bet it'd have happened anyway.
| minimaxir wrote:
| GPT-4-Vision has been around for a while in beta; it's just GA
| now.
|
| It's expensive though: Anthropic's Claude Haiku can process
| images significantly cheaper.
| nuz wrote:
| People like to say AI is moving blazingly fast these days, but
| this has been like a year in the waiting queue. Guessing Sora
| will take equally long, if not way longer, before the general
| audience gets to touch it.
| brandon272 wrote:
| I could be misjudging the situation entirely, but Sora seems
| like it is on a much longer "general availability" timeline.
| freedomben wrote:
| Yeah, OpenAI's head of something (CTO?) said late fall, if
| it's safe enough. Gonna be a while until us normal people get
| our hands on it.
| CuriouslyC wrote:
| I think with Sora, "general availability" will be a much more
| expensive, higher-tiered sub with a limited number of gens per
| day, and I have my doubts that you'll just be able to sign up
| for this sub through the web; I wouldn't be surprised if it's
| an invite-only partners thing.
| belter wrote:
| Sounds like a complaint about a chair in the sky...
| https://youtu.be/8r1CZTLk-Gk
| arcanemachiner wrote:
| I think about this bit a lot, mostly while I'm being mildly
| inconvenienced by something.
| throwup238 wrote:
| The chair in the sky keeps on turnin'... and I don't know if
| I'll have access tomorrow.
| dbbk wrote:
| Haven't seen this in years, love this video
| dvfjsdhgfv wrote:
| Could someone explain why on their documentation page
| (https://help.openai.com/en/articles/8555496-gpt-4-vision-api)
| they link to
| https://web.archive.org/web/20240324122632/https://platform....
| instead of https://platform.openai.com/docs/guides/vision? As a
| person who pays OpenAI a lot of money each month, I see this as a
| bit parasitic (unless they donate substantially to the Internet
| Archive).
| blowski wrote:
| I would assume a mistake, albeit one that raises questions
| about their QA processes. A static HTML page is hardly going to
| break the bank.
| SushiHippie wrote:
| Seems to be changed to the correct link now
| simonw wrote:
| They also added both JSON and function support to the vision
| model - previously it didn't have those.
|
| This means you can now use gpt-4-turbo vision to extract
| structured data from an image!
|
| I was previously using a nasty hack where I'd run the image
| through the vision model to extract just the text, then run that
| text through regular gpt-4-turbo to extract structured data. I
| ditched that hack just now:
| https://github.com/datasette/datasette-extract/issues/19
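|
| For anyone curious, a minimal sketch of what that single-step
| call can look like with the openai Python client (the prompt,
| field names and image URL are just placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     # Ask the vision-capable model for structured output directly.
|     response = client.chat.completions.create(
|         model="gpt-4-turbo",
|         response_format={"type": "json_object"},  # JSON mode
|         messages=[{
|             "role": "user",
|             "content": [
|                 {"type": "text",
|                  "text": "Extract 'title', 'date' and 'total' "
|                          "from this receipt as JSON."},
|                 {"type": "image_url",
|                  "image_url": {"url": "https://example.com/receipt.png"}},
|             ],
|         }],
|     )
|     print(response.choices[0].message.content)  # a JSON string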
| maciejgryka wrote:
| Heh, cool to hear you're doing something like this too! We
| ended up somewhere similar, but we also need good spatial
| relationships, which GPT4V isn't great at, so we're using
| another OCR system and adding the result to the context.
| MuffinFlavored wrote:
| > This means you can now use gpt-4-turbo vision to extract
| structured data from an image!
|
| How consistent and reliable is the extracted structure?
|
| Did they add any kind of mechanism where, for whatever token it
| generates/thinks comes next, it "unit tests" it / makes sure it
| passes some sort of rules?
| maciejgryka wrote:
| My experience is that it's pretty good at reading text and
| pretty bad at understanding layouts. So e.g. asking it to
| work with tables is asking for trouble.
| dontupvoteme wrote:
| Yeah it's absolutely horrible at layouts.
|
| I'm not 100% sure it's related but if you ask it to draw
| bounding boxes around things it's always off by quite a
| bit.
| a_wild_dandan wrote:
| You restrict the model's next prediction to valid JSON
| tokens. (If you mean format reliability.)
| rockwotj wrote:
| I'm waiting to be able to restrict the output to a specific
| JSON Schema.
| jgalt212 wrote:
| > How consistent and reliable is the extracted structure?
|
| That's the $100B question.
| szundi wrote:
| Or $T these days
| simonw wrote:
| It's pretty good, but it's not reliable enough to exclude the
| need to check everything it does.
|
| Same story as basically everything relating to LLMs to be
| honest.
| joshstrange wrote:
| In my testing I was better off running the image through AWS
| Textract, then taking the output and feeding it to OpenAI. It
| was also much cheaper. Of course, if all you are looking for
| is extraction, then maybe you don't need OpenAI at all. I used
| it to clean up the OCR'd data and reformat it.
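|
| A rough sketch of that pipeline with boto3 and the openai
| client (file name, region, model and prompt are placeholders):
|
|     import boto3
|     from openai import OpenAI
|
|     textract = boto3.client("textract", region_name="us-east-1")
|     llm = OpenAI()
|
|     # 1. OCR the image with Textract.
|     with open("invoice.png", "rb") as f:
|         ocr = textract.detect_document_text(Document={"Bytes": f.read()})
|     lines = [b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE"]
|
|     # 2. Have the LLM clean up and reformat the raw OCR text.
|     resp = llm.chat.completions.create(
|         model="gpt-4-turbo",
|         messages=[{"role": "user",
|                    "content": "Clean up and reformat this OCR output:\n"
|                               + "\n".join(lines)}],
|     )
|     print(resp.choices[0].message.content)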
| Lucasoato wrote:
| It's very consistent. Check out this guy; he was able to
| structure LLM output using Pydantic in an elegant solution:
|
| https://www.youtube.com/watch?v=yj-wSRJwrrc
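|
| The general pattern is to define a Pydantic model and validate
| whatever JSON the LLM returns against it, retrying on failure.
| A minimal sketch (pydantic v2; the schema and raw string are
| just examples):
|
|     from pydantic import BaseModel, ValidationError
|
|     class Receipt(BaseModel):
|         title: str
|         date: str
|         total: float
|
|     # Pretend this came back from the model in JSON mode.
|     raw = '{"title": "Groceries", "date": "2024-04-09", "total": 42.5}'
|
|     try:
|         receipt = Receipt.model_validate_json(raw)
|     except ValidationError as err:
|         # Retry the LLM call (optionally feeding the error back) or bail.
|         print(err)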
| geepytee wrote:
| One of the OpenAI PMs was also saying the model got
| substantially better at math:
| https://x.com/owencm/status/1777770827985150022
|
| I'm trying it for coding and have added it to my VS Code
| Copilot extension. Overall I'd say it's better at coding than
| the previous GPT-4 Turbo. https://double.bot if anyone wants to
| try it :)
| g9yuayon wrote:
| Is being good at math that important to ChatGPT users, though?
| ChatGPT's ability to do math is so limited that I'm not sure
| what math problems we would ask it to solve.
| DiggyJohnson wrote:
| It means you don't have to be as sketched out if you're
| looking for something that requires basic math. Imagine
| generating the correct result of a unit test or something.
| I wouldn't trust it either way, but I think this is a
| believable example.
| pama wrote:
| Math is not just arithmetic and yes better math does help
| at least some GPT-4 users.
| abrichr wrote:
| We've been using `gpt-4-1106-vision-preview` and simply
| prompting the model to return json, with excellent results:
| https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in
| progress).
| philip1209 wrote:
| `Moderation` doesn't support images yet, I believe. Does anybody
| have a good image moderation API they are using?
| maciejgryka wrote:
| It's cool to see this update, and JSON mode and function
| calling will both be useful with vision. I wonder, though,
| whether there were any other specific changes to the models
| since the `preview` versions besides that?
| jessenaser wrote:
| Why does the production model merge vision with the base turbo
| model such that the output limit remains 2048 tokens instead
| of 4096?
|
| If we are using just text, why is the maximum output size
| still reduced?
| tedsanders wrote:
| That's our mistake - it's still 4096. Assuming you saw this on
| the Playground, we'll fix it shortly. If you saw it somewhere
| else, please let me know.
|
| https://platform.openai.com/playground/chat?model=gpt-4-turb...
| jessenaser wrote:
| Yes I saw it in the playground. I will keep checking for
| updates for 4096. Thank you for the clarification.
|
| Edit (4:57 ET): "gpt-4-turbo" shows the updated 4096 in
| playground. "gpt-4-turbo-2024-04-09" remains 2048 in
| playground.
| greatpostman wrote:
| Apparently Gemini by Google has a 20% LLM market share.
| ceejayoz wrote:
| According to whom? How would you assess it?
| chimney wrote:
| Probably saw this tweet from Nat Friedman:
| https://twitter.com/natfriedman/status/1777739863678386268
| ceejayoz wrote:
| This methodology would ignore every API-driven use of the
| models that doesn't go through the first-party web
| interfaces.
| brcmthrowaway wrote:
| KPIs going down
| ilaksh wrote:
| For all of these posts that say function calling is now
| available, I feel like it's actually more of an optimization
| than a new capability.
|
| All of the leading edge models will output JSON in any format
| you ask for, such as {"fn_name": {"arg1":10}}. I think this is
| about making it more accurate and having a standard
| input/output format.
| minimaxir wrote:
| You have to specify the schema for that regardless.
|
| Function calling has been available for text inputs for a
| while; now it's also available for image inputs. OpenAI's
| function calling/structured data mode is much more strict and
| reliable at following the schema than just putting "return the
| output in JSON" in a system prompt.
| hansonw wrote:
| Yes. But also note that the new function calling is actually
| "tool calling" where the model is also fine-tuned to expect and
| react to the _output_ of the function (and there are various
| other nuances like being able to call multiple functions in
| parallel and matching up the outputs to function calls
| precisely).
|
| When used in multi-turn "call/response" mode it actually does
| start to unlock some new capabilities.
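|
| A bare-bones sketch of that call/response loop with the openai
| Python client (the get_weather tool and its output are made up
| for illustration, and it assumes the model actually decides to
| call the tool):
|
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()
|     tools = [{
|         "type": "function",
|         "function": {
|             "name": "get_weather",  # hypothetical tool
|             "parameters": {
|                 "type": "object",
|                 "properties": {"city": {"type": "string"}},
|                 "required": ["city"],
|             },
|         },
|     }]
|
|     messages = [{"role": "user", "content": "Weather in Oslo?"}]
|     first = client.chat.completions.create(
|         model="gpt-4-turbo", messages=messages, tools=tools)
|     call = first.choices[0].message.tool_calls[0]
|
|     # Run the tool ourselves, then hand the result back to the model.
|     args = json.loads(call.function.arguments)
|     messages.append(first.choices[0].message)
|     messages.append({
|         "role": "tool",
|         "tool_call_id": call.id,
|         "content": json.dumps({"city": args["city"], "temp_c": 7}),
|     })
|     final = client.chat.completions.create(
|         model="gpt-4-turbo", messages=messages, tools=tools)
|     print(final.choices[0].message.content)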
| yieldcrv wrote:
| I've used LLaVA a little bit through LM Studio, but it's really
| subpar, mostly due to the GUI I think. Is there a model and GUI
| that's better than LM Studio for vision adapters?
|
| GPT-4V in OpenAI's chat interface is so seamless. LLM, text,
| speech input, speech output with tones and emotion, image
| generation, vision input, and soon it's going to be outputting
| video with Sora...
|
| It's kind of amusing that early-2023 GPT-4 is still the
| benchmark of competition for both closed source and open source
| models, even as the lead keeps expanding.
| robterrell wrote:
| What do you dislike about it?
| yieldcrv wrote:
| The interface, needing to load multiple adapters, and not
| having a language model available at the same time. LM Studio
| is way better for language models only at the moment.
| htrp wrote:
| Is gpt-4-turbo-2024-04-09 basically an updated version of the
| gpt-4-1106-vision-preview ?
| IshKebab wrote:
| Slightly OT question - can I use GPT-4 vision to drive a web
| browser, e.g. to automate tasks like "sign up for this website
| using this email and password; don't subscribe to promotional
| emails"?
| wewtyflakes wrote:
| I believe tying it together might be a challenge. For example,
| if you were to use the model to get the text of buttons, you
| would still have to write code to find the HTML elements for
| those buttons and drive the click/fill actions.
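|
| A rough idea of what that glue could look like with Selenium
| (the URL, prompt and XPath mapping are hand-waved; real pages
| need more robust element matching):
|
|     from openai import OpenAI
|     from selenium import webdriver
|     from selenium.webdriver.common.by import By
|
|     driver = webdriver.Chrome()
|     driver.get("https://example.com/signup")
|     screenshot_b64 = driver.get_screenshot_as_base64()
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4-turbo",
|         messages=[{
|             "role": "user",
|             "content": [
|                 {"type": "text",
|                  "text": "What is the exact label on the sign-up "
|                          "button? Answer with the label only."},
|                 {"type": "image_url",
|                  "image_url": {"url": "data:image/png;base64,"
|                                       + screenshot_b64}},
|             ],
|         }],
|     )
|     label = resp.choices[0].message.content.strip()
|
|     # You still have to map the model's answer back onto the DOM.
|     driver.find_element(
|         By.XPATH, f"//button[contains(., '{label}')]").click()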
| peterleiser wrote:
| I've used GPT-4 to help with selenium and it gets the answer
| eventually, but almost never on the first try. So automating
| this without human intervention sounds tricky.
| hubraumhugo wrote:
| Their naming and versioning choices have definitely created some
| confusion. Apparently GPT-4 Turbo doesn't just come with the
| new vision capabilities, but also with improved non-vision
| capabilities:
|
| > "I'm hoping we can get evals out shortly to help quantify this.
| Until then - it's a new model with various data and training
| improvements, resulting in better reasoning."[0]
|
| > "- major improvements across the board in our evals (especially
| math)"[1]
|
| [0] https://twitter.com/owencm/status/1777784000712761430
|
| [1] https://twitter.com/stevenheidel/status/1777789577438318625
| MyFirstSass wrote:
| As a regular user of ChatGPT-4, this press release makes
| little sense to me.
|
| What does Vision mean? I've already been able to upload docs,
| create images, etc. for a while now?
|
| What are the "other improvements", what is Turbo, what is 4.5,
| and what is this new one called?
|
| How do I even see what version of the model I'm using in their
| interface when it just says "4"?
| tedsanders wrote:
| Worth noting that GPT can be accessed both through ChatGPT
| and via the OpenAI API. The link in this thread is pointing
| to documentation for the OpenAI API.
|
| Vision means the model can see image inputs.
|
| In the API, GPT with vision was previously available in a
| limited capacity (only some people, and didn't work with all
| features, like JSON mode and function calling). Now this
| model is available to everyone and it works with JSON mode
| and function calling. It also should be smarter at some
| tasks.
|
| This model is now available in the API, and will roll out to
| users in ChatGPT. In the API, it's named
| `gpt-4-turbo-2024-04-09`. In ChatGPT, it will be under the
| umbrella of GPT-4.
| andrewstuart wrote:
| "a fix for a bug" <laugh emoji>
|
| gpt-3.5-turbo-0125 New
|
| Updated GPT 3.5 Turbo The latest GPT-3.5 Turbo model with higher
| accuracy at responding in requested formats and a fix for a bug
| which caused a text encoding issue for non-English language
| function calls. Returns a maximum of 4,096 output tokens. Learn
| more.
| ugh123 wrote:
| Over Easter I asked GPT-4 to count a mess of colored eggs on the
| floor that I was preparing for an egg hunt. They were mostly
| evenly separated and clearly visible (there were just 36).
|
| I gave it two tries to respond and it wasn't even close to the
| correct answer.
|
| Was it confused by colored eggs vs. the "natural" eggs it might
| have been expecting? Should it have understood what I meant?
| ravenstine wrote:
| I imagine it would be better at describing an image in a
| general sense, but probably isn't processing it in a way where
| it would actually count individual features. I could be wrong
| about that, but it seems like a combination of traditional CV
| and an LLM might be what's needed for more precise feature
| identification.
| 0x008 wrote:
| The use case of "counting objects" is basically completely
| solved already by yolov8. There is no need to use an LLM for
| that.
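|
| For example, with the ultralytics package (pretrained COCO
| weights; COCO has no "egg" class, so counting eggs specifically
| would need a model fine-tuned on egg images):
|
|     from ultralytics import YOLO
|
|     model = YOLO("yolov8n.pt")          # pretrained COCO weights
|     results = model("easter_eggs.jpg")  # list of Results, one per image
|     print(f"detected {len(results[0].boxes)} objects")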
| Sharlin wrote:
| I'm pretty sure LLM vision capabilities are right now limited
| to something similar to subitizing in humans at best, i.e.
| being able to perceive the number of items without counting
| when there are fewer than ~7. Expecting it to actually _count_
| objects is a bit too much.
| fzzzy wrote:
| LLMs are very bad at counting.
| tkgally wrote:
| I've been disappointed with both GPT-4's and Gemini 1.5's image
| recognition abilities in general, not just counting. When I
| have asked them to describe a photo containing multiple objects
| --a street scene, a room--they identify some of the objects
| correctly but invariably hallucinate others, naming things that
| are not present. Usually the hallucinations are of things that
| might appear in similar photos but definitely are not in the
| photo I gave them.
| bilsbie wrote:
| Is it worth still paying the $20?
| ndr_ wrote:
| This news piece is, first and foremost, about the Model, not
| the ChatGPT System. (More about the difference between "Model"
| and "System": https://ndurner.github.io/antropic-claude-amazon-
| bedrock). Not sure what their upgrade policy/process for
| ChatGPT is like, though.
| pama wrote:
| As per Steven Heidel's tweet, this version will be released to
| ChatGPT soon.
| ChrisLTD wrote:
| Is it possible to upload an image to the OpenAI Chat Playground
| to try it out?
| ndr_ wrote:
| Apparently not. But here:
| https://huggingface.co/spaces/ndurner/oai_chat
| anonymousDan wrote:
| Can I ask, how are people affording to play around with GPT-4?
| Are you all doing it at work? Or is there some way I'm unaware
| of to keep the costs down enough to play around with it for
| experimenting? It's so expensive!
| ShamelessC wrote:
| How much are you using it? I have been accessing gpt-4-turbo
| via the API in a small Discord with a few friends using it as
| well. I've never gone above $5/month in usage.
| alexpogosyan wrote:
| What software do people use to interact with these models via chat?
| ndr_ wrote:
| https://huggingface.co/spaces/ndurner/oai_chat (bring your own
| API key)
| jiggawatts wrote:
| Rubs crystal ball: Widespread availability in Azure will take
| four months. No wait, it's just a software change, I'm being
| silly... six months.
___________________________________________________________________
(page generated 2024-04-09 23:01 UTC)