[HN Gopher] GPT-5 is behind schedule
___________________________________________________________________
GPT-5 is behind schedule
Author : owenthejumper
Score : 118 points
Date : 2024-12-22 12:29 UTC (10 hours ago)
(HTM) web link (www.wsj.com)
(TXT) w3m dump (www.wsj.com)
| A_D_E_P_T wrote:
| Counterpoint: o1-Pro is insanely good -- subjectively, it's as
| far above GPT4 as GPT4 was above 3. It's almost _too_ good. Use
| it properly for an extended period of time, and one begins to
| worry about the future of one's children and the utility of
| their schooling.
|
| o3, by all accounts, is better still.
|
| Seems to me that things are progressing quickly enough.
| apwell23 wrote:
| what do you use it for ?
| phito wrote:
| I keep reading this on HN so I believe it has to be true in
| some ways, but I don't really feel like there is any difference
| in my limited use (programming questions or explaining some
| concepts).
|
| If anything I feel like it's all been worse compared to the
| first release of ChatGPT, but I might be wearing rose colored
| glasses.
| delusional wrote:
| I'd say the same. I've tried a bunch of different AI tools,
| and none of them really seem all that helpful.
| ogogmad wrote:
| One use-case: They help with learning things quickly by
| having a chat and asking questions. And they never get
| tired or emotional. Tutoring 24/7.
|
| They also generate small code or scripts, as well as
| automate small things, when you're not sure how, but you
| know there's a way. You need to ensure you have a way to
| verify the results.
|
| They do language tasks like grammar-fixing, perfect
| translation, etc.
|
| They're 100 times easier and faster than search engines, if
| you limit your uses to that.
| vintermann wrote:
| They can't help you learn what they don't know
| themselves.
|
| I'm trying to use them to read historical handwritten
| documents in old Norwegian (Danish, pretty much). Not
| only do they not handle the German-style handwriting, but
| what they spit out looks like the sort of thing GPT-2
| would spit out if you asked it to write Norwegian (only
| slightly better than the Muppet Swedish Chef's
| Swedish). It seems the experimental tuning has made it
| _worse_ at the task I most desperately want to use it
| for.
|
| And when you think about it, how could it _not_ overfit
| in some sense, when trained on its own output? No new
| information is coming in, so it pretty much has to get
| worse at _something_ to get better at all the benchmarks.
| ben_w wrote:
| > perfect translation
|
| Hah, no. They're good, but they definitely make stuff up
| when the context gets too long. Always check their
| output, just the same as you already note they need for
| small code and scripts.
| omega3 wrote:
| Same. On every release from OpenAI or Anthropic I keep reading
| how the new model is so much better (insert hyperbole here)
| than the previous one, yet when using it I feel like they are
| mostly the same as last year.
| mathieuh wrote:
| It's the same for me. I genuinely don't understand how I can
| be having such a completely different experience from the
| people who rave about ChatGPT. Every time I've tried it's
| been useless.
|
| How can some people think it's amazing and has completely
| changed how they work, while for me it makes mistakes that a
| static analyser would catch? It's not like I'm doing anything
| remarkable, for the past couple of months I've been doing
| fairly standard web dev and it can't even fix basic problems
| with HTML. It will suggest things that just don't work at all
| and that my IDE catches; it invents APIs for packages.
|
| One guy I work with uses it extensively and what it produces
| is essentially black boxes. If I find a problem with
| something "he" (or rather ChatGPT) has produced it takes him
| ages to commune with the machine spirit again to figure out
| how to fix it, and then he still doesn't understand it.
|
| I can't help but see this as a time-bomb, how much completely
| inscrutable shite are these tools producing? In five years
| are we going to end up with a bunch of "senior engineers" who
| don't actually understand what they're doing?
|
| Before people cry "o tempora o mores" at me and make
| parallels with the introduction of high-level languages, at
| least in order to write in a high-level language you need
| some basic understanding of the logic that is being executed.
| lm28469 wrote:
| > How can some people think it's amazing and has completely
| changed how they work, while for me it makes mistakes that
| a static analyser would catch?
|
| There are a lot of code monkeys working on boilerplate
| code, these people used to rely on stack overflow and now
| that chatgpt is here it's a huge improvement for them
|
| If you work on anything remotely complex or which hasn't
| been solved 10 times on stack overflow chatgpt isn't
| remotely as useful
| globular-toast wrote:
| The ones who use it extensively are the same that used to
| hit up stackoverflow as the first port of call for every
| trivial problem that came their way. They're not really
| engineers, they just want to get stuff done.
| phist_mcgee wrote:
| No ad hominem please.
| ben_w wrote:
| > How can some people think it's amazing and has completely
| changed how they work, while for me it makes mistakes that
| a static analyser would catch? It's not like I'm
| doing anything remarkable, for the past couple of months
| I've been doing fairly standard web dev and it can't even
| fix basic problems with HTML.
|
| Part of this is, I think, anchoring and expectation
| management: you hear people say it's amazing and wonderful,
| and then you see it fall over and you're naturally
| disappointed.
|
| My formative years started off with Commodore 64 basic
| going "?SYNTAX ERROR" from most typos plus a lot of "I
| don't know what that means" from the text adventures, then
| Metrowerks' C compiler telling me there were errors on
| every line _*after but not including*_ the one where I
| forgot the semicolon, then surprises in VisualBasic and
| Java where I was getting integer division rather than
| floats, then the fantastic oddity where accidentally
| leaning on the option key on a mac keyboard while pressing
| minus turns the minus into an en dash which looked
| completely identical to a minus on the Xcode default font
| at the time and thus produced a very confusing compiler
| error...
|
| So my expectations have always been low for machine
| generated output. And it has wildly exceeded those low
| expectations.
|
| But the expectation management goes both ways, especially
| when the comparison is "normal humans" rather than "best
| practices". I've seen things you wouldn't believe...
| Entire files copy-pasted line for line, "TODO: deduplicate"
| and all. 20-minute app starts passed off as
| "optimized solutions." FAQs filled with nothing but
| Bob Ross quotes, a zen garden of "happy little
| accidents." I watched iOS developers use UI
| tests as a complete replacement for storyboards,
| bi-weekly commits, each a sprawling novel of despair,
| where every change log was a tragic odyssey.
| Google Spreadsheets masquerading as bug trackers,
| Swift juniors not knowing their ! from their ?. All
| those hacks and horrors... lost in time. Time to
| deploy.
|
| (All true, and all pre-dating ChatGPT).
|
| > It will suggest things that just don't work at all and that
| my IDE catches; it invents APIs for packages.
|
| Aye. I've even had that with models forgetting the APIs
| they themselves have created, just outside the context
| window.
|
| To me, these are tools. They're fantastic tools, but
| they're not something you can blindly fire-and-forget...
|
| ...fortunately for me, because my passive income is _not
| quite_ high enough to cover mortgage payments, and I'm
| looking for work.
|
| > In five years are we going to end up with a bunch of
| "senior engineers" who don't actually understand what
| they're doing?
|
| Yes, if we're lucky.
|
| If we're not, the models keep getting better and we don't
| have any "senior engineers" at all.
| williamcotton wrote:
| I found it very useful for writing a lexer and parser for a
| search DSL and React component recently:
|
| https://github.com/williamcotton/search-input-query
| vrighter wrote:
| first time I tried it, I asked it to find bugs in a piece
| of very well tested C code.
|
| It introduced an off-by-one error by miscounting the number
| of arguments in an sprintf call, breaking the program. And
| then proceeded to fail to find that bug that it introduced.
| jonas21 wrote:
| I think the difference comes down to interacting with it
| like IDE autocomplete vs. interacting with it like a
| colleague.
|
| It sounds like you're doing the former -- and yeah, on the
| first pass, it'll sometimes make mistakes that autocomplete
| wouldn't or generate code that's wrong or overly complex.
|
| On the other hand, I've found that if you treat it more
| like a colleague, it works wonderfully. Ask it to do
| something hard, then read the code, and ask follow-up
| questions. If you see something that's wrong, tell it, and
| ask it to fix it. If you don't understand something, ask
| for an explanation. I've found that this generates great
| code that I understand better than if I had written it from
| scratch, in a fraction of the time.
|
| It also sounds like you're asking it to do things that you
| already know how to do, in order to save time. I find that
| it's most useful in tackling things that I _don't_ know
| how to do.
|
| This takes a big shift in mindset if you've been using IDEs
| all your life and have expectations of LLMs being a fancy
| autocomplete. I wonder if kids learning to code now will
| natively understand how to get the most out of LLMs without
| the preconceptions that those of us who have been in the
| field for a while have.
| fzeroracer wrote:
| If you've ever used any enterprise software for long enough,
| you know the exact same song and dance.
|
| They release version Grand Banana. Purported to be
| approximately 30% faster with brand new features like
| Algorithmic Triple Layering and Enhanced Compulsory
| Alignment. You open the app. Everything is slower, things are
| harder to find and it breaks in new, fun ways. Your
| organization pays a couple hundred more per person for these
| benefits. Their stock soars, people celebrate the release and
| your management says they can't wait to see the improvement
| in workflows now that they've been able to lay off a quarter
| of your team.
|
| Have there been improvements in LLMs over time? Somewhat, most
| of it concentrated at the beginning (because they siphoned up
| a bunch of data in a dubious manner). Now it's just part of
| their sales cycle, to keep pumping up numbers while no one
| sees any meaningful improvement.
| anonzzzies wrote:
| Not sure what you are using it for, but it is terrible for me
| for coding; claude beats it always and hands down. o1 just
| thinks forever to come up with stuff it already tried the
| previous time.
|
| People say that's just prompting without pointing to real
| million line+ repositories or realistic apps to show how that
| can be improved. So I say they are making todo and hello world
| apps and yes, there it works really well. Claude still beats
| it, every.. single.. time..
|
| And yes, I use the Pro of all and yes, I do assume coding is
| done for most people. Become a plumber or electrician or
| carpenter.
| rubymamis wrote:
| I find that o1 and Sonnet 3.5 are good and bad quite equally
| on different things. That's why I keep asking both the same
| coding questions.
| anonzzzies wrote:
| We do the same (all requests go to o1, sonnet and gemini
| and we store the results for later to compare)
| automatically for our research: Claude always wins. Even
| with specific prompting on both platforms. Especially
| frontend it seems o1 really is terrible.
| ynniv wrote:
| Claude is trained on principles. GPT is trained on
| billions of edge cases. Which student do you prefer?
| rubymamis wrote:
| Every time I try Gemini, it's really subpar. I found that
| qwen2.5-coder-32b-instruct can be better.
|
| Also, for me it's 50/50 between Sonnet and o1, and although I'm
| not 100% sure about it, I think o1 is better with longer
| and more complicated (C++) code and debugging. At least
| from my brief testing. Also, OpenAI models seem to be
| more verbose - sometimes it's better - where I'd like
| additional explanation on chosen fields in a SQL schema,
| sometimes it's too much.
|
| EDIT: Just asked both o1 and Sonnet 3.5 the same QML
| coding question, and Sonnet 3.5 succeeded, o1 failed.
| oceanplexian wrote:
| Very anecdotal but I've found that for things that are
| well spec'd out with a good prompt Sonnet 3.5 is far
| better. For problems where I might have introduced a
| subtle logical error o1 seems to catch it extremely well.
| So better reasoning might be occurring but reasoning is
| only a small part of what we would consider intelligence.
| CapcomGo wrote:
| Wins? What does this mean? Do you have any results? I see
| the claims that Claude is better for coding a lot, but
| having used it alongside Gemini 2.0 Flash and o1, it sure
| doesn't seem like it.
| h_tbob wrote:
| That's so weird, it seems like everybody here prefers Claude.
|
| I've been using Claude and OpenAI in Copilot and I find even
| 4o seems to understand the problem better. o1 definitely
| seems to get it right more for me.
| master_crab wrote:
| Claude also has a better workflow UI. It'll maintain
| conversation context while opening up new windows to
| present code suggestions.
|
| When I was still subscribing to OpenAI (about 4 months ago)
| this didn't exist.
| rrrrrrrrrrrryan wrote:
| It exists as of last week with Canvas.
| fragmede wrote:
| If you're using the web interface of either, you might
| consider looking into tools that focus on using LLMs for
| code, so you're not copy/pasting.
| A_D_E_P_T wrote:
| They're both okay for coding, though for my use cases
| (which are niche and involve quite a lot of mathematics and
| formal logic) o1/o1-Pro is better. It seems to have a
| better native grasp of mathematical concepts, and it can
| even answer very difficult questions from vague inputs,
| e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4be
| d5b13e1...
| orbital-decay wrote:
| Different languages maybe? I find Sonnet v2 to be lacking
| in Rust knowledge compared to 4o 11-20, but excelling at
| Python and JS/TS. O1's strong side seems to be complex or
| quirky puzzle-like coding problems that can be answered in
| a short manner, it's meh at everything else, especially
| considering the price. Which is understandable given its
| purpose and training, but I have no use for it as that's
| exactly the sort of problem I wouldn't trust an LLM to
| solve.
|
| Sonnet v2 in particular seems to be a bit broken with its
| reasoning (?) feature. The one where it detects it might be
| hallucinating (what's even the condition?) and reviews the
| reply, reflecting on it. It can make it stop halfway into
| the reply and decide it wrote enough, or invent some
| ridiculous excuse to output a worse answer. Annoying,
| although it doesn't trigger too often.
| anonzzzies wrote:
| I try to sprinkle 'for us/me' everywhere as much as I can;
| we work on LoB/ERP apps mostly. These are small frontends
| to massive multi million line backends. We carved a niche
| by providing the frontends on these backends live at the
| client office by a business consultant of ours: they simply
| solve UX issues for the client on top of large ERP by using
| our tool and prompting. Everything looks modern, fresh and
| nice; unlike basically all the competitors in this space.
| It's fast and no frontend people are needed for it; backend
| is another system we built which takes a lot longer of
| course as they are complex business rules. Both claude and
| o1 turn up something that looks similar but only the claude
| version will work and be, after less prompting, correct. I
| don't have shares in either and I want open source to win;
| we have all open (more open) solutions doing all the same
| queries and we evaluate all but claude just wins. We did
| manage even big wins with openai davinci in 2022 (or so;
| before chatgpt), but this is a massive boost allowing us to
| upgrade most people to business consultant and just have
| them build with clients real time and have the tech guys
| including me add _manually_ tests and proofs (where needed)
| to know if we are actually fine. Works so much better than
| the slog with clients before; people are so bad at
| explaining what they need, it was slowly driving me
| insane after doing it for 30+ years.
| 1123581321 wrote:
| O1 is effective, but it's slow. I would expect a GPT-5 and mini
| to work as quickly as the 4 models.
| Xcelerate wrote:
| I had a 30 min argument with o1-pro where it was convinced it
| had solved the halting problem. Tried to gaslight me into
| thinking I just didn't understand the subtlety of the argument.
| But it's susceptible to appeal to authority and when I started
| quoting snippets of textbooks and mathoverflow it finally
| relented and claimed there had been a "misunderstanding". It
| really does argue like a human though now...
| radioactivist wrote:
| I had a similar experience with regular o1 about an integral
| that was divergent. It was adamant that it wasn't and would
| respond to any attempt at persuasion with variants of "it's a
| standard integral" with a "subtle cancellation". When I asked
| for any source for this standard integral it produced
| references to support its argument that existed but didn't
| actually contain the integral. When I told it the references
| didn't have the result, it backpedalled (gaslighting!) to "I
| never told you they were in there". When I pointed out that
| in fact it did, it insisted this was just a
| "misunderstanding". It only relented when I told it
| Mathematica agreed the integral was divergent. It still
| insisted it never said that the books it pointed to contained
| this (false, nonsensical) result.
|
| This was new behaviour for me to see in an LLM. Usually the
| problem is these things would just fold when you pushed back.
| I don't know which is better, but being this confidently
| wrong (and "lying" when confronted with it) is troubling.
| Animats wrote:
| > but being this confidently wrong (and "lying" when
| confronted with it) is troubling.
|
| It works in politics, marketing, and self-promotion.
|
| If you use the web as a training set, those categories
| dominate.
| layer8 wrote:
| Maybe they also trained the model on Sam Altman. ;)
| ldjkfkdsjnv wrote:
| It basically solves all bugs/programming challenges I throw at
| it, given I give it the right data.
| construct0 wrote:
| The world is figuring out how to make this technology fit and
| work and somehow this is "behind" schedule. It's almost comical.
| diego_sandoval wrote:
| Reminds me of this Louis CK joke:
|
| "I was on an airplane and there was high-speed Internet on the
| airplane. That's the newest thing that I know exists. And I'm
| sitting on the plane and they go, open up your laptop, you can
| go on the Internet.
|
| And it's fast, and I'm watching YouTube clips. It's amazing.
| I'm on an airplane! And then it breaks down. And they
| apologize, the Internet's not working. And the guy next to me
| goes, 'This is bullshit.' I mean, how quickly does the world
| owe him something that he knew existed only 10 seconds ago?"
|
| https://www.youtube.com/watch?v=me4BZBsHwZs
| mensetmanusman wrote:
| The investors need their returns now!
|
| Soon, all the middle class jobs will be converted to profits
| for the capital/data center owners, so they have to spend
| while they can before the economy crashes due to lack of
| spending.
| omega3 wrote:
| People who say "it's bullshit" are the ones that push
| technological advances forward.
| from-nibly wrote:
| Not invariably. Some of those people are the ones who want
| to draw 7 red lines all perpendicular, some with green ink,
| some with transparent and one that looks like a kitten.
| ziml77 wrote:
| For anyone who hasn't seen what this comment is
| referencing: https://www.youtube.com/watch?v=BKorP55Aqvg
| taneq wrote:
| No, people who say "it's bullshit" _and then do something
| to fix the bullshit_ are the ones that push technology
| forward. Most people who say "it's bullshit" instantly
| when something isn't perfect for exactly what they want
| right now are just whingers and will never contribute
| anything except unconstructive criticism.
| omega3 wrote:
| Sounds like "yes, but" rather than "no"; otherwise you're
| responding to a self-created straw man.
| bobxmax wrote:
| That's really not true.
| echelon wrote:
| For a company that sees itself as the undisputed leader and
| that wants to raise $7 trillion to build fabs, they deserve
| some of the heaviest levels of scrutiny in the world.
|
| If OpenAI's investment prospectus relies on them reaching AGI
| before the tech becomes commoditized, everyone is going to look
| for that weakness.
| david-gpu wrote:
| What I find odd is that o1 doesn't support attaching text
| documents to chats the way 4o does. For a model that specializes
| in reasoning, reading long documents seems like a natural feature
| to have.
| ionwake wrote:
| If Sama ever reads this, I have no idea why no users seem to
| focus on this, but it would be really good to prioritise being
| able to select which model you can use with the custom myGPTs.
| I know this may be hard or not possible without recreating
| them, but I still don't think it's possible.
|
| I don't think most customers realise how much better the models
| work with custom GPTs.
| throwaway314155 wrote:
| At this point I think it's safe to say they have given up on
| custom GPTs.
| emeg wrote:
| What makes you say that?
| jillesvangurp wrote:
| You can use the new project feature for that. That's a way of
| grouping conversations, adding files, etc. Should work with o1
| pro as well apparently.
| david-gpu wrote:
| "When using custom instructions or files, only GPT-4o is
| available". Straight out of the ChatGPT web interface when
| you try to select which model you want to use.
| PittleyDunkin wrote:
| Everyone's comparing o1 and claude, but neither really work well
| enough to justify paying for them in my experience for coding.
| What I really want is a mode where they ask _clarifying
| questions_ , ideally many of them, before spitting out an answer.
| This would greatly improve their utility, producing something
| with more value than an auto-complete.
| kelsey98765431 wrote:
| have you tested that this helps? seems pretty simple to script
| with an agent framework
| throwaway314155 wrote:
| Or just f-strings.
| Vecr wrote:
| I know multiple people that carefully prompt to get that done.
| The model outputs in direct token order, and can't turn around,
| so you need to make sure that's strictly followed. The system
| can and will come up with post-hoc "reasoning".
| qup wrote:
| Have you used them to build a system to ask you clarifying
| questions?
|
| Or even instructed them to?
| simondotau wrote:
| Just today I got Claude to convert a company's PDF protocol
| specification into an actual working python implementation of
| that protocol. It would have been uncreative drudge work for a
| human, but I would have absolutely paid a week of junior dev
| time for it. Instead I wrote it alongside AI and it took me
| barely more than an hour.
|
| The best part is, I've never written any (substantial) python
| code before.
| weird_fox wrote:
| I have to agree. It's still a bit hit or miss, but the hits
| are a huge time and money saver especially in refactoring.
| And unlike what most of the rather demeaning comments in
| those HN threads state, I am not some 'grunt' doing
| 'boilerplate work'. I mostly do geometry/math stuff, and the
| AIs really do know what they're talking about there
| sometimes. I don't have many peers I can talk to most of the
| time, and Claude is really helping me gather my thoughts.
|
| That being said, I definitely believe it's only useful for
| isolated problems. Even with Copilot, I feel like the AIs
| just lack a bigger context of the projects.
|
| Another thing that helped me was designing an initial prompt
| that really works for me. I think most people just expect to
| throw in their issue and get a tailored solution, but that's
| just not how it works in my experience.
| OutOfHere wrote:
| It would seem you don't care too much about verifying its
| output or about its correctness. If you did, it wouldn't take
| you just an hour. I guess you'll let correctness be someone
| else's problem.
| coreyh14444 wrote:
| Just tell it to do that and it will. Whenever I ask an AI for
| something and I'm pretty sure it doesn't have all the context I
| literally just say "ask me clarifying questions until you have
| enough information to do a great job on this."
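| A minimal sketch of that pattern, assuming the current openai
| Python client (the exact prompt wording and the example user
| message are just illustrative, not anything OpenAI recommends):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     messages = [
|         {"role": "system", "content": (
|             "Before answering, ask me clarifying questions until "
|             "you have enough information to do a great job. Only "
|             "answer once I say 'go ahead'."
|         )},
|         {"role": "user", "content": "Refactor my auth module."},
|     ]
|
|     # First turn: the model should come back with questions
|     # rather than a finished answer.
|     reply = client.chat.completions.create(
|         model="gpt-4o", messages=messages
|     )
|     print(reply.choices[0].message.content)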
| aimanbenbaha wrote:
| And this chain of prompts, combined with the improved CoT
| reasoner, would yield much better results. More in
| line with what the coming agentic era promises.
| vintermann wrote:
| Yes. You can only do so much with the information you get in.
| The ability to _ask good questions_ , not just of itself in
| internal monologue style, but actually of the user, would
| fundamentally make it better since it can get more information
| in.
|
| As it is now, it has a bad habit of, if it can't answer the
| question you asked, instead answering a similar-looking
| question which it thinks you may have meant. That is of course
| a great strategy for benchmarks, where you don't earn any
| points for saying you don't know. But it's extremely
| frustrating for real users, who didn't read their question from
| a test suite.
| LorenDB wrote:
| Meta question: @dang, can we ban MSN links and instead link
| directly to the original source?
| ericskiff wrote:
| What we can reasonably assume from statements made by insiders:
|
| They want a 10x improvement from scaling and a 10x improvement
| from data and algorithmic changes
|
| The sources of public data are essentially tapped
|
| Algorithmic changes will be an unknown to us until they release,
| but from published research this remains a steady source of
| improvement
|
| Scaling seems to stall if data is limited
|
| So with all of that taken together, the logical step is to figure
| out how to turn compute into better data to train on. Enter
| strawberry / o1, and now o3
|
| They can throw money, time, and compute at thinking about and
| then generating better training data. If the belief is that N
| billion new tokens of high quality training data will unlock the
| leap in capabilities they're looking for, then it makes sense to
| delay the training until that dataset is ready
|
| With o3 now public knowledge, imagine how long it's been churning
| out new thinking at expert level across every field. OpenAI's
| next moat may be the best synthetic training set ever.
|
| At this point I would guess we get 4.5 with a subset of this -
| some scale improvement, the algorithmic pickups since 4 was
| trained, and a cleaned and improved core data set but without
| risking leakage of the superior dataset
|
| When 5 launches, we get to see what a fully scaled version looks
| like with training data that outstrips average humans in almost
| every problem space
|
| Then the next o-model gets to start with that as a base and
| reason? It's likely to be remarkable.
| jsheard wrote:
| > With o3 now public knowledge, imagine how long it's been
| churning out new thinking at expert level across every field.
| OpenAI's next moat may be the best synthetic training set ever.
|
| Even taking OpenAI and the benchmark authors at their word
| that it consumes at least tens of dollars per task to
| hit peak performance, how much would it cost to have it produce
| a meaningfully large training set?
| qup wrote:
| That's the public API price isn't it?
| jsheard wrote:
| There is no public API for o3 yet, those are the numbers
| they revealed in the ARC-AGI announcement. Even if they
| were public API prices we can't assume they're making a
| profit on those for as long as they're billions in the red
| overall every year, its entirely possible that the public
| API prices are _less_ than what OpenAI is actually paying.
| Stevvo wrote:
| "With o3 now public knowledge, imagine how long it's been
| churning out new thinking at expert level across every field."
|
| I highly doubt that. o3 is many orders of magnitude more
| expensive than paying subject matter experts to create new
| data. It just doesn't make sense to pay six figures in compute
| to get o3 to make data a human could make for a few hundred
| dollars.
| dartos wrote:
| That's an interesting idea. What if OpenAI funded medical
| research initiatives in exchange for exclusive training
| rights on the research.
| onlyrealcuzzo wrote:
| It would be orders of magnitude cheaper to outsource to
| humans.
| dartos wrote:
| Not as sexy to investors though
| aswegs8 wrote:
| Wait didn't they just recently request researchers to pair
| up with them in exchange for the data?
| DougN7 wrote:
| Someone needs to dress up Mechanical Turk and repackage it as
| an AI company.....
| jitl wrote:
| That's basically every AI company that existed before GPT3
| bookaway wrote:
| Yes, I think they had to push this reveal forward because
| their investors were getting antsy with the lack of visible
| progress to justify continuing rising valuations. There is no
| other reason a confident company making continuous rapid
| progress would feel the need to reveal a product that 99% of
| companies worldwide couldn't use at the time of the reveal.
|
| That being said, if OpenAI is burning cash at lightspeed and
| doesn't have to publicly reveal the revenue they receive from
| certain government entities, it wouldn't come as a surprise
| if they let the government play with it early on in exchange
| for some much needed cash to set on fire.
|
| EDIT: The fact that multiple sites seem to be publishing
| GPT-5 stories similar to this one leads one to conclude that
| the o3 benchmark story was meant to counter the negativity
| from this and other similar articles that are just coming
| out.
| tshadley wrote:
| Seems to me o3 prices would be what the consumer pays, not
| what OpenAI pays. That would mean o3 could be more efficient
| in-house than paying subject-matter experts.
| lalalali wrote:
| What is OpenAI's margin on that product?
| mrshadowgoose wrote:
| Can SMEs deliver that data in a meaningful amount of time?
| Training data now is worth significantly more than data a
| year from now.
| dartos wrote:
| I'm curious how, if at all, they plan to get around compounding
| bias in synthetic data generated by models trained on synthetic
| data.
| ynniv wrote:
| Everyone's obsessed with new training tokens... It doesn't
| need to be more knowledgeable, it just needs to practice
| more. Ask any student: practice is synthetic data.
| dartos wrote:
| That leads to overfitting in ML land, which hurts overall
| performance.
|
| We know that unique data improves performance.
|
| These LLM systems are not students...
|
| Also, which students graduate and are immediately experts
| in their fields? Almost none.
|
| It takes years of practice in unique, often one-off,
| situations after graduation for most people to develop the
| intuition needed for a given field.
| ynniv wrote:
| It's overfitting when you train too large a model on too
| many details. Rote memorization isn't rewarding.
|
| The more concepts the model manages to grok, the more
| nonlinear its capabilities will be: we don't have a data
| problem, we have an educational one.
|
| Claude 3.5 was safety trained by Claude 3.0, and it's
| more coherent for it.
| https://www.anthropic.com/news/claudes-constitution
| dartos wrote:
| Overfitting can be caused by a lot of different things.
| Having an over abundance of one kind of data in a
| training set is one of those causes.
|
| It's why many pre-processing steps for image training
| pipelines will add copies of images at weird rotations,
| amounts of blur, and different cropping.
|
| > The more concepts the model manages to grok, the more
| nonlinear its capabilities will be
|
| These kind of hand wavey statements like "practice,"
| "grok," and "nonlinear its capabilities will be" are not
| very constructive as they don't have solid meaning wrt
| language models.
|
| So earlier when I was referring to compounding bias in
| synthetic data I was referring to a bias that gets
| trained on over and over and over again.
|
| That leads to overfitting.
| ynniv wrote:
| _These kind of hand wavey statements like "practice,"
| "grok," and "nonlinear its capabilities will be" are not
| very constructive as they don't have solid meaning wrt
| language models._
|
| So, here's my hypothesis, as someone who is adjacent to ML
| but hasn't trained DNNs directly:
|
| We don't understand how they work, because we didn't
| build them. They built themselves.
|
| At face value this can be seen as an almost spiritual
| position, but I am not a religious person and I don't
| think there's any magic involved. Unlike traditional
| models, the behavior of DNNs is based on random changes
| that failed up. We can reason about their structure, but
| only loosely about their functionality. When they get
| better at drawing, it isn't because we taught them to
| draw. When they get better at reasoning, it isn't because
| the engineers were better philosophers. Given this, there
| will not be a direct correlation between inputs and
| capabilities, but some arrangements do work better than
| others.
|
| If this is the case, high order capabilities should
| continue to increase with training cycles, as long as
| they are performed in ways that don't interfere with what
| has been successfully learned. People lamented the loss
| of capability that GPT 4 suffered as they increased
| safety. I think Anthropic has avoided this by choosing a
| less damaging way to tune a well performing model.
|
| I think these ideas are supported by Wolfram's reduction
| of the problem at
| https://writings.stephenwolfram.com/2024/08/whats-really-
| goi...
| dartos wrote:
| Your whole argument falls apart at
|
| > We don't understand how they work, because we didn't
| build them. They built themselves.
|
| We do understand how they work, we did build them. The
| mathematical foundation of these models are sound. The
| statistics behind them are well understood.
|
| What we don't exactly know is which parameters correspond
| to what results as it's different across models.
|
| We work backwards to see which parts of the network seem
| to relate to what outcomes.
|
| > When they get better at drawing, it isn't because we
| taught them to draw. When they get better at reasoning,
| it isn't because the engineers were better philosophers.
|
| Isn't this the exact opposite of reality?
|
| They get better at drawing because we improve their
| datasets, topologies, and their training methods and in
| doing so, teach them to draw.
|
| They get better at reasoning because the engineers and
| data scientists building training sets do get better at
| philosophy.
|
| They study what reasoning is and apply those learnings to
| the datasets and training methods.
|
| That's how CoT came about early on.
| ynniv wrote:
| Please, read the Wolfram blog
| layer8 wrote:
| And who will tell the model whether its practice results
| are correct or not? Students practice against external
| evaluators, it's not a self-contained system.
| nialv7 wrote:
| synthetic data is fine if you can ground the model somehow.
| that's why the o1/o3's improvements are mostly in reasoning,
| maths, etc., because you can easily tell if the data is wrong
| or not.
| noman-land wrote:
| I completely don't understand the use for synthetic data. What
| good is it to train a model basically on itself?
| psb217 wrote:
| The value of synthetic data relies on having non-zero signal
| about which generated data is "better" or "worse". In a
| sense, this is what reinforcement learning is about. I.e.,
| generate some data, have that data scored by some evaluator,
| and then feed the data back into the model with higher weight
| on the better stuff and lower weight on the worse stuff.
|
| The basic loop is: (i) generate synthetic data, (ii) rate
| synthetic data, (iii) update model to put more probability on
| better data and less probability on worse data, then go back
| to (i).
| noman-land wrote:
| Thanks, that makes a lot more sense.
| RedNifre wrote:
| But who rates the synthetic data? If it is humans, I can
| understand that this is another way to get human knowledge
| into it, but if it's rated by AI, isn't it just a
| convoluted way of copying the rating AI's knowledge?
| ijustlovemath wrote:
| This is the bit I've never understood about training AI
| on its own output; won't you just regress to the mean?
| recursivecaveat wrote:
| Many things are more easily scored than produced. Like
| it's trivial to tell whether a poem rhymes, but writing
| one is a comparatively slow and difficult task. So
| hopefully since scoring is easier/more-discerning than
| generating, the idea is you can generate stuff, classify
| it as good or bad, and then retrain on the good stuff.
| It's kind of an article of faith for a lot of AI
| companies/professionals as well, since it prevents you
| from having to face a data wall, and is analogous to a
| human student practicing and learning in an appealing
| way.
|
| As far as I know it doesn't work very well so far. It is
| prone to overfitting, where it ranks highly some trivial
| detail of the output eg "if a summary starts with a
| byline of the author its a sign of quality" and then
| starts looping on itself over and over, increasing the
| frequency and size of bylines until it's totally crommed
| off to infinity and just repeating a short phrase
| endlessly. Humans have good baselines and common sense
| that these ML systems lack, if you've ever seen one of
| those "deep dream" images it's the same kind of idea. The
| "most possible dog" image can be looks almost nothing
| like a dog in the same way that the "most possible poem"
| may look nothing like a poem.
| viraptor wrote:
| This is a good read for some examples
| https://arxiv.org/abs/2203.14465
|
| > This technique, the "Self-Taught Reasoner" (STaR), relies
| on a simple loop: generate rationales to answer many
| questions, prompted with a few rationale examples; if the
| generated answers are wrong, try again to generate a
| rationale given the correct answer; fine-tune on all the
| rationales that ultimately yielded correct answers; repeat.
| We show that STaR significantly improves performance on
| multiple datasets compared to a model fine-tuned to directly
| predict final answers
|
| But there are a few others. In general good data is good
| data. We're definitely learning more about how to produce
| good synthetic versions.
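| As a rough sketch, the STaR loop quoted above looks something
| like this (the generate_rationale and finetune helpers are
| hypothetical placeholders, not the paper's actual code):
|
|     def star_round(model, dataset):
|         keep = []
|         for question, answer in dataset:
|             # Try to produce a rationale and answer from scratch.
|             rationale, predicted = model.generate_rationale(question)
|             if predicted != answer:
|                 # Retry with the correct answer given as a hint
|                 # ("rationalization" in the paper).
|                 rationale, predicted = model.generate_rationale(
|                     question, hint=answer)
|             if predicted == answer:
|                 keep.append((question, rationale, answer))
|         # Fine-tune only on rationales that led to correct
|         # answers, then repeat with the updated model.
|         model.finetune(keep)
|         return model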
| im3w1l wrote:
| One issue with that is that the model may learn to smuggle
| data. You as a human think that the plain reading of the
| words is what is doing the reasoning, but (part of) the
| processing is done by the exact comma placement and synonym
| choice etc.
|
| Data smuggling is a known phenomenon in similar tasks.
| nialv7 wrote:
| > OpenAI's next moat
|
| I don't think oai has any moat at all. If you look around, QwQ
| from Alibaba is already pushing o1-preview performances. I
| think oai is only ahead by 3~6 months at most.
| vasco wrote:
| If their AGI dreams would come true it might be more than
| enough to have 3 months head start. They probably won't, but
| it's interesting to ponder what the next few hours, days,
| weeks would be for someone that would wield AGI.
|
| Like let's say you have a few datacenters of compute at your
| disposal and the ability to instantiate millions of AGI
| agents - what do you have them do?
|
| I wonder if the USA already has a secret program for this
| under national defense. But it is interesting that once you
| do control an actual AGI you'd want to speed-run a bunch of
| things. In opposition to that, how do you detect an adversary
| already has / is using it and what to do in that case.
| nradov wrote:
| There is an enormous "iceberg" of untapped non-public data
| locked behind paywalls or licensing agreements. The next
| frontier will be spending money and human effort to get access
| to that data, then transform it into something useful for
| training.
| sdwr wrote:
| Great improvements and all, but they are still no closer (as of
| 4o regular) to having a system that can be responsible for
| work. In math problems, it forgets which variable represents
| what; in coding questions it invents library fns.
|
| I was watching a YouTube interview with a "trading floor
| insider". They said they were really being paid for holding
| risk. The bank has a position in a market, and it's their ass
| on the line if it tanks.
|
| ChatGPT (as far as I can tell) is no closer to being
| accountable or responsible for anything it produces. If they
| don't solve that (and the problem is probably inherent to the
| architecture), they are, in some sense, polishing a turd.
| tucnak wrote:
| > ChatGPT (as far as I can tell) is no closer to being
| accountable or responsible for anything it produces.
|
| What does it even mean? How do you imagine that? You want
| OpenAI to take on liability for the kicks of it?
| numpad0 wrote:
| If an LLM can't be left to do mowing by itself, but a human
| will have to closely monitor and intervene at its every
| step, then it's just a super fast predictive keyboard, no?
| SpicyLemonZest wrote:
| They would want to, if they thought they could, because
| doing so would unblock a ton of valuable use cases. A tax
| preparation or financial advisor AI would do huge numbers
| for any company able to promise that its advice can be
| trusted.
| dmkolobov wrote:
| Obviously not. I want legislation which imposes liability
| on OpenAI and similar companies if they actively market
| their products for use in safety-critical fields and their
| product doesn't perform as advertised.
|
| If a system is providing incorrect medical diagnoses, or
| denying services to protected classes due to biases in the
| training data, someone should be held
| accountable.
| neonate wrote:
| https://archive.ph/L7fOF
| h_tbob wrote:
| It seems Google has a massive advantage here since they can tap
| all of YouTube to train. I wonder what OpenAI is using for its
| video data source.
| onemoresoop wrote:
| Train for what? For making videos? Train from people's
| comments? There's a lot of garbage and AI slop on YouTube; how
| would this be sifted out? I think there's more value here on HN
| in terms of training, but even that, to what avail?
| a1j9o94 wrote:
| YouTube is such a great multimodal dataset--videos, auto-
| generated captions, and real engagement data all in one
| place. That's a strong starting point for training, even
| before you filter for quality. Microsoft's Phi-series models
| already show how focusing on smaller, high-quality datasets,
| like textbooks, can produce great results. You could totally
| imagine doing the same thing with YouTube by filtering for
| high-quality educational videos.
|
| Down the line, I think models will start using video
| generation as part of how they "think." Picture a version of
| GPT that works frame by frame--ask it to solve a geometry
| problem, and it generates a sequence of images to visualize
| the solution before responding. YouTube's massive library of
| visual content could make something like that possible.
| h_tbob wrote:
| From what I read, OpenAI is having trouble because there isn't
| enough data.
|
| If you think about it, any videos on YouTube of real world data
| contribute to its understanding of physics at minimum. From
| what I gather they do pre training on tons of unstructured
| content first and that contributes to overall smartness.
| oytis wrote:
| Good that we already have AGI in o3.
| Wowfunhappy wrote:
| Archive.is does not work for this article, does anyone have a
| workaround?
| randcraw wrote:
| Right. "You have been blocked", is what I get.
|
| But this works: https://www.msn.com/en-us/money/other/the-next-
| great-leap-in...
| cokml19 wrote:
| this one does https://archive.md/L7fOF (it is just the previous
| snapshot)
| captainbland wrote:
| In my intuition it makes sense that there is going to be some
| significant friction in LLM development going forward. We're
| talking about models that will cost upwards of $1bn to train.
| Save for a technological breakthrough, GPT-6/7 will probably have
| to wait for hardware to catch up.
| rrrrrrrrrrrryan wrote:
| I think the main bottleneck right now is training data -
| they've basically exhausted all public sources of data, so they
| have to either pay humans to generate new data from scratch or
| pay for the reasoning models to generate (less useful)
| synthetic training data. The next bottleneck is hardware, and
| the least important bottleneck is money.
| vrighter wrote:
| probably because it isn't any better
| OutOfHere wrote:
| How about just an updated GPT-4o with all newer data? It would go
| a long way. Currently it doesn't know anything since Oct 2023
| (without having to do a web search).
| simonw wrote:
| "OpenAI's is called GPT-4, the fourth LLM the company has
| developed since its 2015 founding." - that sentence doesn't fill
| me with confidence in the quality of the rest of the article,
| sadly.
| 404mm wrote:
| Quite funny that an article about AI was not fed to AI to
| proofread it.
| ToucanLoucan wrote:
| Bold of you to assume AI didn't write it, too.
| viraptor wrote:
| Editing mistakes that AI wouldn't make is the new "proof of
| human input".
| jacobsimon wrote:
| There's nothing grammatically offensive about this. It's like
| saying, "Cars come in all colors. Mine is red."
| simonw wrote:
| No, I'm complaining that just because GPT-4 is called GPT-4
| doesn't mean it's the fourth LLM from OpenAI.
|
| Off the top of my head: GPT-2, Codex, GPT-3 in three
| different flavors (babbage, curie, davinci), GPT-3.5.
|
| Suggesting that GPT-4 was "fourth" simply isn't credible.
|
| Just the other day they announced a jump from o1 to o3,
| skipping o2 purely because it's already the name of a major
| telecommunications brand in Europe. Deriving anything from
| the names of OpenAI's products doesn't make sense.
| benatkin wrote:
| While I'm sure it's unintentional, that amounts to
| nitpicking. I can easily find three to include and pass
| over the rest. Face value turns out to be a decent
| approximation.
| vasco wrote:
| Imagine coming up with a naming scheme for the versioning
| of your product just for it to fail on the second time you
| want to use it.
| lelandfe wrote:
| It's more like saying "the Audi Quattro, the company's fourth
| car..."
| benatkin wrote:
| Because there's an Audi Tre e Mezzo?
| dghlsakjg wrote:
| The issue isn't the grammar. It is that there are 5 distinct
| LLMs from OpenAI that you can use right now as well as 4
| others that were deprecated in 2024.
| selimnairb wrote:
| I'm not smart enough or interesting enough to be hired by OpenAI
| to expertly solve problems and explain how to the AI. However, I
| like to think there isn't enough money in the world for me to
| sell out my colleagues like that.
| bwhiting2356 wrote:
| I want AI to help me in the physical world: folding my laundry,
| cooking and farming healthy food, cleaning toilets. Training data
| is not lying around on the internet for free, but it's also not
| impossible. How much data do you need? A dozen warehouses full of
| robots folding and unfolding laundry 24/7 for a few months?
| bobxmax wrote:
| We are close. Language models and large vision models have
| transformed robotics. It just takes some time to get hardware
| up and running.
| kelnos wrote:
| I think it would be many decades before I'd trust a robot
| like that around small children or pets. Robots with that
| kind of movement capability, as well as the ability to pick
| up and move things around, will be heavy enough that a small
| mistake could easily kill a small child or pet.
| viraptor wrote:
| That's a solved problem for small devices. And we
| effectively have "robots" like that all over the place.
| Sliding doors in shops/trains/elevators have been around
| for ages and they include sensors for resistance. Unless
| there's 1. extreme cost cutting, or 2. bug in the hardware,
| devices like that wouldn't kill children these days.
| layer8 wrote:
| Even for adults, a robot that would likely have to be close
| to as massive as a human being, in order to do laundry and
| the like, would spook me out, moving freely through my
| place.
| leonheld wrote:
| > have transformed robotics
|
| Did they? Where? Seriously, I genuinely want to know who is
| employing these techniques.
| bobxmax wrote:
| All frontier labs are now employing LVMs or LLMs. But
| that's my point: you won't see the fruits of it this
| early.
| achierius wrote:
| That's the point being made. It's transformed robotics
| research, yes, but it remains to be seen both whether it will
| have a truly transformative effect on the field as
| experienced by people outside academia (I think this is
| quite probable) and, more pointedly, _when_.
| fragmede wrote:
| https://www.figure.ai/
|
| specifically their speech demo video (which is, of course,
| a demo video)
|
| https://youtu.be/Sq1QZB5baNw
|
| https://www.1x.tech/neo and
|
| https://www.unitree.com/h1/
|
| are undoubtedly using such models.
|
| It's an area of active research, eg
|
| https://www.physicalintelligence.company/blog/pi0
|
| https://wholebody-b1.github.io/
|
| https://ok-robot.github.io/
|
| https://mobile-aloha.github.io/
| SpicyLemonZest wrote:
| Laundry folding is an instructive example. Machines have been
| capable of home-scale laundry folding for over a decade, with
| two companies Foldimate and Laundroid building functional
| prototypes. The challenge is making it cost-competitive in a
| world where most people don't even purchase a $10 folding
| board.
|
| I would guess that most cooking and cleaning tasks are in
| basically the same space. You don't need fine motor control to
| clean a toilet bowl, but you've gotta figure out how to get
| people to buy the well-proven premisting technology before
| you'll be able to sell them a toilet-cleaning robot.
| layer8 wrote:
| Counterexample: Everyone uses dishwashers. Yet I don't think
| we'll have a robot doing the dishes human-style, or even just
| filling up and clearing out a dishwasher, within the next
| decade or two, regardless of price.
| Animats wrote:
| _" Orion's problems signaled to some at OpenAI that the more-is-
| more strategy, which had driven much of its earlier success, was
| running out of steam."_
|
| So LLMs finally hit the wall. For a long time, more data, bigger
| models, and more compute to drive them worked. But that's
| apparently not enough any more.
|
| Now someone has to have a new idea. There's plenty of money
| available if someone has one.
|
| The current level of LLM would be far more useful if someone
| could get a conservative confidence metric out of the internals
| of the model. This technology desperately needs to output "Don't
| know" or "Not sure about this, but ..." when appropriate.
| synapsomorphy wrote:
| The new idea is already here and it's reasoning / chain of
| thought.
|
| Anecdotally Claude is pretty good at knowing the bounds of its
| knowledge.
| whoisthemachine wrote:
| Unfortunately, the best they can do is "This is my confidence
| on what someone would say given the prior context".
| briga wrote:
| What wall? Not a week has gone by in recent years without an
| LLM breaking new benchmarks. There is little evidence to
| suggest it will all come to a halt in 2025.
| jrm4 wrote:
| Sure, but "benchmarks" here seems roughly as useful as
| "benchmarks" for GPUs or CPUs, which don't much translate to
| what the makers of GPT need, which is 'money making use
| cases.'
| peepeepoopoo98 wrote:
| O3 has demonstrated that OpenAI needs 1,000,000% more
| inference time compute to score 50% higher on benchmarks. If
| O3-High costs about $350k an hour to operate, that would mean
| making O4 score 50% higher would cost _$3.5B_ (!!!) an hour.
| _That_ scaling wall.
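| (Spelling out that arithmetic: a 1,000,000% increase is roughly
| a 10,000x multiplier, so $350k/hour x 10,000 = $3.5B/hour --
| assuming the reported per-hour figure is accurate to begin
| with.)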
| Kuinox wrote:
| Wait a few months and they will have a distilled model with
| the same performance and 1% of the run cost.
| peepeepoopoo98 wrote:
| 100X efficiency improvement (doubtful) still means that
| costs grow 200X faster than benchmark performance.
| achierius wrote:
| Even assuming that past rates of inference cost scaling
| hold up, we would only expect a 2 OoM decrease after
| about a year or so. And 1% of 3.5b is still a very large
| number.
| norir wrote:
| I used to run a lot of monte carlo simulations where the
| error is proportional to the inverse square root. There was
| a huge advantage of running for an hour vs a few minutes,
| but you hit the diminishing returns depressingly quickly.
| It would not surprise me at all if llms end up having
| similar scaling properties.
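| A toy illustration of that scaling, estimating pi by sampling
| points in the unit square (each 100x more samples buys only
| about 10x less error):
|
|     import math, random
|
|     def estimate_pi(n):
|         hits = sum(
|             1 for _ in range(n)
|             if random.random() ** 2 + random.random() ** 2 <= 1.0
|         )
|         return 4 * hits / n
|
|     for n in (1_000, 10_000, 100_000, 1_000_000):
|         err = abs(estimate_pi(n) - math.pi)
|         print(f"n={n:>9,}  error={err:.5f}")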
| oceanplexian wrote:
| I'm convinced they're getting good at gaming the benchmarks
| since 4 has deteriorated via ChatGPT, in fact I've used
| 4-0125 and 4-1106 via the API and find them far superior to
| o1 and o1-mini at coding problems. GPT4 is an amazing tool
| but the true capabilities are being hidden from the public
| and/or intentionally neutered.
| og_kalu wrote:
| Not really. o3-low compute still stomps the benchmarks and
| isn't anywhere near that expensive, and o3-mini seems better than
| o1 while being cheaper.
|
| Combine that with the fact that LLM inference costs have
| dropped by orders of magnitude over the last few years, and
| hand-wringing over the inference costs of a new release seems a
| bit silly.
| simonw wrote:
| The new idea is inference-time scaling, as seen in o1 (and o3
| and Qwen's QwQ and DeepSeek's DeepSeek-R1-Lite-Preview and
| Google's gemini-2.0-flash-thinking-exp).
|
| I suggest reading these two pieces about that:
|
| - https://www.aisnakeoil.com/p/is-ai-progress-slowing-down -
| best explanation I've seen of inference scaling anywhere
|
| - https://arcprize.org/blog/oai-o3-pub-breakthrough - Francois
| Chollet's deep dive into o3
|
| I've been tracking it on this tag on my blog:
| https://simonwillison.net/tags/inference-scaling/
| exhaze wrote:
| I think the wildest thing is actually Meta's latest paper
| where they show a method for LLMs reasoning not in English,
| but in _latent space_
|
| https://arxiv.org/pdf/2412.06769
|
| I've done research myself adjacent to this (mapping parts of
| a latent space onto a manifold), but this is a bit eerie,
| even to me.
| asadalt wrote:
| Kinda how we do it. Language is just an I/O interface (but
| also neural obv) on top of our reasoning engine.
| ynniv wrote:
| Is it "eerie"? LeCun has been talking about it for some
| time, and may also be OpenAI's rumored q-star. You can't
| hill climb tokens, but you can climb manifolds.
| knapcio wrote:
| I'm wondering whether O3 can be used to explore its own
| improvement or optimization ideas, or if it hasn't reached that
| point yet.
| Yizahi wrote:
| To output "don't know" a system needs to "know" too. Random
| token generator can't know. It can guess better and better,
| maybe it can even guess 99.99% of time, but it can't know, it
| can't decide or reason (not even o1 can "reason").
| Yizahi wrote:
| GPT-5 is not behind schedule. GPT-5 is called GPT-4o and it was
| already released half a year ago. It was not revolutionary
| enough to be called 5, and prophet saint Altman was probably
| afraid to release new gen not exponentially improving, so it was
| rebranded in the last moment. It's speculation of course, but it
| is kinda obvious speculation.
| glenstein wrote:
| >GPT-5 is called GPT-4o
|
| This is the first I have heard of this in particular. Do you
| know of any article or source for more on the efforts to train
| GPT 5 and the decision to call it GPT 4o?
| phillipcarter wrote:
| More palace intrigue, sigh.
|
| Meanwhile, the biggest opportunity lies not in whatever next
| thing OpenAI releases, but the rest of the enormous software
| industry actually integrating this technology and realizing the
| value it can deliver.
___________________________________________________________________
(page generated 2024-12-22 23:00 UTC)