[HN Gopher] GPT-5 is behind schedule
       ___________________________________________________________________
        
       GPT-5 is behind schedule
        
       Author : owenthejumper
       Score  : 118 points
       Date   : 2024-12-22 12:29 UTC (10 hours ago)
        
 (HTM) web link (www.wsj.com)
 (TXT) w3m dump (www.wsj.com)
        
       | A_D_E_P_T wrote:
       | Counterpoint: o1-Pro is insanely good -- subjectively, it's as
       | far above GPT4 as GPT4 was above 3. It's almost _too_ good. Use
       | it properly for an extended period of time, and one begins to
        | worry about the future of one's children and the utility of
       | their schooling.
       | 
       | o3, by all accounts, is better still.
       | 
       | Seems to me that things are progressing quickly enough.
        
         | apwell23 wrote:
         | what do you use it for ?
        
         | phito wrote:
         | I keep reading this on HN so I believe it has to be true in
         | some ways, but I don't really feel like there is any difference
         | in my limited use (programming questions or explaining some
         | concepts).
         | 
         | If anything I feel like it's all been worse compared to the
         | first release of ChatGPT, but I might be wearing rose colored
         | glasses.
        
           | delusional wrote:
           | I'd say the same. I've tried a bunch of different AI tools,
           | and none of them really seem all that helpful.
        
             | ogogmad wrote:
             | One use-case: They help with learning things quickly by
             | having a chat and asking questions. And they never get
             | tired or emotional. Tutoring 24/7.
             | 
             | They also generate small code or scripts, as well as
             | automate small things, when you're not sure how, but you
             | know there's a way. You need to ensure you have a way to
             | verify the results.
             | 
             | They do language tasks like grammar-fixing, perfect
             | translation, etc.
             | 
             | They're 100 times easier and faster than search engines, if
             | you limit your uses to that.
        
               | vintermann wrote:
               | They can't help you learn what they don't know
               | themselves.
               | 
               | I'm trying to use them to read historical handwritten
               | documents in old Norwegian (Danish, pretty much). Not
               | only do they not handle the German-style handwriting, but
               | what they spit out looks like the sort of thing GPT-2
               | would spit out if you asked it to write Norwegian (only
                | slightly better than the Muppet Swedish Chef's
               | Swedish). It seems the experimental tuning has made it
               | _worse_ at the task I most desperately want to use it
               | for.
               | 
               | And when you think about it, how could it _not_ overfit
               | in some sense, when trained on its own output? No new
               | information is coming in, so it pretty much has to get
               | worse at _something_ to get better at all the benchmarks.
        
               | ben_w wrote:
               | > perfect translation
               | 
               | Hah, no. They're good, but they definitely make stuff up
               | when the context gets too long. Always check their
               | output, just the same as you already note they need for
               | small code and scripts.
        
           | omega3 wrote:
            | Same. On every release from OpenAI or Anthropic I keep reading
            | how the new model is so much better (insert hyperbole here)
            | than the previous one, yet when using it I feel like they are
            | mostly the same as last year.
        
           | mathieuh wrote:
           | It's the same for me. I genuinely don't understand how I can
           | be having such a completely different experience from the
           | people who rave about ChatGPT. Every time I've tried it's
           | been useless.
           | 
           | How can some people think it's amazing and has completely
           | changed how they work, while for me it makes mistakes that a
           | static analyser would catch? It's not like I'm doing anything
           | remarkable, for the past couple of months I've been doing
           | fairly standard web dev and it can't even fix basic problems
            | with HTML. It will suggest things that just don't work at all
            | and that my IDE catches; it invents APIs for packages.
           | 
           | One guy I work with uses it extensively and what it produces
           | is essentially black boxes. If I find a problem with
           | something "he" (or rather ChatGPT) has produced it takes him
           | ages to commune with the machine spirit again to figure out
           | how to fix it, and then he still doesn't understand it.
           | 
           | I can't help but see this as a time-bomb, how much completely
           | inscrutable shite are these tools producing? In five years
           | are we going to end up with a bunch of "senior engineers" who
           | don't actually understand what they're doing?
           | 
           | Before people cry "o tempora o mores" at me and make
           | parallels with the introduction of high-level languages, at
           | least in order to write in a high-level language you need
           | some basic understanding of the logic that is being executed.
        
             | lm28469 wrote:
             | > How can some people think it's amazing and has completely
              | changed how they work, while for me it makes mistakes that
              | a static analyser would catch?
             | 
              | There are a lot of code monkeys working on boilerplate
              | code; these people used to rely on Stack Overflow, and now
              | that ChatGPT is here it's a huge improvement for them.
              | 
              | If you work on anything remotely complex, or on anything that
              | hasn't been solved 10 times on Stack Overflow, ChatGPT isn't
              | remotely as useful.
        
             | globular-toast wrote:
             | The ones who use it extensively are the same that used to
             | hit up stackoverflow as the first port of call for every
             | trivial problem that came their way. They're not really
             | engineers, they just want to get stuff done.
        
               | phist_mcgee wrote:
               | No ad hominem please.
        
             | ben_w wrote:
             | > How can some people think it's amazing and has completely
              | changed how they work, while for me it makes mistakes that
              | a static analyser would catch? It's not like I'm
             | doing anything remarkable, for the past couple of months
             | I've been doing fairly standard web dev and it can't even
             | fix basic problems with HTML.
             | 
             | Part of this is, I think, anchoring and expectation
             | management: you hear people say it's amazing and wonderful,
             | and then you see it fall over and you're naturally
             | disappointed.
             | 
             | My formative years started off with Commodore 64 basic
             | going "?SYNTAX ERROR" from most typos plus a lot of "I
             | don't know what that means" from the text adventures, then
             | Metrowerks' C compiler telling me there were errors on
             | every line _*after but not including*_ the one where I
             | forgot the semicolon, then surprises in VisualBasic and
             | Java where I was getting integer division rather than
             | floats, then the fantastic oddity where accidentally
             | leaning on the option key on a mac keyboard while pressing
             | minus turns the minus into an n-dash which looked
             | completely identical to a minus on the Xcode default font
             | at the time and thus produced a very confusing compiler
             | error...
             | 
             | So my expectations have always been low for machine
             | generated output. And it has wildly exceeded those low
             | expectations.
             | 
             | But the expectation management goes both ways, especially
             | when the comparison is "normal humans" rather than "best
             | practices". I've seen things you wouldn't believe...
             | Entire files copy-pasted line for line, "TODO: deduplicate"
             | and all,       20 minute app starts passed off as
             | "optimized solutions."       FAQs filled with nothing but
             | Bob Ross quotes,       a zen garden of "happy little
             | accidents."            I watched iOS developers use UI
             | tests       as a complete replacement for storyboards,
             | bi-weekly commits, each a sprawling novel of despair,
             | where every change log was a tragic odyssey.
             | Google Spreadsheets masquerading as bug trackers,
             | Swift juniors not knowing their ! from their ?,       All
             | those hacks and horrors... lost in time,       Time to
             | deploy.
             | 
             | (All true, and all pre-dating ChatGPT).
             | 
             | > It will suggest things that just don't work at all and my
             | IDE catches, it invents APIs for packages.
             | 
             | Aye. I've even had that with models forgetting the APIs
             | they themselves have created, just outside the context
             | window.
             | 
             | To me, these are tools. They're fantastic tools, but
             | they're not something you can blindly fire-and-forget...
             | 
             | ...fortunately for me, because my passive income is _not
              | quite_ high enough to cover mortgage payments, and I'm
             | looking for work.
             | 
             | > In five years are we going to end up with a bunch of
             | "senior engineers" who don't actually understand what
             | they're doing?
             | 
             | Yes, if we're lucky.
             | 
             | If we're not, the models keep getting better and we don't
             | have any "senior engineers" at all.
        
             | williamcotton wrote:
             | I found it very useful for writing a lexer and parser for a
             | search DSL and React component recently:
             | 
             | https://github.com/williamcotton/search-input-query
        
             | vrighter wrote:
              | The first time I tried it, I asked it to find bugs in a piece
             | of very well tested C code.
             | 
             | It introduced an off-by-one error by miscounting the number
              | of arguments in an sprintf call, breaking the program. It
              | then proceeded to fail to find the bug it had introduced.
        
             | jonas21 wrote:
             | I think the difference comes down to interacting with it
             | like IDE autocomplete vs. interacting with it like a
             | colleague.
             | 
             | It sounds like you're doing the former -- and yeah, on the
             | first pass, it'll sometimes make mistakes that autocomplete
             | wouldn't or generate code that's wrong or overly complex.
             | 
             | On the other hand, I've found that if you treat it more
             | like a colleague, it works wonderfully. Ask it to do
             | something hard, then read the code, and ask follow-up
             | questions. If you see something that's wrong, tell it, and
             | ask it to fix it. If you don't understand something, ask
             | for an explanation. I've found that this generates great
             | code that I understand better than if I had written it from
             | scratch, in a fraction of the time.
             | 
             | It also sounds like you're asking it to do things that you
             | already know how to do, in order to save time. I find that
              | it's most useful in tackling things that I _don't_ know
             | how to do.
             | 
             | This takes a big shift in mindset if you've been using IDEs
             | all your life and have expectations of LLMs being a fancy
             | autocomplete. I wonder if kids learning to code now will
             | natively understand how to get the most out of LLMs without
             | the preconceptions that those of us who have been in the
             | field for a while have.
        
           | fzeroracer wrote:
           | If you've ever used any enterprise software for long enough,
           | you know the exact same song and dance.
           | 
           | They release version Grand Banana. Purported to be
           | approximately 30% faster with brand new features like
           | Algorithmic Triple Layering and Enhanced Compulsory
           | Alignment. You open the app. Everything is slower, things are
           | harder to find and it breaks in new, fun ways. Your
           | organization pays a couple hundred more per person for these
           | benefits. Their stock soars, people celebrate the release and
           | your management says they can't wait to see the improvement
           | in workflows now that they've been able to lay off a quarter
           | of your team.
           | 
            | Have there been improvements in LLMs over time? Somewhat; most
           | of it concentrated at the beginning (because they siphoned up
           | a bunch of data in a dubious manner). Now it's just part of
           | their sales cycle, to keep pumping up numbers while no one
           | sees any meaningful improvement.
        
         | anonzzzies wrote:
         | Not sure what you are using it for, but it is terrible for me
         | for coding; claude beats it always and hands down. o1 just
         | thinks forever to come up with stuff it already tried the
         | previous time.
         | 
         | People say that's just prompting without pointing to real
         | million line+ repositories or realistic apps to show how that
         | can be improved. So I say they are making todo and hello world
         | apps and yes, there it works really well. Claude still beats
         | it, every.. single.. time..
         | 
          | And yes, I use the Pro tier of all of them and yes, I do assume
          | coding is done for most people. Become a plumber or electrician or
         | carpenter.
        
           | rubymamis wrote:
           | I find that o1 and Sonnet 3.5 are good and bad quite equally
           | on different things. That's why I keep asking both the same
           | coding questions.
        
             | anonzzzies wrote:
             | We do the same (all requests go to o1, sonnet and gemini
             | and we store the results for later to compare)
             | automatically for our research: Claude always wins. Even
              | with specific prompting on both platforms. Especially for
              | frontend work, o1 seems really terrible.
        
               | ynniv wrote:
               | Claude is trained on principles. GPT is trained on
               | billions of edge cases. Which student do you prefer?
        
               | rubymamis wrote:
               | Every time I try Gemini, it's really subpar. I found that
               | qwen2.5-coder-32b-instruct can be better.
               | 
                | Also, for me it's 50/50 between Sonnet and o1, though I'm
                | not 100% sure about it; I think o1 is better with longer
               | and more complicated (C++) code and debugging. At least
               | from my brief testing. Also, OpenAI models seem to be
               | more verbose - sometimes it's better - where I'd like
               | additional explanation on chosen fields in a SQL schema,
               | sometimes it's too much.
               | 
               | EDIT: Just asked both o1 and Sonnet 3.5 the same QML
               | coding question, and Sonnet 3.5 succeeded, o1 failed.
        
               | oceanplexian wrote:
               | Very anecdotal but I've found that for things that are
               | well spec'd out with a good prompt Sonnet 3.5 is far
               | better. For problems where I might have introduced a
               | subtle logical error o1 seems to catch it extremely well.
               | So better reasoning might be occurring but reasoning is
               | only a small part of what we would consider intelligence.
        
               | CapcomGo wrote:
                | Wins? What does this mean? Do you have any results? I see
                | the claims that Claude is better for coding a lot, but
                | using it alongside Gemini 2.0 Flash and o1, it sure
                | doesn't seem like it.
        
           | h_tbob wrote:
            | That's so weird, it seems like everybody here prefers Claude.
            | 
            | I've been using Claude and OpenAI in Copilot, and I find even
            | 4o seems to understand the problem better. o1 definitely
            | seems to get it right more often for me.
        
             | master_crab wrote:
             | Claude also has a better workflow UI. It'll maintain
             | conversation context while opening up new windows to
             | present code suggestions.
             | 
             | When I was still subscribing to OpenAI (about 4 months ago)
             | this didn't exist.
        
               | rrrrrrrrrrrryan wrote:
               | It exists as of last week with Canvas.
        
               | fragmede wrote:
               | If you're using the web interface of either, you might
               | consider looking into tools that focus on using LLMs for
               | code, so you're not copy/pasting.
        
             | A_D_E_P_T wrote:
             | They're both okay for coding, though for my use cases
             | (which are niche and involve quite a lot of mathematics and
             | formal logic) o1/o1-Pro is better. It seems to have a
             | better native grasp of mathematical concepts, and it can
             | even answer very difficult questions from vague inputs,
             | e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4be
             | d5b13e1...
        
             | orbital-decay wrote:
             | Different languages maybe? I find Sonnet v2 to be lacking
             | in Rust knowledge compared to 4o 11-20, but excelling at
             | Python and JS/TS. O1's strong side seems to be complex or
             | quirky puzzle-like coding problems that can be answered in
              | a short manner; it's meh at everything else, especially
             | considering the price. Which is understandable given its
             | purpose and training, but I have no use for it as that's
             | exactly the sort of problem I wouldn't trust an LLM to
             | solve.
             | 
             | Sonnet v2 in particular seems to be a bit broken with its
             | reasoning (?) feature. The one where it detects it might be
             | hallucinating (what's even the condition?) and reviews the
             | reply, reflecting on it. It can make it stop halfway into
             | the reply and decide it wrote enough, or invent some
             | ridiculous excuse to output a worse answer. Annoying,
             | although it doesn't trigger too often.
        
             | anonzzzies wrote:
             | I try to sprinkle 'for us/me' everywhere as much as I can;
             | we work on LoB/ERP apps mostly. These are small frontends
             | to massive multi million line backends. We carved a niche
             | by providing the frontends on these backends live at the
             | client office by a business consultant of ours: they simply
             | solve UX issues for the client on top of large ERP by using
             | our tool and prompting. Everything looks modern, fresh and
             | nice; unlike basically all the competitors in this space.
             | It's fast and no frontend people are needed for it; backend
             | is another system we built which takes a lot longer of
             | course as they are complex business rules. Both claude and
             | o1 turn up something that looks similar but only the claude
             | version will work and be, after less prompting, correct. I
             | don't have shares in either and I want open source to win;
             | we have all open (more open) solutions doing all the same
             | queries and we evaluate all but claude just wins. We did
             | manage even big wins with openai davinci in 2022 (or so;
             | before chatgpt), but this is a massive boost allowing us to
             | upgrade most people to business consultant and just have
             | them build with clients real time and have the tech guys
             | including me add _manually_ tests and proofs (where needed)
             | to know if we are actually fine. Works so much better than
             | the slog with clients before; people are so bad at
              | explaining what they need, it was slowly driving me
             | insane after doing it for 30+ years.
        
         | 1123581321 wrote:
         | O1 is effective, but it's slow. I would expect a GPT-5 and mini
         | to work as quickly as the 4 models.
        
         | Xcelerate wrote:
         | I had a 30 min argument with o1-pro where it was convinced it
         | had solved the halting problem. Tried to gaslight me into
         | thinking I just didn't understand the subtlety of the argument.
         | But it's susceptible to appeal to authority and when I started
         | quoting snippets of textbooks and mathoverflow it finally
         | relented and claimed there had been a "misunderstanding". It
         | really does argue like a human though now...
        
           | radioactivist wrote:
            | I had a similar experience with regular o1 about an integral
            | that was divergent. It was adamant that it wasn't, and would
            | respond to any attempt at persuasion with variants of "it's a
            | standard integral" with a "subtle cancellation". When I asked
            | for any source for this standard integral, it produced
            | references to support its argument that existed but didn't
            | actually contain the integral. When I told it the references
            | didn't have the result, it backpedalled (gaslighting!) to "I
            | never told you they were in there". When I pointed out that
            | in fact it did, it insisted this was just a
            | "misunderstanding". It only relented when I told it
           | Mathematica agreed the integral was divergent. It still
           | insisted it never said that the books it pointed to contained
           | this (false, non-sensical) result.
           | 
           | This was new behaviour for me to see in an LLM. Usually the
           | problem is these things would just fold when you pushed back.
           | I don't know which is better, but being this confidently
           | wrong (and "lying" when confronted with it) is troubling.
        
             | Animats wrote:
             | > but being this confidently wrong (and "lying" when
             | confronted with it) is troubling.
             | 
             | It works in politics, marketing, and self-promotion.
             | 
             | If you use the web as a training set, those categories
             | dominate.
        
               | layer8 wrote:
               | Maybe they also trained the model on Sam Altman. ;)
        
         | ldjkfkdsjnv wrote:
          | It basically solves all bugs/programming challenges I throw at
          | it, given I give it the right data.
        
       | construct0 wrote:
       | The world is figuring out how to make this technology fit and
       | work and somehow this is "behind" schedule. It's almost comical.
        
         | diego_sandoval wrote:
         | Reminds me of this Louis CK joke:
         | 
         | I was on an airplane and there was high-speed Internet on the
         | airplane. That's the newest thing that I know exists. And I'm
         | sitting on the plane and they go, open up your laptop, you can
         | go on the Internet.
         | 
         | And it's fast, and I'm watching YouTube clips. It's amazing.
         | I'm on an airplane! And then it breaks down. And they
         | apologize, the Internet's not working. And the guy next to me
         | goes, 'This is bullshit.' I mean, how quickly does the world
         | owe him something that he knew existed only 10 seconds ago?"
         | 
         | https://www.youtube.com/watch?v=me4BZBsHwZs
        
           | mensetmanusman wrote:
           | The investors need their returns now!
           | 
           | Soon, all the middle class jobs will be converted to profits
           | for the capital/data center owners, so they have to spend
           | while they can before the economy crashes due to lack of
           | spending.
        
           | omega3 wrote:
            | People who say "it's bullshit" are the ones that push
            | technological advances forward.
        
             | from-nibly wrote:
             | Not invariably. Some of those people are the ones who want
             | to draw 7 red lines all perpendicular, some with green ink,
             | some with transparent and one that looks like a kitten.
        
               | ziml77 wrote:
               | For anyone who hasn't seen what this comment is
               | referencing: https://www.youtube.com/watch?v=BKorP55Aqvg
        
             | taneq wrote:
             | No, people who say "it's bullshit" _and then do something
             | to fix the bullshit_ are the ones that push technology
             | forward. Most people who say  "it's bullshit" instantly
             | when something isn't perfect for exactly what they want
             | right now are just whingers and will never contribute
             | anything except unconstructive criticism.
        
               | omega3 wrote:
               | Sounds like "yes but" rather than "no" otherwise you're
               | responding to self created straw man.
        
             | bobxmax wrote:
             | That's really not true.
        
         | echelon wrote:
         | For a company that sees itself as the undisputed leader and
         | that wants to raise $7 trillion to build fabs, they deserve
         | some of the heaviest levels of scrutiny in the world.
         | 
         | If OpenAI's investment prospectus relies on them reaching AGI
         | before the tech becomes commoditized, everyone is going to look
         | for that weakness.
        
       | david-gpu wrote:
       | What I find odd is that o1 doesn't support attaching text
       | documents to chats the way 4o does. For a model that specializes
       | in reasoning, reading long documents seems like a natural feature
       | to have.
        
         | ionwake wrote:
         | If Sama ever reads this, I have no idea why no users seem to
         | focus on this, but it would be really good to prioritise being
         | able to select which model you can use with the custom myGPTs.
          | I know this may be hard or not possible without recreating
          | them, but I still don't think it's possible.
          | 
          | I don't think most customers realise how much better the models
          | work with custom GPTs.
        
           | throwaway314155 wrote:
           | At this point I think it's safe to say they have given up on
           | custom GPTs.
        
             | emeg wrote:
             | What makes you say that?
        
         | jillesvangurp wrote:
         | You can use the new project feature for that. That's a way of
         | grouping conversations, adding files, etc. Should work with o1
         | pro as well apparently.
        
           | david-gpu wrote:
           | "When using custom instructions or files, only GPT-4o is
           | available". Straight out of the ChatGPT web interface when
           | you try to select which model you want to use.
        
       | PittleyDunkin wrote:
       | Everyone's comparing o1 and claude, but neither really work well
       | enough to justify paying for them in my experience for coding.
       | What I really want is a mode where they ask _clarifying
       | questions_ , ideally many of them, before spitting out an answer.
       | This would greatly improve utility of producing something with
       | more value than an auto-complete.
        
         | kelsey98765431 wrote:
         | have you tested that this helps? seems pretty simple to script
         | with an agent framework
        
           | throwaway314155 wrote:
           | Or just f-strings.
        
         | Vecr wrote:
         | I know multiple people that carefully prompt to get that done.
         | The model outputs in direct token order, and can't turn around,
         | so you need to make sure that's strictly followed. The system
         | can and will come up with post-hoc "reasoning".
        
         | qup wrote:
         | Have you used them to build a system to ask you clarifying
         | questions?
         | 
         | Or even instructed them to?
        
         | simondotau wrote:
         | Just today I got Claude to convert a company's PDF protocol
         | specification into an actual working python implementation of
         | that protocol. It would have been uncreative drudge work for a
         | human, but I would have absolutely paid a week of junior dev
         | time for it. Instead I wrote it alongside AI and it took me
         | barely more than an hour.
         | 
         | The best part is, I've never written any (substantial) python
         | code before.
        
           | weird_fox wrote:
           | I have to agree. It's still a bit hit or miss, but the hits
           | are a huge time and money saver especially in refactoring.
           | And unlike what most of the rather demeaning comments in
           | those HN threads state, I am not some 'grunt' doing
           | 'boilerplate work'. I mostly do geometry/math stuff, and the
           | AIs really do know what they're talking about there
           | sometimes. I don't have many peers I can talk to most of the
           | time, and Claude is really helping me gather my thoughts.
           | 
           | That being said, I definitely believe it's only useful for
           | isolated problems. Even with Copilot, I feel like the AIs
           | just lack a bigger context of the projects.
           | 
           | Another thing that helped me was designing an initial prompt
           | that really works for me. I think most people just expect to
           | throw in their issue and get a tailored solution, but that's
           | just not how it works in my experience.
        
           | OutOfHere wrote:
           | It would seem you don't care too much about verifying its
           | output or about its correctness. If you did, it wouldn't take
           | you just an hour. I guess you'll let correctness be someone
           | else's problem.
        
         | coreyh14444 wrote:
         | Just tell it to do that and it will. Whenever I ask an AI for
         | something and I'm pretty sure it doesn't have all the context I
         | literally just say "ask me clarifying questions until you have
         | enough information to do a great job on this."
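          | 
          | Scripted, it's just a prompt template. A minimal sketch, where
          | ask_llm is a made-up stand-in for whatever client or API wrapper
          | you actually use:
          | 
          |       def ask_llm(prompt: str) -> str:
          |           # stand-in for a real chat-completion call
          |           return "[model reply to: " + prompt[:40] + "...]"
          | 
          |       def clarify_first(task: str) -> str:
          |           prompt = ("Ask me clarifying questions until you have "
          |                     "enough information to do a great job on "
          |                     "this, then do it:\n" + task)
          |           return ask_llm(prompt)
          | 
          |       print(clarify_first("Refactor the billing module"))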
        
           | aimanbenbaha wrote:
            | And this chain of prompts, combined with the improved CoT
            | reasoner, would produce much better results. More in
            | line with what the coming agentic era promises.
        
         | vintermann wrote:
         | Yes. You can only do so much with the information you get in.
          | The ability to _ask good questions_, not just of itself in
         | internal monologue style, but actually of the user, would
         | fundamentally make it better since it can get more information
         | in.
         | 
         | As it is now, it has a bad habit of, if it can't answer the
         | question you asked, instead answering a similar-looking
         | question which it thinks you may have meant. That is of course
         | a great strategy for benchmarks, where you don't earn any
         | points for saying you don't know. But it's extremely
         | frustrating for real users, who didn't read their question from
         | a test suite.
        
       | LorenDB wrote:
       | Meta question: @dang, can we ban MSN links and instead link
       | directly to the original source?
        
       | ericskiff wrote:
       | What we can reasonably assume from statements made by insiders:
       | 
       | They want a 10x improvement from scaling and a 10x improvement
       | from data and algorithmic changes
       | 
       | The sources of public data are essentially tapped
       | 
       | Algorithmic changes will be an unknown to us until they release,
       | but from published research this remains a steady source of
       | improvement
       | 
       | Scaling seems to stall if data is limited
       | 
       | So with all of that taken together, the logical step is to figure
       | out how to turn compute into better data to train on. Enter
       | strawberry / o1, and now o3
       | 
       | They can throw money, time, and compute at thinking about and
       | then generating better training data. If the belief is that N
       | billion new tokens of high quality training data will unlock the
       | leap in capabilities they're looking for, then it makes sense to
       | delay the training until that dataset is ready
       | 
       | With o3 now public knowledge, imagine how long it's been churning
       | out new thinking at expert level across every field. OpenAI's
       | next moat may be the best synthetic training set ever.
       | 
       | At this point I would guess we get 4.5 with a subset of this -
       | some scale improvement, the algorithmic pickups since 4 was
       | trained, and a cleaned and improved core data set but without
       | risking leakage of the superior dataset
       | 
       | When 5 launches, we get to see what a fully scaled version looks
       | like with training data that outstrips average humans in almost
       | every problem space
       | 
       | Then the next o-model gets to start with that as a base and
        | reason? It's likely to be remarkable.
        
         | jsheard wrote:
         | > With o3 now public knowledge, imagine how long it's been
         | churning out new thinking at expert level across every field.
         | OpenAI's next moat may be the best synthetic training set ever.
         | 
          | Even taking OpenAI and the benchmark authors at their word, they
          | said that it consumes at least tens of dollars per task to
          | hit peak performance. How much would it cost to have it produce
         | a meaningfully large training set?
        
           | qup wrote:
            | That's the public API price, isn't it?
        
             | jsheard wrote:
             | There is no public API for o3 yet, those are the numbers
             | they revealed in the ARC-AGI announcement. Even if they
             | were public API prices we can't assume they're making a
             | profit on those for as long as they're billions in the red
              | overall every year; it's entirely possible that the public
             | API prices are _less_ than what OpenAI is actually paying.
        
         | Stevvo wrote:
         | "With o3 now public knowledge, imagine how long it's been
         | churning out new thinking at expert level across every field."
         | 
         | I highly doubt that. o3 is many orders of magnitude more
         | expensive than paying subject matter experts to create new
         | data. It just doesn't make sense to pay six figures in compute
         | to get o3 to make data a human could make for a few hundred
         | dollars.
        
           | dartos wrote:
           | That's an interesting idea. What if OpenAI funded medical
           | research initiatives in exchange for exclusive training
              | rights on the research?
        
             | onlyrealcuzzo wrote:
             | It would be orders of magnitude cheaper to outsource to
             | humans.
        
               | dartos wrote:
               | Not as sexy to investors though
        
             | aswegs8 wrote:
             | Wait didn't they just recently request researchers to pair
             | up with them in exchange for the data?
        
           | DougN7 wrote:
           | Someone needs to dress up Mechanical Turk and repackage it as
           | an AI company.....
        
             | jitl wrote:
             | That's basically every AI company that existed before GPT3
        
           | bookaway wrote:
           | Yes, I think they had to push this reveal forward because
           | their investors were getting antsy with the lack of visible
           | progress to justify continuing rising valuations. There is no
           | other reason a confident company making continuous rapid
           | progress would feel the need to reveal a product that 99% of
           | companies worldwide couldn't use at the time of the reveal.
           | 
           | That being said, if OpenAI is burning cash at lightspeed and
           | doesn't have to publicly reveal the revenue they receive from
           | certain government entities, it wouldn't come as a surprise
           | if they let the government play with it early on in exchange
           | for some much needed cash to set on fire.
           | 
           | EDIT: The fact that multiple sites seem to be publishing
           | GPT-5 stories similar to this one leads one to conclude that
           | the o3 benchmark story was meant to counter the negativity
           | from this and other similar articles that are just coming
           | out.
        
           | tshadley wrote:
           | Seems to me o3 prices would be what the consumer pays, not
           | what OpenAI pays. That would mean o3 could be more efficient
           | in-house than paying subject-matter experts.
        
             | lalalali wrote:
                | What is OpenAI's margin on that product?
        
           | mrshadowgoose wrote:
           | Can SMEs deliver that data in a meaningful amount of time?
           | Training data now is worth significantly more than data a
           | year from now.
        
         | dartos wrote:
          | I'm curious how, if at all, they plan to get around compounding
          | bias in synthetic data generated by models trained on synthetic
          | data.
        
           | ynniv wrote:
           | Everyone's obsessed with new training tokens... It doesn't
           | need to be more knowledgeable, it just needs to practice
           | more. Ask any student: practice is synthetic data.
        
             | dartos wrote:
             | That leads to overfitting in ML land, which hurts overall
             | performance.
             | 
             | We know that unique data improves performance.
             | 
             | These LLM systems are not students...
             | 
             | Also, which students graduate and are immediately experts
             | in their fields? Almost none.
             | 
             | It takes years of practice in unique, often one-off,
             | situations after graduation for most people to develop the
             | intuition needed for a given field.
        
               | ynniv wrote:
               | It's overfitting when you train too large a model on too
               | many details. Rote memorization isn't rewarding.
               | 
               | The more concepts the model manages to grok, the more
               | nonlinear its capabilities will be: we don't have a data
               | problem, we have an educational one.
               | 
               | Claude 3.5 was safety trained by Claude 3.0, and it's
               | more coherent for it.
               | https://www.anthropic.com/news/claudes-constitution
        
               | dartos wrote:
               | Overfitting can be caused by a lot of different things.
               | Having an over abundance of one kind of data in a
               | training set is one of those causes.
               | 
               | It's why many pre-processing steps for image training
               | pipelines will add copies of images at weird rotations,
               | amounts of blur, and different cropping.
               | 
               | > The more concepts the model manages to grok, the more
               | nonlinear its capabilities will be
               | 
               | These kind of hand wavey statements like "practice,"
               | "grok," and "nonlinear its capabilities will be" are not
               | very constructive as they don't have solid meaning wrt
               | language models.
               | 
               | So earlier when I was referring to compounding bias in
               | synthetic data I was referring to a bias that gets
               | trained on over and over and over again.
               | 
               | That leads to overfitting.
        
               | ynniv wrote:
               | _These kind of hand wavey statements like "practice,"
               | "grok," and "nonlinear its capabilities will be" are not
               | very constructive as they don't have solid meaning wrt
               | language models._
               | 
               | So, here's my hypothesis, as someone who is adjacent ML
               | but haven't trained DNNs directly:
               | 
               | We don't understand how they work, because we didn't
               | build them. They built themselves.
               | 
               | At face value this can be seen as an almost spiritual
               | position, but I am not a religious person and I don't
               | think there's any magic involved. Unlike traditional
               | models, the behavior of DNNs is based on random changes
               | that failed up. We can reason about their structure, but
               | only loosely about their functionality. When they get
               | better at drawing, it isn't because we taught them to
               | draw. When they get better at reasoning, it isn't because
               | the engineers were better philosophers. Given this, there
               | will not be a direct correlation between inputs and
               | capabilities, but some arrangements do work better than
               | others.
               | 
               | If this is the case, high order capabilities should
               | continue to increase with training cycles, as long as
               | they are performed in ways that don't interfere with what
               | has been successfully learned. People lamented the loss
               | of capability that GPT 4 suffered as they increased
               | safety. I think Anthropic has avoided this by choosing a
               | less damaging way to tune a well performing model.
               | 
               | I think these ideas are supported by Wolfram's reduction
               | of the problem at
               | https://writings.stephenwolfram.com/2024/08/whats-really-
               | goi...
        
               | dartos wrote:
               | Your whole argument falls apart at
               | 
               | > We don't understand how they work, because we didn't
               | build them. They built themselves.
               | 
               | We do understand how they work, we did build them. The
               | mathematical foundation of these models are sound. The
               | statistics behind them are well understood.
               | 
               | What we don't exactly know is which parameters correspond
               | to what results as it's different across models.
               | 
               | We work backwards to see which parts of the network seem
               | to relate to what outcomes.
               | 
               | > When they get better at drawing, it isn't because we
               | taught them to draw. When they get better at reasoning,
               | it isn't because the engineers were better philosophers.
               | 
               | Isn't this the exact opposite of reality?
               | 
               | They get better at drawing because we improve their
               | datasets, topologies, and their training methods and in
               | doing so, teach them to draw.
               | 
               | They get better at reasoning because the engineers and
               | data scientists building training sets do get better at
               | philosophy.
               | 
               | They study what reasoning is and apply those learnings to
               | the datasets and training methods.
               | 
               | That's how CoT came about early on.
        
               | ynniv wrote:
               | Please, read the Wolfram blog
        
             | layer8 wrote:
             | And who will tell the model whether its practice results
             | are correct or not? Students practice against external
             | evaluators, it's not a self-contained system.
        
           | nialv7 wrote:
            | Synthetic data is fine if you can ground the model somehow.
            | That's why o1/o3's improvements are mostly in reasoning,
           | maths, etc., because you can easily tell if the data is wrong
           | or not.
        
         | noman-land wrote:
         | I completely don't understand the use for synthetic data. What
          | good is it to train a model basically on itself?
        
           | psb217 wrote:
           | The value of synthetic data relies on having non-zero signal
           | about which generated data is "better" or "worse". In a
           | sense, this what reinforcement learning is about. Ie,
           | generate some data, have that data scored by some evaluator,
           | and then feed the data back into the model with higher weight
           | on the better stuff and lower weight on the worse stuff.
           | 
           | The basic loop is: (i) generate synthetic data, (ii) rate
           | synthetic data, (iii) update model to put more probability on
           | better data and less probability on worse data, then go back
           | to (i).
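            | 
            | A minimal sketch of that loop in Python (the three functions
            | are made-up stand-ins; a real setup would sample from the
            | model, score with a reward model or verifier, and do a
            | weighted fine-tuning step):
            | 
            |       import random
            | 
            |       def generate(model, prompt):   # (i) stand-in for sampling
            |           return prompt + " " + random.choice(["A", "B"])
            | 
            |       def rate(sample):              # (ii) stand-in for an evaluator
            |           return 1.0 if sample.endswith("A") else 0.0
            | 
            |       def update(model, weighted):   # (iii) stand-in for the update
            |           return model
            | 
            |       model = object()
            |       for _ in range(3):             # repeat (i) -> (ii) -> (iii)
            |           samples = [generate(model, "Q:") for _ in range(8)]
            |           weighted = [(s, rate(s)) for s in samples]
            |           model = update(model, weighted)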
        
             | noman-land wrote:
             | Thanks, that makes a lot more sense.
        
             | RedNifre wrote:
             | But who rates the synthetic data? If it is humans, I can
             | understand that this is another way to get human knowledge
             | into it, but if it's rated by AI, isn't it just a
             | convoluted way of copying the rating AI's knowledge?
        
               | ijustlovemath wrote:
               | This is the bit I've never understood about training AI
               | on its own output; won't you just regress to the mean?
        
               | recursivecaveat wrote:
               | Many things are more easily scored than produced. Like
               | it's trivial to tell whether a poem rhymes, but writing
               | one is a comparatively slow and difficult task. So
               | hopefully since scoring is easier/more-discerning than
               | generating, the idea is you can generate stuff, classify
               | it as good or bad, and then retrain on the good stuff.
                | It's kind of an article of faith for a lot of AI
               | companies/professionals as well, since it prevents you
               | from having to face a data wall, and is analogous to a
               | human student practicing and learning in an appealing
               | way.
               | 
               | As far as I know it doesn't work very well so far. It is
               | prone to overfitting, where it ranks highly some trivial
               | detail of the output eg "if a summary starts with a
               | byline of the author its a sign of quality" and then
               | starts looping on itself over and over, increasing the
               | frequency and size of bylines until it's totally crommed
               | off to infinity and just repeating a short phrase
               | endlessly. Humans have good baselines and common sense
               | that these ML systems lack, if you've ever seen one of
               | those "deep dream" images it's the same kind of idea. The
               | "most possible dog" image can be looks almost nothing
               | like a dog in the same way that the "most possible poem"
               | may look nothing like a poem.
        
           | viraptor wrote:
           | This is a good read for some examples
           | https://arxiv.org/abs/2203.14465
           | 
           | > This technique, the "Self-Taught Reasoner" (STaR), relies
           | on a simple loop: generate rationales to answer many
           | questions, prompted with a few rationale examples; if the
           | generated answers are wrong, try again to generate a
           | rationale given the correct answer; fine-tune on all the
           | rationales that ultimately yielded correct answers; repeat.
           | We show that STaR significantly improves performance on
           | multiple datasets compared to a model fine-tuned to directly
           | predict final answers
           | 
           | But there are a few others. In general good data is good
           | data. We're definitely learning more about how to produce
            | good synthetic versions.
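            | 
            | Roughly, one STaR iteration looks like this (a sketch only;
            | reason() and finetune() are made-up stand-ins for the model
            | calls the paper describes):
            | 
            |       def reason(question, hint=None):    # stand-in for an LLM call
            |           guess = hint if hint is not None else "guess"
            |           return "because " + guess, guess
            | 
            |       def finetune(model, examples):      # stand-in for fine-tuning
            |           return model
            | 
            |       def star_iteration(model, dataset):
            |           keep = []
            |           for question, answer in dataset:
            |               rationale, guess = reason(question)
            |               if guess != answer:
            |                   # retry, this time showing the correct answer
            |                   rationale, guess = reason(question, hint=answer)
            |               if guess == answer:
            |                   # keep only rationales that led to correct answers
            |                   keep.append((question, rationale, answer))
            |           return finetune(model, keep)    # then repeat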
        
             | im3w1l wrote:
             | One issue with that is that the model may learn to smuggle
             | data. You as a human think that the plain reading of the
             | words is what is doing the reasoning, but (part of) the
             | processing is done by the exact comma placement and synonym
             | choice etc.
             | 
             | Data smuggling is a known phenomenon in similar tasks.
        
         | nialv7 wrote:
         | > OpenAI's next moat
         | 
         | I don't think oai has any moat at all. If you look around, QwQ
          | from Alibaba is already pushing o1-preview performance. I
         | think oai is only ahead by 3~6 months at most.
        
           | vasco wrote:
            | If their AGI dreams came true, a 3-month head start might be
            | more than enough. They probably won't, but
           | it's interesting to ponder what the next few hours, days,
           | weeks would be for someone that would wield AGI.
           | 
           | Like let's say you have a few datacenters of compute at your
           | disposal and the ability to instantiate millions of AGI
           | agents - what do you have them do?
           | 
           | I wonder if the USA already has a secret program for this
           | under national defense. But it is interesting that once you
           | do control an actual AGI you'd want to speed-run a bunch of
           | things. In opposition to that, how do you detect an adversary
           | already has / is using it and what to do in that case.
        
         | nradov wrote:
         | There is an enormous "iceberg" of untapped non-public data
         | locked behind paywalls or licensing agreements. The next
         | frontier will be spending money and human effort to get access
         | to that data, then transform it into something useful for
         | training.
        
         | sdwr wrote:
         | Great improvements and all, but they are still no closer (as of
         | 4o regular) to having a system that can be responsible for
         | work. In math problems, it forgets which variable represents
         | what, in coding questions it invents library fns.
         | 
         | I was watching a YouTube interview with a "trading floor
         | insider". They said they were really being paid for holding
         | risk. The bank has a position in a market, and it's their ass
         | on the line if it tanks.
         | 
         | ChatGPT (as far as I can tell) is no closer to being
         | accountable or responsible for anything it produces. If they
         | don't solve that (and the problem is probably inherent to the
         | architecture), they are, in some sense, polishing a turd.
        
           | tucnak wrote:
           | > ChatGPT (as far as I can tell) is no closer to being
           | accountable or responsible for anything it produces.
           | 
           | What does it even mean? How do you imagine that? You want
           | OpenAI to take on liability for the kicks of it?
        
             | numpad0 wrote:
             | If an LLM can't be left to do mowing by itself, but a human
              | will have to closely monitor it and intervene at every
              | step, then it's just a super fast predictive keyboard, no?
        
             | SpicyLemonZest wrote:
             | They would want to, if they thought they could, because
             | doing so would unblock a ton of valuable use cases. A tax
             | preparation or financial advisor AI would do huge numbers
             | for any company able to promise that its advice can be
             | trusted.
        
             | dmkolobov wrote:
             | Obviously not. I want legislation which imposes liability
             | on OpenAI and similar companies if they actively market
             | their products for use in safety-critical fields and their
             | product doesn't perform as advertised.
             | 
             | If a system is providing incorrect medical diagnoses, or
             | denying services to protected classes due to biases in the
              | training data, someone should be held
             | accountable.
        
       | neonate wrote:
       | https://archive.ph/L7fOF
        
       | h_tbob wrote:
        | It seems Google has a massive advantage here since they can tap
        | all of YouTube to train on. I wonder what OpenAI is using for its
       | video data source.
        
         | onemoresoop wrote:
         | Train for what? For making videos? Train from people's
          | comments? There's a lot of garbage and AI slop on YouTube; how
         | would this be sifted out? I think there's more value here on HN
         | in terms of training, but even that, to what avail?
        
           | a1j9o94 wrote:
           | YouTube is such a great multimodal dataset--videos, auto-
           | generated captions, and real engagement data all in one
           | place. That's a strong starting point for training, even
           | before you filter for quality. Microsoft's Phi-series models
           | already show how focusing on smaller, high-quality datasets,
           | like textbooks, can produce great results. You could totally
           | imagine doing the same thing with YouTube by filtering for
           | high-quality educational videos.
           | 
           | Down the line, I think models will start using video
           | generation as part of how they "think." Picture a version of
           | GPT that works frame by frame--ask it to solve a geometry
           | problem, and it generates a sequence of images to visualize
           | the solution before responding. YouTube's massive library of
           | visual content could make something like that possible.
        
           | h_tbob wrote:
           | From what I read, OpenAI is having trouble because there
           | isn't enough data.
           | 
           | If you think about it, any video of the real world on
           | YouTube contributes to its understanding of physics at
           | minimum. From what I gather, they do pre-training on tons of
           | unstructured content first, and that contributes to overall
           | smartness.
        
       | oytis wrote:
       | Good that we already have AGI in o3.
        
       | Wowfunhappy wrote:
       | Archive.is does not work for this article, does anyone have a
       | workaround?
        
         | randcraw wrote:
         | Right. "You have been blocked", is what I get.
         | 
         | But this works: https://www.msn.com/en-us/money/other/the-next-
         | great-leap-in...
        
         | cokml19 wrote:
         | This one does: https://archive.md/L7fOF (it's just the
         | previous snapshot)
        
       | captainbland wrote:
       | My intuition is that there is going to be some significant
       | friction in LLM development going forward. We're
       | talking about models that will cost upwards of $1bn to train.
       | Save for a technological breakthrough, GPT-6/7 will probably have
       | to wait for hardware to catch up.
        
         | rrrrrrrrrrrryan wrote:
         | I think the main bottleneck right now is training data -
         | they've basically exhausted all public sources of data, so they
         | have to either pay humans to generate new data from scratch or
         | pay for the reasoning models to generate (less useful)
         | synthetic training data. The next bottleneck is hardware, and
         | the least important bottleneck is money.
        
       | vrighter wrote:
       | probably because it isn't any better
        
       | OutOfHere wrote:
       | How about just an updated GPT-4o with all newer data? That
       | would go a long way. Currently it doesn't know anything from
       | after October 2023 (without having to do a web search).
        
       | simonw wrote:
       | "OpenAI's is called GPT-4, the fourth LLM the company has
       | developed since its 2015 founding." - that sentence doesn't fill
       | me with confidence in the quality of the rest of the article,
       | sadly.
        
         | 404mm wrote:
         | Quite funny that an article about AI was not fed to an AI to
         | proofread it.
        
           | ToucanLoucan wrote:
           | Bold of you to assume AI didn't write it, too.
        
           | viraptor wrote:
           | Editing mistakes that AI wouldn't make is the new "proof of
           | human input".
        
         | jacobsimon wrote:
         | There's nothing grammatically offensive about this. It's like
         | saying, "Cars come in all colors. Mine is red."
        
           | simonw wrote:
           | No, I'm complaining that just because GPT-4 is called GPT-4
           | doesn't mean it's the fourth LLM from OpenAI.
           | 
           | Off the top of my head: GPT-2, Codex, GPT-3 in three
           | different flavors (babbage, curie, davinci), GPT-3.5.
           | 
           | Suggesting that GPT-4 was "fourth" simply isn't credible.
           | 
           | Just the other day they announced a jump from o1 to o3,
           | skipping o2 purely because it's already the name of a major
           | telecommunications brand in Europe. Deriving anything from
           | the names of OpenAI's products doesn't make sense.
        
             | benatkin wrote:
             | While I'm sure it's unintentional, that amounts to
             | nitpicking. I can easily find three to include and pass
             | over the rest. Face value turns out to be a decent
             | approximation.
        
             | vasco wrote:
             | Imagine coming up with a naming scheme for the versioning
             | of your product, only for it to fail the second time you
             | want to use it.
        
           | lelandfe wrote:
           | It's more like saying "the Audi Quattro, the company's fourth
           | car..."
        
             | benatkin wrote:
             | Because there's an Audi Tre e Mezzo?
        
           | dghlsakjg wrote:
           | The issue isn't the grammar. It is that there are 5 distinct
           | LLMs from OpenAI that you can use right now as well as 4
           | others that were deprecated in 2024.
        
       | selimnairb wrote:
       | I'm not smart enough or interesting enough to be hired by OpenAI
       | to expertly solve problems and explain my reasoning to the AI.
       | However, I like to think there isn't enough money in the world
       | for me to sell out my colleagues like that.
        
       | bwhiting2356 wrote:
       | I want AI to help me in the physical world: folding my laundry,
       | cooking and farming healthy food, cleaning toilets. Training data
       | is not lying around on the internet for free, but it's also not
       | impossible. How much data do you need? A dozen warehouses full of
       | robots folding and unfolding laundry 24/7 for a few months?
        
         | bobxmax wrote:
         | We are close. Language models and large vision models have
         | transformed robotics. It just takes some time to get hardware
         | up and running.
        
           | kelnos wrote:
           | I think it would be many decades before I'd trust a robot
           | like that around small children or pets. Robots with that
           | kind of movement capability, as well as the ability to pick
           | up and move things around, will be heavy enough that a small
           | mistake could easily kill a small child or pet.
        
             | viraptor wrote:
             | That's a solved problem for small devices. And we
             | effectively have "robots" like that all over the place.
             | Sliding doors in shops/trains/elevators have been around
             | for ages and they include sensors for resistance. Unless
             | there's 1. extreme cost cutting, or 2. a bug in the hardware,
             | devices like that wouldn't kill children these days.
        
             | layer8 wrote:
             | Even for adults, a robot that would likely have to be
             | nearly as massive as a human being in order to do laundry
             | and the like would creep me out, moving freely through my
             | place.
        
           | leonheld wrote:
           | > have transformed robotics
           | 
           | Did they? Where? Seriously, I genuinely want to know who is
           | employing these techniques.
        
             | bobxmax wrote:
             | All frontier labs are now employing LVMs or LLMs. But
             | that's my point: you won't see the fruits of it this
             | early.
        
               | achierius wrote:
               | That's the point being made. It's transformed robotics
               | research, yes, but it remains to be seen both whether it
               | will have a truly transformative effect on the field as
               | experienced by people outside academia (I think this is
               | quite probable) and, more pointedly, _when_.
        
             | fragmede wrote:
             | https://www.figure.ai/
             | 
             | specifically their speech demo video (which is, of course,
             | a demo video)
             | 
             | https://youtu.be/Sq1QZB5baNw
             | 
             | https://www.1x.tech/neo and
             | 
             | https://www.unitree.com/h1/
             | 
             | are undoubtedly using such models.
             | 
             | It's an area of active research, eg
             | 
             | https://www.physicalintelligence.company/blog/pi0
             | 
             | https://wholebody-b1.github.io/
             | 
             | https://ok-robot.github.io/
             | 
             | https://mobile-aloha.github.io/
        
         | SpicyLemonZest wrote:
         | Laundry folding is an instructive example. Machines have been
         | capable of home-scale laundry folding for over a decade, with
         | two companies, Foldimate and Laundroid, building functional
         | prototypes. The challenge is making it cost-competitive in a
         | world where most people don't even purchase a $10 folding
         | board.
         | 
         | I would guess that most cooking and cleaning tasks are in
         | basically the same space. You don't need fine motor control to
         | clean a toilet bowl, but you've gotta figure out how to get
         | people to buy the well-proven premisting technology before
         | you'll be able to sell them a toilet-cleaning robot.
        
           | layer8 wrote:
           | Counterexample: Everyone uses dishwashers. Yet I don't think
           | we'll have a robot doing the dishes human-style, or even just
           | filling up and clearing out a dishwasher, within the next
           | decade or two, regardless of price.
        
       | Animats wrote:
       | _" Orion's problems signaled to some at OpenAI that the more-is-
       | more strategy, which had driven much of its earlier success, was
       | running out of steam."_
       | 
       | So LLMs finally hit the wall. For a long time, more data, bigger
       | models, and more compute to drive them worked. But that's
       | apparently not enough any more.
       | 
       | Now someone has to have a new idea. There's plenty of money
       | available if someone has one.
       | 
       | The current level of LLM would be far more useful if someone
       | could get a conservative confidence metric out of the internals
       | of the model. This technology desperately needs to output "Don't
       | know" or "Not sure about this, but ..." when appropriate.
        
         | synapsomorphy wrote:
         | The new idea is already here and it's reasoning / chain of
         | thought.
         | 
         | Anecdotally Claude is pretty good at knowing the bounds of its
         | knowledge.
        
         | whoisthemachine wrote:
         | Unfortunately, the best they can do is "This is my confidence
         | in what someone would say given the prior context".
        
         | briga wrote:
         | What wall? Not a week has gone by in recent years without an
         | LLM breaking new benchmarks. There is little evidence to
         | suggest it will all come to a halt in 2025.
        
           | jrm4 wrote:
           | Sure, but "benchmarks" here seem roughly as useful as
           | "benchmarks" for GPUs or CPUs, which don't translate well to
           | what the makers of GPT need, which is money-making use
           | cases.
        
           | peepeepoopoo98 wrote:
           | O3 has demonstrated that OpenAI needs 1,000,000% more
           | inference-time compute to score 50% higher on benchmarks. If
           | O3-High costs about $350k an hour to operate, that would mean
           | making O4 score 50% higher would cost _$3.5B_ (!!!) an hour.
           | _That_ scaling wall.
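           | 
           | (Spelling out the arithmetic: 1,000,000% more compute is
           | roughly a 10,000x multiplier, and $350k/hour x 10,000 comes
           | to about $3.5B/hour.)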
        
             | Kuinox wrote:
             | Wait a few months and they will have a distilled model
             | with the same performance at 1% of the running cost.
        
               | peepeepoopoo98 wrote:
               | 100X efficiency improvement (doubtful) still means that
               | costs grow 200X faster than benchmark performance.
        
               | achierius wrote:
               | Even assuming that past rates of inference cost scaling
               | hold up, we would only expect a 2 OoM decrease after
               | about a year or so. And 1% of 3.5b is still a very large
               | number.
        
             | norir wrote:
             | I used to run a lot of monte carlo simulations where the
             | error is proportional to the inverse square root. There was
             | a huge advantage of running for an hour vs a few minutes,
             | but you hit the diminishing returns depressingly quickly.
             | It would not surprise me at all if llms end up having
             | similar scaling properties.
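             | 
             | A toy illustration of that inverse-square-root behavior
             | (the pi-estimation example here is mine, not from the
             | original simulations, but the scaling is the same):
             | 
             |     import math, random
             | 
             |     def estimate_pi(n):
             |         # Fraction of random points in the unit square
             |         # that land inside the quarter circle, times 4.
             |         inside = sum(random.random() ** 2
             |                      + random.random() ** 2 <= 1.0
             |                      for _ in range(n))
             |         return 4.0 * inside / n
             | 
             |     for n in (10_000, 100_000, 1_000_000):
             |         err = abs(estimate_pi(n) - math.pi)
             |         # Error shrinks like 1/sqrt(n): 100x more work
             |         # buys only about a 10x smaller error.
             |         print(n, round(err, 5),
             |               round(1 / math.sqrt(n), 5))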
        
             | oceanplexian wrote:
             | I'm convinced they're getting good at gaming the
             | benchmarks, since 4 has deteriorated via ChatGPT. In fact,
             | I've used 4-0125 and 4-1106 via the API and find them far
             | superior to o1 and o1-mini at coding problems. GPT-4 is an
             | amazing tool, but its true capabilities are being hidden
             | from the public and/or intentionally neutered.
        
             | og_kalu wrote:
             | Not really. o3 at low compute still stomps the benchmarks
             | and isn't anywhere near that expensive, and o3-mini seems
             | better than o1 while being cheaper.
             | 
             | Combine that with the fact that LLM inference costs have
             | dropped by orders of magnitude over the last few years,
             | and harping on the inference costs of a new release seems
             | a bit silly.
        
         | simonw wrote:
         | The new idea is inference-time scaling, as seen in o1 (and o3
         | and Qwen's QwQ and DeepSeek's DeepSeek-R1-Lite-Preview and
         | Google's gemini-2.0-flash-thinking-exp).
         | 
         | I suggest reading these two pieces about that:
         | 
         | - https://www.aisnakeoil.com/p/is-ai-progress-slowing-down -
         | best explanation I've seen of inference scaling anywhere
         | 
         | - https://arcprize.org/blog/oai-o3-pub-breakthrough - Francois
         | Chollet's deep dive into o3
         | 
         | I've been tracking it on this tag on my blog:
         | https://simonwillison.net/tags/inference-scaling/
        
           | exhaze wrote:
           | I think the wildest thing is actually Meta's latest paper
           | where they show a method for LLMs reasoning not in English,
           | but in _latent space_
           | 
           | https://arxiv.org/pdf/2412.06769
           | 
           | I've done research myself adjacent to this (mapping parts of
           | a latent space onto a manifold), but this is a bit eerie,
           | even to me.
        
             | asadalt wrote:
             | Kinda how we do it. Language is just an I/O interface (but
             | also neural, obviously) on top of our reasoning engine.
        
             | ynniv wrote:
             | Is it "eerie"? LeCun has been talking about it for some
             | time, and may also be OpenAI's rumored q-star. You can't
             | hill climb tokens, but you can climb manifolds.
        
         | knapcio wrote:
         | I'm wondering whether O3 can be used to explore its own
         | improvement or optimization ideas, or if it hasn't reached that
         | point yet.
        
         | Yizahi wrote:
         | To output "don't know", a system needs to "know" too. A random
         | token generator can't know. It can guess better and better,
         | maybe it can even guess right 99.99% of the time, but it can't
         | know; it can't decide or reason (not even o1 can "reason").
        
       | Yizahi wrote:
       | GPT-5 is not behind schedule. GPT-5 is called GPT-4o, and it was
       | already released half a year ago. It was not revolutionary
       | enough to be called 5, and prophet saint Altman was probably
       | afraid to release a new generation that wasn't exponentially
       | better, so it was rebranded at the last moment. It's speculation
       | of course, but it is kinda obvious speculation.
        
         | glenstein wrote:
         | >GPT-5 is called GPT-4o
         | 
         | This is the first I have heard of this in particular. Do you
         | know of any article or source for more on the efforts to train
         | GPT 5 and the decision to call it GPT 4o?
        
       | phillipcarter wrote:
       | More palace intrigue, sigh.
       | 
       | Meanwhile, the biggest opportunity lies not in whatever next
       | thing OpenAI releases, but the rest of the enormous software
       | industry actually integrating this technology and realizing the
       | value it can deliver.
        
       ___________________________________________________________________
       (page generated 2024-12-22 23:00 UTC)