hngopher.com

       [HN Gopher] Learning to Reason with LLMs
       ___________________________________________________________________
        
       Learning to Reason with LLMs
        
       Author : fofoz
       Score  : 1289 points
       Date   : 2024-09-12 17:08 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | fsflover wrote:
       | Dupe: https://news.ycombinator.com/item?id=41523050
        
       | farresito wrote:
       | Damn, that looks like a big jump.
        
         | deisteve wrote:
         | so o1 seems like it has real measurable edge, crushing it in
         | every single metric, i mean 1673 elo is insane, and 89th
         | percentile is like a whole different league, and it looks like
         | it's not just a one off either, it's consistently performing
         | way better than gpt-4o across all the datasets, even in the
         | ones where gpt-4o was already doing pretty well, like math and
         | mmlu, o1 is just taking it to the next level, and the fact that
         | it's not even showing up in some of the metrics, like mmmu and
         | mathvista, just makes it look even more impressive, i mean
         | what's going on with gpt-4o, is it just a total dud or what,
         | and btw what's the deal with the preview model, is that like a
         | beta version or something, and how does it compare to o1, is it
         | like a stepping stone to o1 or something, and btw has anyone
         | tried to dig into the actual performance of o1, like what's it
         | doing differently, is it just a matter of more training data or
         | is there something more going on, and btw what's the plan for
         | o1, is it going to be released to the public or is it just
         | going to be some internal tool or something
        
           | farresito wrote:
           | > like what's it doing differently, is it just a matter of
           | more training data or is there something more going on
           | 
           | Well, the model doesn't start with "GPT", so maybe they have
           | come up with something better.
        
             | rvnx wrote:
             | It sounds like GPT-4o with a long CoT prompt no ?
        
           | spaceman_2020 wrote:
           | 1673 ELO is wild
           | 
           | If its actually true in practice, I sincerely cannot imagine
           | a scenario where it would be cheaper to hire actual junior or
           | mid-tier developers (keyword: "developers", not architects or
           | engineers).
           | 
           | 1,673 ELO should be able to build very complex, scalable apps
           | with some guidance
        
             | deisteve wrote:
             | currently my workflow is generate some code, run it, if it
             | doesn't run i tell LLM what I expected, it will then
             | produce code and I frequently tell it how to reason about
             | the problem.
             | 
             | with O1 being in the 89th percentile would mean it should
             | be able to think at junior to intermediate level with very
             | strong consistency.
             | 
             | i dont think people in the comments realize the implication
             | of this. previously LLMs were able to only "pattern match"
             | but now its able to evaluate itself (with some guidance
             | ofc) essentially, steering the software into depth of edge
             | cases and reason about it in a way that feels natural to
             | us.
             | 
             | currently I'm copying and pasting stuff and notifying LLM
             | the results but once O1 is available its going to
             | significantly lower that frequency.
             | 
             | For example, I expect it to self evaluate the code its
             | generate and think at higher levels.
             | 
             | ex) oooh looks like this user shouldn't be able to escalate
             | privileges in this case because it would lead to security
             | issues or it could conflict with the code i generated 3
             | steps ago, i'll fix it myself.
        
             | usaar333 wrote:
             | I'm not sure how well codeforces percentiles correlate to
             | software engineering ability. Looking at all the data, it
             | still isn't. Key notes:
             | 
             | 1. AlphaCode 2 was already at 1650 last year.
             | 
             | 2. SWE-bench verified under an agent has jumped from 33.2%
             | to 35.8% under this model (which doesn't really matter).
             | The full model is at 41.4% which still isn't a game changer
             | either.
             | 
             | 3. It's not handling open ended questions much better than
             | gpt-4o.
        
               | deisteve wrote:
               | i think you are right now actually initially i got
               | excited but now i think OpenAI pulled the hype card again
               | to seem relevant as they struggle to be profitable
               | 
               | Claude on the other hand has been fantastic and seems to
               | do similar reasoning behind the scenes with RL
        
               | usaar333 wrote:
               | The model is really impressive to be fair. It's just how
               | economically relevant it is.
        
       | dinobones wrote:
       | Generating more "think out loud" tokens and hiding them from the
       | user...
       | 
       | Idk if I'm "feeling the AGI" if I'm being honest.
       | 
       | Also... telling that they choose to benchmark against CodeForces
       | rather than SWE-bench.
        
         | thelastparadise wrote:
         | Why not? Isn't that basically what humans do? Sit there and
         | think for a while before answering, going down different
         | branches/chains of thought?
        
           | dinobones wrote:
           | This new approach is showing:
           | 
           | 1) The "bitter lesson" may not be true, and there is a
           | fundamental limit to transformer intelligence.
           | 
           | 2) The "bitter lesson" is true, and there just isn't enough
           | data/compute/energy to train AGI.
           | 
           | All the cognition should be happening inside the transformer.
           | Attention is all you need. The possible cognition and
           | reasoning occurring "inside" in high dimensions is much more
           | advanced than any possible cognition that you output into
           | text tokens.
           | 
           | This feels like a sidequest/hack on what was otherwise a
           | promising path to AGI.
        
             | gradus_ad wrote:
             | Does that mean human intelligence is cheapened when you
             | talk out a problem to yourself? Or when you write down
             | steps solving a problem?
             | 
             | It's the exact same thing here.
        
               | barrell wrote:
               | lol come on it's not the exact same thing. At best this
               | is like gagging yourself while you talk about it then
               | engaging yourself when you say the answer. And that
               | presupposing LLMs are thinking in, your words, exactly
               | the same way as humans.
               | 
               | At best it maybe vaguely resembles thinking
        
               | exe34 wrote:
               | > "lol come on"
               | 
               | I've never found this sort of argument convincing. it's
               | very Chalmers.
        
               | youssefabdelm wrote:
               | > Does that mean human intelligence is cheapened when you
               | talk out a problem to yourself?
               | 
               | In a sense, maybe yeah. Of course if one were to really
               | be absolute about that statement it would be absurd, it
               | would greatly overfit the reality.
               | 
               | But it is interesting to assume this statement as true.
               | Oftentimes when we think of ideas "off the top of our
               | heads" they are not as profound as ideas that "come to
               | us" in the shower. The subconscious may be doing 'more'
               | 'computation' in a sense. Lakoff said the subconscious
               | was 98% of the brain, and that the conscious mind is the
               | tip of the iceberg of thought.
        
               | slashdave wrote:
               | The similarity is cosmetic only. The reason it is used is
               | because it's easy to leverage existing work in LLMs, and
               | scaling (although not cheap) is an obvious approach.
        
             | grbsh wrote:
             | On the contrary, this suggests that the bitter lesson is
             | alive and kicking. The bitter lesson doesn't say "compute
             | is all you need", it says "only those methods which allow
             | you to make better use of hardware as hardware itself
             | scales are relevant".
             | 
             | This chain of thought / reflection method allows you to
             | make better use of the hardware as the hardware itself
             | scales. If a given transformer is N billion parameters, and
             | to solve a harder problem we estimate we need 10N billion
             | parameters, one way to do it is to build a GPU cluster 10x
             | larger.
             | 
             | This method shows that there might be another way: instead
             | train the N billion model differently so that we can use
             | 10x of it at inference time. Say hardware gets 2x better in
             | 2 years -- then this method will be 20x better than now!
        
             | seydor wrote:
             | Attention is about similarity/statistical correlation which
             | is fundamentally stochastic , while reasoning needs to be
             | truthful and exact to be successful.
        
             | user9925 wrote:
             | I think it's too soon to tell. Training the next generation
             | of models means building out entire datacenters. So while
             | they wait they have engineers build these sidequests/hacks.
        
             | 93po wrote:
             | Karpathy himself believes that neural networks are
             | perfectly plausible as a key component to AGI. He has said
             | that it doesn't need to be superseded by something better,
             | it's just that everything else around it (especially
             | infrastructure) needs to improve. As one of the most
             | valuable opinions in the entire world on the subject, I
             | tend to trust what he said.
             | 
             | source: https://youtu.be/hM_h0UA7upI?t=973
        
             | authorfly wrote:
             | Imagine instead the bitter lesson says: we can expand an
             | outwards circle in many dimensions of ways to continuously
             | mathematically manipulate data to adjust outputs.
             | 
             | Even the attention-token approach is on the grand scale of
             | things a simple line outwards from the centre; we have not
             | even explored around the centre (with the same compute
             | spend) for things like non-token generation, different
             | layers/different activation functions and norming /
             | query/key/value set up (why do we only use the 3 inherent
             | to contextualising tokens, why not add a 4th matrix for
             | something else?), character, sentence, whole thought,
             | paragraph one-shot generation, positional embeddings which
             | could work differently.
             | 
             | The bitter lesson says there is almost a work completely
             | untouched by our findings for us to explore. The temporary
             | work of non-data approaches can piggy back off a point on
             | the line; it cannot expand it like we can as we exude out
             | from the circle..
        
           | aktuel wrote:
           | Sure, but if I want a human, I can hire a human. Humans also
           | do many other things I don't want my LLM to do.
        
             | forgot_old_user wrote:
             | well it could be a lot cheaper to hire the AI model instead
             | of a human?
        
           | imiric wrote:
           | Except that these aren't thoughts. These techniques are
           | improvements to how the model breaks down input data, and how
           | it evaluates its responses to arrive at a result that most
           | closely approximates patterns it was previously rewarded for.
           | Calling this "thinking" is anthropomorphizing what's really
           | happening. "AI" companies love to throw these phrases around,
           | since it obviously creates hype and pumps up their valuation.
           | 
           | Human thinking is much more nuanced than this mechanical
           | process. We rely on actually understanding the meaning of
           | what the text represents. We use deduction, intuition and
           | reasoning that involves semantic relationships between ideas.
           | Our understanding of the world doesn't require "reinforcement
           | learning" and being trained on all the text that's ever been
           | written.
           | 
           | Of course, this isn't to say that machine learning methods
           | can't be useful, or that we can't keep improving them to
           | yield better results. But these are still methods that mimic
           | human intelligence, and I think it's disingenuous to label
           | them as such.
        
             | golol wrote:
             | It becomes thinking when you reinforcement learn on those
             | Chain-of-Thought generations. The LLM is just a very good
             | initialization.
        
           | slashdave wrote:
           | Without a world model, not really.
        
             | UniverseHacker wrote:
             | The whole thing is a world model- accurately predicting
             | text that describes things happening in a world, can only
             | be done by modeling the world.
        
           | freejazz wrote:
           | Is it?
        
           | NewEntryHN wrote:
           | Yes but with concepts instead of tokens spelling out the
           | written representation of those concepts.
        
         | WXLCKNO wrote:
         | Exploring different approaches and stumbling on AGI eventually
         | through a combination of random discoveries will be the way to
         | go.
         | 
         | Same as Bitcoin being the right combination of things that
         | already existed.
        
           | ActionHank wrote:
           | Crypto being used as an example of how we have moved forward
           | successfully as a species is backward toilet sitting
           | behaviour.
        
       | lloydatkinson wrote:
       | What's with this how many r's in a strawberry thing I keep
       | seeing?
        
         | bn-l wrote:
         | It's a common LLM riddle. Apparently many fail to give the
         | right answer.
        
           | seydor wrote:
           | Somebody please ask o1 to solve it
        
             | lloydatkinson wrote:
             | The link shows it solving it
        
         | dr_quacksworth wrote:
         | LLM are bad at answering that question because inputs are
         | tokenized.
        
         | swalsh wrote:
         | Models don't really predict the next word, they predict the
         | next token. Strawberry is made up of multiple tokens, and the
         | model doesn't truely understand the characters in it... so it
         | tends to struggle.
        
         | runjake wrote:
         | This became something of a meme.
         | 
         | https://community.openai.com/t/incorrect-count-of-r-characte...
        
         | andrewla wrote:
         | What's amazing is that given how LLMs receive input data (as
         | tokenized streams, as other commenters have pointed out) it's
         | remarkable that it can ever answer this question correctly.
        
       | valine wrote:
       | The model performance is driven by chain of thought, but they
       | will not be providing chain of thought responses to the user for
       | various reasons including competitive advantage.
       | 
       | After the release of GPT4 it became very common to fine-tune non-
       | OpenAI models on GPT4 output. I'd say OpenAI is rightly concerned
       | that fine-tuning on chain of thought responses from this model
       | would allow for quicker reproduction of their results. This
       | forces everyone else to reproduce it the hard way. It's sad news
       | for open weight models but an understandable decision.
        
         | tomtom1337 wrote:
         | Can you explain what you mean by this?
        
           | tomduncalf wrote:
           | I think they mean that you won't be able to see the
           | "thinking"/"reasoning" part of the model's output, even
           | though you pay for it. If you could see that, you might be
           | able to infer better how these models reason and replicate it
           | as a competitor
        
           | ffreire wrote:
           | You can see an example of the Chain of Thought in the post,
           | it's quite extensive. Presumably they don't want to release
           | this so that it is raw and unfiltered and can better monitor
           | for cases of manipulation or deviation from training. What GP
           | is also referring to is explicitly stated in the post: they
           | also aren't release the CoT for competitive reasons, so that
           | presumably competitors like Anthropic are unable to use the
           | CoT to train their own frontier models.
        
             | gwd wrote:
             | > Presumably they don't want to release this so that it is
             | raw and unfiltered and can better monitor for cases of
             | manipulation or deviation from training.
             | 
             | My take was:
             | 
             | 1. A genuine, un-RLHF'd "chain of thought" might contain
             | things that shouldn't be told to the user. E.g., it might
             | at some point think to itself, "One way to make an
             | explosive would be to mix $X and $Y" or "It seems like they
             | might be able to poison the person".
             | 
             | 2. They want the "Chain of Thought" as much as possible to
             | reflect the _actual_ reasoning that the model is using; in
             | part so that they can understand what the model is actually
             | thinking. They fear that if they RLHF the chain of thought,
             | the model will self-censor in a way which undermines their
             | ability to see what it 's really thinking
             | 
             | 3. So, they RLHF _only_ the final output, _not_ the CoT,
             | letting the CoT be as frank within itself as any human; and
             | post-filter the CoT for the user.
        
               | Y_Y wrote:
               | RLHF is one thing, but now that the training is done it
               | has no bearing on whether or not you can show the chain
               | of thought to the user.
        
           | teaearlgraycold wrote:
           | Including the chain of thought would provide competitors with
           | training data.
        
           | andrewla wrote:
           | This is a transcription of a literal quote from the article:
           | 
           | > Therefore, after weighing multiple factors including user
           | experience, competitive advantage, and the option to pursue
           | the chain of thought monitoring, we have decided not to show
           | the raw chains of thought to users
        
             | baq wrote:
             | At least they're open about not being open. Very meta
             | OpenAI.
        
         | seydor wrote:
         | The open source/weights models so far have proved that openAI
         | doesn't have some special magic sauce. I m confident we ll soon
         | have a model from Meta or others that s close to this level of
         | reasoning. [Also consider that some of their top researchers
         | have departed]
         | 
         | On a cursory look, it looks like the chain of thought is a long
         | series of chains of thought balanced on each step, with a small
         | backtracking added whenever a negative result occurs, sort of
         | like solving a maze.
        
           | zamalek wrote:
           | I suspect that the largest limiting factor for a competing
           | model will be the dataset. Unless they somehow used GPT4 to
           | generate the dataset somehow, this is an extremely novel
           | dataset to have to build.
        
             | j_maffe wrote:
             | They almost definitely used existing models for generating
             | it. The human feedback part, however, is the expensive
             | aspect.
        
           | tarruda wrote:
           | I would love to see Meta releasing CoT specialized model as a
           | LoRa we can apply to existing 3.1 models
        
         | msp26 wrote:
         | That's unfortunate. When an LLM makes a mistake it's very
         | helpful to read the CoT and see what went wrong (input
         | error/instruction error/random shit)
        
           | dragonwriter wrote:
           | Yeah, exposed chain of thought is more useful as a user, as
           | well as being useful for training purposes.
        
             | riku_iki wrote:
             | I think we may discover that model do some cryptic mess
             | inside instead of some clean reasoning.
        
               | hadlock wrote:
               | Loopback to: "my code works. why does my code work?"
        
             | MVissers wrote:
             | I'd say depends. If the model iterates 100x I'd just say
             | give me the output.
             | 
             | Same with problem solving in my brain: Sure, sometimes it
             | helps to think out loud. But taking a break and let my
             | unconcious do the work is helpful as well. For complex
             | problems that's actually nice.
             | 
             | I think eventually we don't care as long as it works or we
             | can easily debug it.
        
         | ramadis wrote:
         | It'd be helpful if they exposed a summary of the chain-of-
         | thought response instead. That way they'd not be leaking the
         | actual tokens, but you'd still be able to understand the
         | outline of the process. And, hopefully, understand where it
         | went wrong.
        
           | seydor wrote:
           | They do, according to the example
        
           | ashellunts wrote:
           | Exactly that I see in the Android app.
        
         | yunohn wrote:
         | Given the significant chain of thought tokens being generated,
         | it also feels a bit odd to hide it from a cost fairness
         | perspective. How do we believe they aren't inflating it for
         | profit?
        
           | wmf wrote:
           | That sounds like the GPU labor theory of value that was
           | debunked a century ago.
        
             | dragonwriter wrote:
             | No, its the fraud theory of charging for usage that is
             | unaccountable that has been repeatedly proven true when
             | unaccountable bases for charges have been deployed.
        
               | nfw2 wrote:
               | The one-shot models aren't going away for anyone who
               | wants to program the chain-of-thought themselves
        
               | wmf wrote:
               | Yeah, if they are charging for some specific resource
               | like tokens then it better be accurate. But ultimately
               | utility-like pricing is a mistake IMO. I think they
               | should try to align their pricing with the customer value
               | they're creating.
        
             | yunohn wrote:
             | Not sure why you didn't bother to check their pricing page
             | (1) before dismissing my point. They are charging
             | significantly more for both input (3x) and output (4x)
             | tokens when using o1.
             | 
             | Per 1M in/out tokens:
             | 
             | GPT4o - 5$/15$
             | 
             | O1-preview - 15$/60$
             | 
             | (1) https://openai.com/api/pricing
        
               | wmf wrote:
               | My point is that "cost fairness" is not a thing. Either
               | o1 is worth it to you or it isn't.
        
               | dongping wrote:
               | If there's a high premium, then one might want to wait
               | for a year or two for the premium to vanish.
        
               | yunohn wrote:
               | It's really unclear to me what you understood by "cost
               | fairness".
               | 
               | I'm saying if you charge me per brick laid, but you can't
               | show me how many bricks were laid, nor can I calculate
               | how many should have been laid - how do I trust your
               | invoice?
               | 
               | Note: The reason I say all this is because OpenAI is
               | simultaneously flailing for funding, while being
               | inherently unprofitable as it continues to boil the ocean
               | searching for strawberries.
        
         | rglullis wrote:
         | When are they going to change the name to reflect their
         | complete change of direction?
         | 
         | Also, what is going to be their excuse to defend themselves
         | against copyright lawsuits if they are going to
         | "understandably" keep their models closed?
        
         | tcdent wrote:
         | CoT is now their primary method for alignment. Exposing that
         | information would negate that benefit.
         | 
         | I don't agree with this, but it definitely carries higher
         | weight in their decision making than leaking relevant training
         | info to other models.
        
           | zellyn wrote:
           | This. Please go read and understand the alignment argument
           | against exposing chain of thought reasoning.
        
         | amelius wrote:
         | > I'd say OpenAI is rightly concerned that fine-tuning on chain
         | of thought responses from this model would allow for quicker
         | reproduction of their results.
         | 
         | Why? They're called "Open" AI after all ...
        
         | ashellunts wrote:
         | I see chain of thought responses in chatgpt android app.
        
           | ashellunts wrote:
           | Tested cipher example, and it got it right. But "thinking
           | logs" I see in the app looks like a summary of actual chain
           | of thought messages that are not visible.
        
       | p1esk wrote:
       | _after weighing multiple factors including user experience,
       | competitive advantage, and the option to pursue the chain of
       | thought monitoring, we have decided not to show the raw chains of
       | thought to users._
        
         | zaptrem wrote:
         | This also makes them less useful because I can't just click
         | stop generation when they make a logical error re: coding.
        
           | neonbjb wrote:
           | You wouldn't do that to this model. It finds its own mistakes
           | and corrects them as it is thinking through things.
        
             | zaptrem wrote:
             | No model is perfect, the less I can see into what it's
             | "thinking" the less productively I can use it. So much for
             | interpretability.
        
         | sterlind wrote:
         | "Open"AI is such a comically ironic name at this point.
        
         | swalsh wrote:
         | We're not going to give you training data... for a better user
         | experience.
        
         | scosman wrote:
         | Saying "competitive advantage" so directly is surprising.
         | 
         | There must be some magic sauce here for guiding LLMs which
         | boosts performance. They must think inspecting a reasonable
         | number of chains would allow others to replicate it.
         | 
         | They call GPT 4 a model. But we don't know if it's really a
         | system that builds in a ton of best practices and secret
         | tactics: prompt expansion, guided CoT, etc. Dalle was
         | transparent that it automated re-generating the prompts, adding
         | missing details prior to generation. This and a lot more could
         | all be running under the hood here.
        
         | 0x_rs wrote:
         | Lame but not atypical of OpenAI. Too bad, but I'm expecting
         | competitors to follow with this sort of implementation and
         | better. Being able to view the "reasoning" process and
         | especially being able to modify it and re-render the answer may
         | be faster than editing your prompt a few times until you get
         | the desired response, if you even manage to do that.
        
       | skywhopper wrote:
       | No direct indication of what "maximum test time" means, but if
       | I'm reading the obscured language properly, the best scores on
       | standardized tests were generated across a thousand samples with
       | supplemental help provided.
       | 
       | Obviously, I hope everyone takes what any company says about the
       | capabilities of its own software with a huge grain of salt. But
       | it seems particularly called for here.
        
       | immortal3 wrote:
       | Honestly, it doesn't matter for the end user if there are more
       | tokens generated between the AI reply and human message. This is
       | like getting rid of AI wrappers for specific tasks. If the jump
       | in accuracy is actual, then for all practical purposes, we have a
       | sufficiently capable AI which has the potential to boost
       | productivity at the largest scale in human history.
        
         | Lalabadie wrote:
         | It starts to matter if the compute time is 10-100 fold, as the
         | provider needs to bill for it.
         | 
         | Of course, that's assuming it's not priced for market
         | acquisition funded by a huge operational deficit, which is a
         | rarely safe to conclude with AI right now.
        
           | skywhopper wrote:
           | Given that their compute-time vs accuracy charts labeled the
           | compute time axis as logarithmic would worry me greatly about
           | this aspect.
        
       | deisteve wrote:
       | yeah this is kinda cool i guess but 808 elo is still pretty bad
       | for a model that can supposedly code like a human, i mean 11th
       | percentile is like barely scraping by, and what even is the point
       | of simulating codeforces if youre just gonna make a model that
       | can barely compete with a decent amateur, and btw what kind of
       | contest allows 10 submissions, thats not how codeforces works,
       | and what about the time limits and memory limits and all that
       | jazz, did they even simulate those, and btw how did they even get
       | the elo ratings, is it just some arbitrary number they pulled out
       | of their butt, and what about the model that got 1807 elo, is
       | that even a real model or just some cherry picked result, and btw
       | what does it even mean to "perform better than 93% of
       | competitors" when the competition is a bunch of humans who are
       | all over the place in terms of skill, like what even is the
       | baseline for comparison
       | 
       | edit: i got confused with the Codeforce. it is indeed zero shot
       | and O1 is potentially something very new I hope Anthropic and
       | others will follow suit
       | 
       | any type of reasoning capability i'll take it !
        
         | qt31415926 wrote:
         | 808 ELO was for GPT-4o.
         | 
         | I would suggest re-reading more carefully
        
           | deisteve wrote:
           | you are right i read the charts wrong. O1 has significant
           | lead over GPT-4o in the zero shot examples
           | 
           | honestly im spooked
        
       | catchnear4321 wrote:
       | oh wow, something you can roughly model as a diy in a base model.
       | so impressive. yawn.
       | 
       | at least NVDA should benefit. i guess.
        
         | apsec112 wrote:
         | If there's a way to do something like this with Llama I'd love
         | to hear about it (not being sarcastic)
        
           | catchnear4321 wrote:
           | nurture the model have patience and a couple bash scripts
        
             | apsec112 wrote:
             | But what does that mean? I can't do "pip install nurture"
             | or "pip install patience". I can generate a bunch of
             | answers and take the consensus, but we've been able to do
             | that for years. I can do fine-tuning or DPO, but on what?
        
               | catchnear4321 wrote:
               | you want instructions on how to compete with OpenAI?
               | 
               | go play more, your priorities and focus on it being work
               | are making you think this to be harder than it is, and
               | the models can even tell you this.
               | 
               | you don't have to like the answer, but take it seriously,
               | and you might come back and like it quite a bit.
               | 
               | you have to have patience because you likely wont have
               | scale - but it is not just patience with the response
               | time.
        
       | gliiics wrote:
       | Congrats to OpenAI for yet another product that has nothing to do
       | with the word "open"
        
         | sk11001 wrote:
         | And Apple's product line this year? Phones. Nothing to do with
         | fruit. Almost 50 years of lying to people. Names should mean
         | something!
        
           | achrono wrote:
           | Did Apple start their company by saying they will be selling
           | apples?
        
             | sk11001 wrote:
             | What's the statement that OpenAI are making today which you
             | think they're violating? There very well could be one and
             | if there is, it would make sense to talk about it.
             | 
             | But arguments like "you wrote $x in a blog post when you
             | founded your company" or "this is what the word in your
             | name means" are infantile.
        
         | trash_cat wrote:
         | It is open in the sense that everyone can use it.
        
           | bionhoward wrote:
           | Not people working on AI or those who would like to train AI
           | on their logs
        
           | oblio wrote:
           | If they would have launched it with Oracle DB style licensing
           | their company would have been dead in 1 year.
        
           | Hizonner wrote:
           | Only people who exactly share OpenAI's concepts of what
           | "alignment" and "safety" should mean can use it to its full
           | potential.
        
       | RandomThoughts3 wrote:
       | > "Therefore, after weighing multiple factors including user
       | experience, competitive advantage, and the option to pursue the
       | chain of thought monitoring, we have decided not to show the raw
       | chains of thought to users."
       | 
       | Trust us, we have your best intention in mind. I'm still
       | impressed by how astonishingly impossible to like and root for
       | OpenAI is for a company with such an innovative product.
        
       | TheAceOfHearts wrote:
       | Kinda disappointed that they're hiding the thought process.
       | Hopefully the open source community will figure out how to
       | effectively match and replicate what OpenAI is doing.
       | 
       | I wonder how far we are from having a model that can correctly
       | solve a word soup search problem directly from just a prompt and
       | input image. It seems like the crossword example is close. For a
       | word search it would require turning the image into an internal
       | grid representation, prepare the list of words, and do a search.
       | I'd be interested in seeing if this model can already solve the
       | word grid search problem if you give it the correct
       | representation as an input.
        
         | zozbot234 wrote:
         | > Hopefully the open source community will figure out how to
         | effectively match and replicate what OpenAI is doing.
         | 
         | No need for that, there is a Reflection 70B model that does the
         | exact same thing - with chains of thought being separated from
         | the "final answer" via custom 'tag' tokens.
        
           | TheAceOfHearts wrote:
           | Wasn't this the model that was proven to have been faking
           | their benchmarks recently? Or am I thinking of a different
           | model?
        
             | brokensegue wrote:
             | yes. it was fake
        
               | zozbot234 wrote:
               | Some reported benchmarks do seem to be rather dubious,
               | but the 70B model itself is quite real. Sample output:
               | $ ollama run reflection:70b-q4_0       >>> hello
               | <thinking>       To respond to "hello", I'll need to
               | consider several factors:              1. The user's
               | intent: They're likely saying hello as a greeting.
               | 2. Appropriate responses: Common ways to respond to
               | "hello" are:          - Hello back          - Hi
               | - Hey       3. Formality level: Since this is an AI
               | response, I'll aim for a friendly but professional tone.
               | <reflection>       The approach of responding with a
               | standard greeting seems appropriate in this context. It
               | acknowledges the user's hello and provides a polite
               | response.       </reflection>              Given these
               | considerations, the most suitable response would be to
               | echo "hello" back to the user.       </thinking>
               | <output>       Hello!       </output>
        
               | agolio wrote:
               | There was a hackernews post a few days ago, pointing to a
               | reddit thread where some guys proved that the founder/s
               | of relection AI were faking their model by just passing
               | the input to Claude (Sonnet 3.5) and stripping the word
               | "Claude" from the output, amongst other things. Then when
               | they got caught they switched it to GPT 4-o.
               | 
               | After this, I will be very skeptical to anything they
               | claim to achieve.
               | 
               | https://news.ycombinator.com/item?id=41484981
        
             | jslakro wrote:
             | It's the same, for sure the proximity of that little
             | scandal to this announcement is no coincidence.
        
             | Filligree wrote:
             | That's the one.
        
           | staticman2 wrote:
           | That reflection model is in no way comparable to whatever
           | OpenAI is doing.
        
         | rankam wrote:
         | I have access to the model via the web client and it does show
         | the thought process along the way. It shows a little icon that
         | says things like "Examining parser logic", "Understanding data
         | structures"...
         | 
         | However, once the answer is complete the chain of thought is
         | lost
        
           | knotty66 wrote:
           | It's still there.
           | 
           | Where it says "Thought for 20 seconds" - you can click the
           | Chevron to expand it and see what I guess is the entire chain
           | of thought.
        
             | EgoIncarnate wrote:
             | Per OpenAI, it's a summary of the chain of thought, not the
             | actual chain of thought.
        
       | crakenzak wrote:
       | > we are releasing an early version of this model, OpenAI
       | o1-preview, for immediate use in ChatGPT
       | 
       | Awesome!
        
         | dinobones wrote:
         | I am interpreting "immediate use in ChatGPT" the same way
         | advanced voice mode was promised "in the next few weeks."
         | 
         | Probably 1% of users will get access to it, with a 20/message a
         | day rate limit. Until early next year.
        
           | nilsherzig wrote:
           | Rate limit is 30 a week for the big one and 50 for the small
           | one
        
         | afruitpie wrote:
         | Rate limited to 30 messages per week for ChatGPT Plus
         | subscribers at launch: https://openai.com/index/introducing-
         | openai-o1-preview/
        
         | benterix wrote:
         | Read "immediate" in "immediate use" in the same way as "open"
         | in "OpenAI".
        
           | apsec112 wrote:
           | You can use it, I just tried a few minutes ago. It's
           | apparently limited to 30 messages/week, though.
        
             | rvnx wrote:
             | The option isn't there for us (though the blogpost says
             | otherwise), even after CTRL-SHIFT-R, hence the parent
             | comment.
        
       | Ninjinka wrote:
       | Someone give this model an IQ test stat.
        
         | adverbly wrote:
         | You're kidding right? The tests they gave it are probably
         | better tests than IQ tests at determining actually useful
         | problem solving skills...
        
         | Vecr wrote:
         | It can't do large portions of the parts of an IQ test (not
         | multi-modal). Otherwise I think it's essentially superhuman,
         | modulo tokenization issues (please start running byte-by-byte
         | or at least come up with a better tokenizer).
        
       | modeless wrote:
       | > We have found that the performance of o1 consistently improves
       | with more reinforcement learning (train-time compute) and with
       | more time spent thinking (test-time compute).
       | 
       | Wow. So we can expect scaling to continue after all. Hyperscalers
       | feeling pretty good about their big bets right now. Jensen is
       | smiling.
       | 
       | This is the most important thing. Performance today matters less
       | than the scaling laws. I think everyone has been waiting for the
       | next release just trying to figure out what the future will look
       | like. This is good evidence that we are on the path to AGI.
        
         | ffsm8 wrote:
         | It'd be interesting for sure if true. Gotta remember that this
         | is a marketing post though, let's wait a few months and see if
         | its actually true. Things are definitely interesting, wherever
         | these techniques will get us AGI or not
        
         | XCSme wrote:
         | Nvidia stock go brrr...
        
         | acchow wrote:
         | Even when we start to plateau on direct LLM performance, we can
         | still get significant jumps by stacking LLMs together or
         | putting a cluster of them together.
        
         | gizmo wrote:
         | Microsoft, Google, Facebook have all said in recent weeks that
         | they fully expect their AI datacenter spend to accelerate. They
         | are effectively all-in on AI. Demand for nvidia chips is
         | effectively infinite.
        
           | seydor wrote:
           | Until the first LLM that can improve itself occurs. Then
           | $NVDA tanks
        
         | modeless wrote:
         | More, from an OpenAI employee:
         | 
         | > I really hope people understand that this is a new paradigm:
         | don't expect the same pace, schedule, or dynamics of pre-
         | training era. I believe the rate of improvement on evals with
         | our reasoning models has been the fastest in OpenAI history.
         | 
         | > It's going to be a wild year.
         | 
         | https://x.com/willdepue/status/1834294935497179633
        
       | breck wrote:
       | I LOVE the long list of contributions. It looks like the credits
       | from a Christoper Nolan film. So many people involved. Nice care
       | to create a nice looking credits page. A practice worth copying.
       | 
       | https://openai.com/openai-o1-contributions/
        
       | rfw300 wrote:
       | A lot of skepticism here, but these are astonishing results!
       | People should realize we're reaching the point where LLMs are
       | surpassing humans in any task limited in scope enough to be a
       | "benchmark". And as anyone who's spent time using Claude 3.5
       | Sonnet / GPT-4o can attest, these things really are useful and
       | smart! (And, if these results hold up, O1 is much, much smarter.)
       | This is a nerve-wracking time to be a knowledge worker for sure.
        
         | bigstrat2003 wrote:
         | I cannot, in fact, attest that they are useful and smart. LLMs
         | remain a fun toy for me, not something that actually produces
         | useful results.
        
           | pdntspa wrote:
           | I have been deploying useful code from LLMs right and left
           | over the last several months. They are a significant force
           | accelerator for programmers if you know how to prompt them
           | well.
        
             | fiddlerwoaroof wrote:
             | We'll see if this is a good idea when we start having
             | millions of lines of LLM-written legacy code. My experience
             | maintaining such code so far has been very bad:
             | accidentally quadratic algorithms; subtly wrong code that
             | looks right; and un-idiomatic use of programming language
             | features.
        
               | deisteve wrote:
               | ah i see so you're saying that LLM-written code is
               | already showing signs of being a maintenance nightmare,
               | and that's a reason to be skeptical about its adoption.
               | But isn't that just a classic case of 'we've always done
               | it this way' thinking?
               | 
               | legacy code is a problem regardless of who wrote it.
               | Humans have been writing suboptimal, hard-to-maintain
               | code for decades. At least with LLMs, we have the
               | opportunity to design and implement better coding
               | standards and review processes from the start.
               | 
               | let's be real, most of the code written by humans is not
               | exactly a paragon of elegance and maintainability either.
               | I've seen my fair share of 'accidentally quadratic
               | algorithms' and 'subtly wrong code that looks right'
               | written by humans. At least with LLMs, we can identify
               | and address these issues more systematically.
               | 
               | As for 'un-idiomatic use of programming language
               | features', isn't that just a matter of training the LLM
               | on a more diverse set of coding styles and idioms? It's
               | not like humans have a monopoly on good coding practices.
               | 
               | So, instead of throwing up our hands, why not try to
               | address these issues head-on and see if we can create a
               | better future for software development?
        
               | fiddlerwoaroof wrote:
               | Maybe it will work out, but I think we'll regret this
               | experiment because it's the wrong sort of "force
               | accelerator": writing tons of code that should be
               | abstracted rather than just dumped out literally has
               | always caused the worst messes I've seen.
        
               | medvezhenok wrote:
               | Yes, same way that the image model outputs have already
               | permeated the blogosphere and pushed out some artists,
               | the other models will all bury us under a pile of auto-
               | generated code.
               | 
               | We will yearn for the pre-GPT years at some point, like
               | we yearn for the internet of the late 90s/early 2000s.
               | Not for a while though. We're going through the early
               | phases of GPT today, so it hasn't been taken over by the
               | traditional power players yet.
        
               | Eggpants wrote:
               | When the tool is statistical word vomit based, it will
               | never move beyond cool bar trick levels.
        
               | oblio wrote:
               | LLMs will allow us to write code faster and create
               | applications and systems faster.
               | 
               | Which is how we ended up here, which I guess is
               | tolerable, where a webpage with a bit of styling and a
               | table uses up 200MB of RAM.
        
               | pdntspa wrote:
               | Honestly the code it's been giving me has been fairly
               | cromulent. I don't believe in premature optimization and
               | it is perfect for getting features out quick and then I
               | mold it to what it needs to be.
        
             | deisteve wrote:
             | same...but have you considered the broader implications of
             | relying on LLMs to generate code? It's not just about being
             | a 'force accelerator' for individual programmers, but also
             | about the potential impact on the industry as a whole.
             | 
             | If LLMs can generate high-quality code with minimal human
             | input, what does that mean for the wages and job security
             | of programmers? Will companies start to rely more heavily
             | on AI-generated code, and less on human developers? It's
             | not hard to imagine a future where LLMs are used to drive
             | down programming costs, and human developers are relegated
             | to maintenance and debugging work.
             | 
             | I'm not saying that's necessarily a bad thing, but it's
             | definitely something that needs to be considered. As
             | someone who's enthusiastic about the potential of code gen
             | this O1 reasoning capability is going to make big changes.
             | 
             | do you think you'll be willing to take a pay cut when your
             | employer realizes they can get similar results from a
             | machine in a few seconds?
        
               | airstrike wrote:
               | As a society we're not solving for programmer salaries
               | but for general welfare which is basically code for
               | "cheaper goods and services".
        
               | pdntspa wrote:
               | My boss is holding a figurative gun to my head to use
               | this stuff. His performance targets necessitate the use
               | of it. It is what it is.
        
               | oblio wrote:
               | Yeah, but this, in itself, is triggered by a hype wave.
               | These come and go. So we can't really judge the long term
               | impact from inside the wave.
        
               | fragmede wrote:
               | Your job won't be taken by AI, it will be taken by
               | someone wielding AI.
        
             | attilakun wrote:
             | In a way it's not surprising that people are getting vastly
             | different results out of LLMs. People have different skill
             | levels when it comes to using even Google. An LLM has a
             | vastly bigger input space.
        
             | criddell wrote:
             | What's a sample prompt that you've used? Every time I've
             | tried to use one for programming, they invent APIs that
             | don't exist (but sound like they might) or fail to produce
             | something that does what it says it does.
        
               | GaggiX wrote:
               | Have you tried Claude 3.5 Sonnet?
        
               | disgruntledphd2 wrote:
               | Use Python or JS. The models definitely don't seem to
               | perform as well on less hyper prevalent languages.
        
               | randomdata wrote:
               | Even then it is hit and miss. If you are doing something
               | that is also copy/paste-able out of a StackOverflow
               | comment, you're apt to be fine, but as soon as you are
               | doing anything slightly less common... Good luck.
        
               | disgruntledphd2 wrote:
               | Yeah, fair. It's good for short snippets and ways of
               | approaching the problem but not great at execution.
               | 
               | It's like infinitely tailored blog posts, for me at
               | least.
        
               | randomdata wrote:
               | True. It can be good at giving you pointers towards
               | approaching the problem, even if the result is flawed,
               | for slightly less common problems. But as you slide even
               | father towards esotericism, there is no hope. It won't
               | even get you in the right direction. Unfortunately - as
               | that is where it would be most useful.
        
               | brianshaler wrote:
               | No matter the prompt, there's a significant difference
               | between how it handles common problems in popular
               | languages (python, JS) versus esoteric algorithms in
               | niche languages or tools.
               | 
               | I had a funny one a while back (granted this was probably
               | ChatGPT 3.5) where I was trying to figure out what
               | payload would get AWS CloudFormation to fix an
               | authentication problem between 2 services and ChatGPT
               | confidently proposed adding some OAuth querystring
               | parameters to the AWS API endpoint.
        
               | pdntspa wrote:
               | I just ask it for what I want in very specific detail,
               | stating the language and frameworks in use. I keep the
               | ideas self-contained -- for example if I need something
               | for the frontend I will ask it to make me a webcomponent.
               | Asking it to not make assumptions and ask questions on
               | ambiguities is also very helpful.
               | 
               | It tends to fall apart on bigger asks with larger
               | context. Breaking your task into discrete subtasks works
               | well.
        
             | unshavedyak wrote:
             | I think that's just the same as using an autocomplete
             | efficiently, though. I tend to like them for Search, but
             | not for anything i have to "prompt correctly" because i
             | feel like i can type fast enough that i'm not too worried
             | about auto-completing.
             | 
             | With that said i'm not one of those "It's just a parrot!"
             | people. It is, definitely just a parrot atm.. however i'm
             | not convinced _we 're not parrots_ as well. Notably i'm not
             | convinced that that complexity won't be sufficient to walk
             | talk and act like intelligence. I'm not convinced that
             | intelligence is different than complexity. I'm not an
             | expert though, so this is just some dudes stupid opinion.
             | 
             | I suspect if LLMs can prove to have duck-intelligence (ie
             | duck typing but for intelligence) then it'll only be
             | achieved in volumes much larger than we imagine. We'll
             | continue to refine and reduce how much volume is necessary,
             | but nevertheless i expect complexity to be the real
             | barrier.
        
           | deisteve wrote:
           | 'Not useful' is a pretty low bar to clear, especially when
           | you consider the state of the art just 5 years ago. LLMs may
           | not be solving world hunger, but they're already being used
           | in production for coding
           | 
           | If you're not seeing value in them, maybe it's because you're
           | not looking at the right problems. Or maybe you're just not
           | using them correctly. Either way, dismissing an entire field
           | of research because it doesn't fit your narrow use case is
           | pretty short-sighted.
           | 
           | FWIW, I've been using LLMs to generate production code and
           | it's saved me weeks if not months. YMMV, I guess
        
           | rfw300 wrote:
           | It's definitely the case that there are _some_ programming
           | workflows where LLMs aren't useful. But I can say with
           | certainty that there are many where they have become
           | incredibly useful recently. The difference between even GPT-4
           | last year and C3.5 /GPT-4o this year is profound.
           | 
           | I recently wrote a complex web frontend for a tool I've been
           | building with Cursor/Claude and I wrote maybe 10% of the
           | code; the rest with broad instructions. Had I done it all
           | myself (or even with GitHub Copilot only) it would have taken
           | 5 times longer. You can say this isn't the most complex task
           | on the planet, but it's real work, and it matters a lot! So
           | for increasingly many, regardless of your personal
           | experience, these things have gone far beyond "useful toy".
        
             | uoaei wrote:
             | The sooner those paths are closed for low-effort high-pay
             | jobs, the better, IMO. All this money for no work is going
             | to our heads.
             | 
             | It's time to learn some real math and science, the era of
             | regurgitating UI templates is over.
        
               | rfw300 wrote:
               | I don't want to be in the business of LLM defender, but
               | it's just hard to imagine this aging well when you step
               | back and look at the pace of advancement here. In the
               | realm of "real math and science", O1 has improved from 0%
               | to 50% on AIME today. A year ago, LLMs could only write
               | little functions, not much better than searching
               | StackOverflow. Today, they can write thousands of lines
               | of code that work together with minimal supervision.
               | 
               | I'm sure this tech continues to have many limitations,
               | but every piece of trajectory evidence we have points in
               | the same direction. I just think you should be prepared
               | for the ratio of "real" work vs. LLM-capable work to
               | become increasingly small.
        
               | oblio wrote:
               | > The sooner those paths are closed for low-effort high-
               | pay jobs, the better, IMO. All this money for no work is
               | going to our heads.
               | 
               | > It's time to learn some real math and science, the era
               | of regurgitating UI templates is over.
               | 
               | You do realize that software development was one of the
               | last social elevators, right?
               | 
               | What you're asking for won't happen, let alone the fact
               | that "real math and science" pay a pittance, there's a
               | reason the pauper mathematician was a common meme.
        
           | bongodongobob wrote:
           | At this point, you're either saying "I don't understand how
           | to prompt them" or "I'm a Luddite". They are useful, here to
           | stay, and only getting better.
        
           | baq wrote:
           | Familiarize yourself with a tool which does half the
           | prompting for you, e.g. cursor is pretty good at prompting
           | claude 3.5 and it really does make code edits 10x faster (I'm
           | not even talking about the fancy stuff about generating apps
           | in 5 mins - just plain old edits.)
        
         | jimkoen wrote:
         | Is it? They talk about 10k attempts to reach gold medal status
         | in the mathematics olympiad, but zero shot performance doesn't
         | even place it in the upper 50th percentile.
         | 
         | Maybe I'm confused but 10k attempts on the same problem set
         | would make anyone an expert in that topic? It's also weird that
         | zero shot performance is so bad, but over a lot of attempts it
         | seems to get correct answers? Or is it learning from previous
         | attempts? No info given.
        
           | joshribakoff wrote:
           | The correct metaphor is that 10,000 attempts would allow
           | anyone to cherry pick a successful attempt. You're conflating
           | cherry picking with online learning. This is like if an
           | entire school of students randomized their answers on a
           | multiple choice test, and then you point to someone who
           | scored 100% and claim it is proof of the school's expertise.
        
             | jimkoen wrote:
             | Yeah but how is it possible that it has such a high margin
             | of error? 10k attempts is insane! Were talking about an
             | error margin of 50%! How can you deliver "expert reasoning"
             | with such an error margin?
        
           | zone411 wrote:
           | That's not what "zero shot" means.
        
           | rfw300 wrote:
           | It's undeniably less impressive than a human on the same
           | task, but who cares at the end of the day? It can do 10,000
           | attempts in the time a person can do 1. Obviously improving
           | that ratio will help for any number of reasons, but if you
           | have a computer that can do a task in 5 minutes that will
           | take a human 3 hours, it doesn't necessarily matter very much
           | how you got there.
        
             | jsheard wrote:
             | How long does it take the operator to sift through those
             | 10,000 attempts to find the successful one, when it's not a
             | contrived benchmark where the desired answer is already
             | known ahead of time? LLMs generally don't know when they've
             | failed, they just barrel forwards and leave the user to
             | filter out the junk responses.
        
               | jimkoen wrote:
               | I have an idea! We should train an LLM with reasoning
               | capabilities to sift through all the attempts! /s
        
               | johnny22 wrote:
               | why /s ? Isn't that an approach some people are actually
               | trying to take?
        
             | miki123211 wrote:
             | Even if it's the other way around, if the computer takes 3
             | hours on a task that a human can do in 5 minutes, using the
             | computer might still be a good idea.
             | 
             | A computer will never go on strike, demand better working
             | conditions, unionize, secretly be in cahoots with your
             | competitor or foreign adversary, play office politics,
             | scroll through Tiktok instead of doing its job, or cause an
             | embarrassment to your company by posting a politically
             | incorrect meme on its personal social media account.
        
           | RigelKentaurus wrote:
           | The blog says "With a relaxed submission constraint, we found
           | that model performance improved significantly. When allowed
           | 10,000 submissions per problem, the model achieved a score of
           | 362.14 - above the gold medal threshold - even without any
           | test-time selection strategy."
           | 
           | I am interpreting this to mean that the model tried 10K
           | approaches to solve the problem, and finally selected the one
           | that did the trick. Am I wrong?
        
             | jimkoen wrote:
             | > Am I wrong?
             | 
             | That's the thing, did the operator select the correct
             | result or did the model check it's own attempts? No info
             | given whatsoever in the article.
        
           | gizmo wrote:
           | Even if you disregard the Olympiad performance OpenAI-O1 is,
           | if the charts are to be believed, a leap forward in
           | intelligence. Also bear in mind that AI researchers are not
           | out of ideas on how to make models better and improvements in
           | AI chips are the metaphorical tide that lifts all boats. The
           | trend is the biggest story here.
           | 
           | I get the AI skepticism because so much tech hype of recent
           | years turned out to be hot air (if you're generous, obvious
           | fraud if you're not). But AI tools available toady, once you
           | get the hang of using them, are pretty damn amazing already.
           | Many jobs can be fully automated with AI tools that exist
           | today. No further breakthroughs required. And although I
           | still don't believe software engineers will find themselves
           | out of work anytime soon, I can no longer completely rule it
           | out either.
        
         | apsec112 wrote:
         | Even without AI, it's gotten ~10,000 times easier to write
         | software than in the 1950s (eg. imagine trying to write PyTorch
         | code by hand in IBM 650 assembly), but the demand for software
         | engineering has only increased, because demand increases even
         | faster than supply does. Jevons paradox:
         | 
         | https://en.wikipedia.org/wiki/Jevons_paradox
        
           | aantix wrote:
           | The number of tech job postings has tanked - which loosely
           | correlates with the rise of AI.
           | 
           | https://x.com/catalinmpit/status/1831768926746734984
        
             | apsec112 wrote:
             | GPT-4 came out in March 2023, after most of this drop was
             | already finished.
        
             | disgruntledphd2 wrote:
             | And also with a large increase in interest rates.
        
             | macinjosh wrote:
             | The tanking is more closely aligned with new tax rules that
             | went to effect that make it much harder to claim dev time
             | as an expense.
        
             | Meekro wrote:
             | I'm skeptical because "we fired half our programmers and
             | our new AI does their jobs as well as they did" is a story
             | that would tear through the Silicon Valley rumor mill. To
             | my knowledge, this has not happened (yet).
        
             | guluarte wrote:
             | this drop is more related to the FED increasing the
             | interest rates
        
             | bognition wrote:
             | The local decline in open software engineering positions
             | has _nothing_ to do with AI. The best orgs are using AI to
             | assist developers in building out new systems and write
             | tests. Show me someone who is doing anything bigger than
             | that, please I'd love to be proven wrong.
             | 
             | The big decline is driven by a few big factors. Two of
             | which are 1- the overhiring that happened in 2021. This was
             | followed by the increase of interest rates which
             | dramatically constrained the money supply. Investors
             | stopped preferring growth over profits. This shift in
             | investor preferences is reflected in engineering orgs
             | tightening their budgets as they are no longer rewarded for
             | unbridled growth.
        
               | nickfromseattle wrote:
               | Plus the tax code requiring amortization of developer
               | salaries over 5 years instead of the year the salary
               | expense is incurred.
        
           | randomdata wrote:
           | _> it 's gotten ~10,000 times easier to write software than
           | in the 1950s_
           | 
           | It seems many of the popular tools want to make writing
           | software harder than in the 2010s, though. Perhaps their
           | stewards believe that if they keep making things more and
           | more unnecessarily complicated, LLMs won't be able to keep
           | up?
        
         | afavour wrote:
         | > People should realize we're reaching the point where LLMs are
         | surpassing humans in any task limited in scope enough to be a
         | "benchmark".
         | 
         | Can you explain what this statement means? It sounds like
         | you're saying LLMs are now smart enough to be able to jump
         | through arbitrary hoops but are not able to do so when taken
         | outside of that comfort zone. If my reading is correct then it
         | sounds like skepticism is still warranted? I'm not trying to be
         | an asshole here, it's just that my #1 problem with anything AI
         | is being able to separate fact from hype.
        
           | rfw300 wrote:
           | I think what I'm saying is a bit more nuanced than that. LLMs
           | currently struggle with very "wide", long-run reasoning tasks
           | (e.g., the evolution over time of a million-line codebase).
           | That isn't because they are secretly stupid and their
           | capabilities are all hype, it's just that this technology
           | currently has a different balance of strengths and weaknesses
           | than human intelligence, which tends to more smoothly
           | extrapolate to longer-horizon tasks.
           | 
           | We are seeing steady improvement on long-run tasks (SWE-Bench
           | being one example) and much more improvement on shorter, more
           | well-defined tasks. The latter capabilities aren't "hype" or
           | just for show, there really is productive work like that to
           | be done in the world! It's just not everything, yet.
        
         | crystal_revenge wrote:
         | I have written a ton of evaluations and run countless
         | benchmarks and I'm not even close to convinced that we're at
         | 
         | > the point where LLMs are surpassing humans in any task
         | limited in scope enough to be a "benchmark"
         | 
         | so much as we're over-fitting these bench marks (and in many
         | cases fishing for a particular way of measuring the results
         | that looks more impressive).
         | 
         | While it's great that the LLM community has so many benchmarks
         | and cares about attempting to measure performance, these
         | benchmarks are becoming an increasingly poor signal.
         | 
         | > This is a nerve-wracking time to be a knowledge worker for
         | sure.
         | 
         | It might because I'm in this space, but I personally feel like
         | this is the _best_ time to working in tech. LLMs still are
         | awful at things requiring true expertise while increasingly
         | replacing the need for mediocre programmers and dilettantes. I
         | 'm increasingly seeing the quality of the technical people I'm
         | working with going up. After years of being stuck in rooms with
         | leetcode grinding TC chasers, it's very refreshing.
        
         | skepticATX wrote:
         | > People should realize we're reaching the point where LLMs are
         | surpassing humans in any task limited in scope enough to be a
         | "benchmark
         | 
         | This seems like a bold statement considering we have so few
         | benchmarks, and so many of them are poorly put together.
        
         | grbsh wrote:
         | I like your phrasing - "any task limited in scope enough to be
         | a 'benchmark'". Exactly! This is the real gap with LLMs, and
         | will continue to be an issue with o1 -- sure, if you can write
         | down all of the relevant context information you need to
         | perform some computation, LLMs should be able to do it. In
         | other words, LLMs are calculators!
         | 
         | I'm not especially nerve-wracked about being a knowledge
         | worker, because my day-to-day doesn't consist of being handed a
         | detailed specification of exactly what is required, and then me
         | 'computing' it. Although this does sound a lot like what a
         | product manager does!
        
         | rvz wrote:
         | > And as anyone who's spent time using Claude 3.5 Sonnet /
         | GPT-4o can attest, these things really are useful and smart!
         | (And, if these results hold up, O1 is much, much smarter.) This
         | is a nerve-wracking time to be a knowledge worker for sure.
         | 
         | If you have to keep checking the result of an LLM, you do not
         | trust it enough to give you the correct answer.
         | 
         | Thus, having to 'prompt' hundreds of times for the answer you
         | believe is correct over something that claims to be smart -
         | which is why it can confidently convince others that its answer
         | is correct (even when it can be totally erroneous).
         | 
         | I bet if Google DeepMind announced the exact same product, you
         | would equally be as skeptical with its cherry-picked results.
        
         | latexr wrote:
         | > And as anyone who's spent time using Claude 3.5 Sonnet /
         | GPT-4o can attest, these things really are useful and smart!
         | 
         | I have spent significant time with GPT-4o, and I disagree. LLMs
         | are as useful as a random forum dweller who recognises your
         | question as something they read somewhere at some point but are
         | too lazy to check so they just say the first thing which comes
         | to mind.
         | 
         | Here's a recent example I shared before: I asked GPT-4o which
         | Monty Python members have been knighted (not a trick question,
         | I wanted to know). It answered Michael Palin and Terry Gilliam,
         | and that they had been knighted for X, Y, and Z (I don't recall
         | the exact reasons). Then I verified the answer on the BBC,
         | Wikipedia, and a few others, and determined only Michael Palin
         | has been knighted, _and those weren't even the reasons_.
         | 
         | Just for kicks, I then said I didn't think Michael Palin had
         | been knighted. It promptly apologised, told me I was right, and
         | that only Terry Gilliam had been knighted. Worse than useless.
         | 
         | Coding-wise, it's been hit or miss with way more misses. It can
         | be half-right if you ask it uninteresting boilerplate crap
         | everyone has done hundreds of times, but for anything even
         | remotely interesting it falls flatter than a pancake under a
         | steam roller.
        
           | gizmo wrote:
           | I asked GPT-4o and I got the correct answer in one shot:
           | 
           | > Only one Monty Python member, Michael Palin, has been
           | knighted. He was honored in 2019 for his contributions to
           | travel, culture, and geography. His extensive work as a
           | travel documentarian, including notable series on the BBC,
           | earned him recognition beyond his comedic career with Monty
           | Python (NERDBOT) (Wikipedia).
           | 
           | > Other members, such as John Cleese, declined honors,
           | including a CBE (Commander of the British Empire) in 1996 and
           | a peerage later on (8days).
           | 
           | Maybe you just asked the question wrong. My prompt was "which
           | monty python actors have been knighted. look it up and give
           | the reasons why. be brief".
        
             | latexr wrote:
             | Yes yes, there's always some "you're holding it wrong"
             | apologist.1 Look, it's not a complicated question to ask
             | unambiguously. If you understand even a tiny bit of how
             | these models work, you know you can make _the exact same
             | question_ twice in a row and get wildly different answers.
             | 
             | The point is that you never know what you can trust or not.
             | Unless you're intimately familiar with Monty Python
             | history, you only know you got the correct answer in one
             | shot because I already told you what the right answer is.
             | 
             | Oh, and by the way, I just asked GPT-4o the same question,
             | _with your phrasing, copied verbatim_ and it said _two_
             | Pythons were knighted: Michael Palin (with the correct
             | reasons this time) and John Cleese.
             | 
             | 1 And I've had enough discussions on HN where someone
             | insists on the correct way to prompt, then they do it and
             | get wrong answers. Which they don't realise until they
             | shared it and disproven their own argument.
        
               | oblio wrote:
               | Unless I'm mistaken, isn't all the math behind them...
               | ultimately probabilistic? Even theoretically they can't
               | guarantee the same answer. I'm agreeing with you, by the
               | way, just curious if I'm missing something.
        
               | gizmo wrote:
               | If you take a photo the photons hitting the camera sensor
               | do so in a probabilistic fashion. Still, in sufficient
               | light you'll get the same picture every time you press
               | the shutter button. In near darkness you'll get a random
               | noise picture every time.
               | 
               | Similarly language models are probabilistic and yet they
               | get the easiest questions right 100% of the time with
               | little variability and the hardest prompts will return
               | gibberish. The point of good prompting is to get useful
               | responses to questions at the boundary of what the
               | language model is capable of.
               | 
               | (You can also configure a language model to generate the
               | same output for every prompt without any random noise.
               | Image models for instance generate exactly the same image
               | pixel for pixel when given the same seed.)
        
               | latexr wrote:
               | The photo comparison is disingenuous. Light and colour
               | information can be disorganised to a large extent and yet
               | you still perceive the same from an image. You can grab a
               | photo and apply to it a red filter or make it black and
               | white and still understand what's in there, what it
               | means, and how it compares to reality.
               | 
               | In comparison, with text a single word can change the
               | entire meaning of a sentence, paragraph, or idea. The
               | same word in different parts of a text can make all the
               | difference between clarity and ambiguity.
               | 
               | It makes no difference how good your prompting is, some
               | things are simply unknowable by an LLM. I repeatedly
               | asked GPT-4o how many Magic: The Gathering cards based on
               | Monty Python exist. It said there are none (wrong)
               | because they didn't exist yet at the cut off date of its
               | training. No amount of prompting changes that, unless you
               | steer it by giving it the answer (at which point there
               | would have been no point in asking).
               | 
               | Furthermore, there's no seed that guarantees truth in all
               | answers or the best images in all cases. Seeds matter for
               | reproducibility, they are unrelated to accuracy.
        
               | gizmo wrote:
               | Language is fuzzy in exactly the same way. LLMs can
               | create factually correct responses in dozens of languages
               | using endless variations in phrasing. You fixate on the
               | kind of questions that current language models struggle
               | with but you forget that for millions of easier questions
               | modern language models already respond with a perfect
               | answer every time.
               | 
               | You think the probabilistic nature of language models is
               | a fundamental problem that puts a ceiling on how smart
               | they can become, but you're wrong.
        
               | gizmo wrote:
               | I think your iPhone analogy is apt. Do you want to be the
               | person complaining that the phone drops calls or do you
               | want to hold it slightly differently and get a lot of use
               | out of it?
               | 
               | If you pay careful attention to prompt phrasing you will
               | get a lot more mileage out of these models. That's the
               | bottom line. If you believe that you shouldn't have to
               | learn how to use a tool well then you can be satisfied
               | with your righteous attitude but you won't get anywhere.
        
               | latexr wrote:
               | No one's arguing that correct use of a tool isn't
               | beneficial. The point is that insisting LLMs just need
               | good prompting is delusional and a denial of reality. I
               | have just demonstrated how _your own prompt_ is still
               | capable of producing the wrong result. So either you
               | don't know how to prompt correctly (because if you did,
               | by your own logic it would have produced the right
               | response every time, which it didn't) or the notion that
               | all you need is good prompting is wrong. Which anyone who
               | understands the first thing about these systems knows to
               | be the case.
        
       | hobofan wrote:
       | That naming scheme...
       | 
       | Will the next model be named "1k", so that the subsequent models
       | will be named "4o1k", and we can all go into retirement?
        
         | p1esk wrote:
         | More like you will need to dip into your 401k fund early to pay
         | for it after they raise the prices.
        
       | notamy wrote:
       | https://openai.com/index/introducing-openai-o1-preview/
       | 
       | > ChatGPT Plus and Team users will be able to access o1 models in
       | ChatGPT starting today. Both o1-preview and o1-mini can be
       | selected manually in the model picker, and at launch, weekly rate
       | limits will be 30 messages for o1-preview and 50 for o1-mini. We
       | are working to increase those rates and enable ChatGPT to
       | automatically choose the right model for a given prompt.
       | 
       |  _Weekly_? Holy crap, how expensive is it to run is this model?
        
         | HPMOR wrote:
         | It's probably running several lines of COT. I imagine, each
         | single message you send is probably at __least__ 10x to the
         | actual model. So in reality it's like 300 messages, and
         | honestly it's probably 100x, given how constrained they're
         | being with usage.
        
         | theLiminator wrote:
         | Anyone know when o1 access in ChatGPT will be open?
        
           | tedsanders wrote:
           | Rolling out over the next few hours to Plus users.
        
         | narrator wrote:
         | The human brain uses 20 watts, so yeah we figured out a way to
         | run better than human brain computation by using many orders of
         | magnitude more power. At some point we'll need to reject
         | exponential power usage for more computation. This is one of
         | those interesting civilizational level problems. There's still
         | a lack of recognition that we aren't going to be able to
         | compute all we want to, like we did in the pre-LLM days.
        
           | seydor wrote:
           | we ll ask it to redesign itself for low power usage
        
           | cma wrote:
           | For 20 watts of work on stuff like this for about 4 hours a
           | day counting vacations and weekends and attention span. So 20
           | hours of rest, relaxation, distraction, household errands and
           | stuff, so that maybe bumps it up to 120 watts per work hour.
           | Then 22.5 years of training or so per worker, 45 year work
           | period, 22.5 year retirement. So double it there to 240
           | watts. We can't run brains without bodies, so multiply that
           | by 6 giving 1440 watts + the air conditioning, commuting to
           | school and work, etc., maybe 2000 watts?
           | 
           | We're getting close to parity if things keep getting more
           | efficient as fast as they have been. But that's without
           | accounting for the AI training, which can on the plus side be
           | shared among multiple agents, but on the down side can't
           | really do continuous learning very well without catastrophic
           | forgetting.
        
       | minimaxir wrote:
       | > Therefore, after weighing multiple factors including user
       | experience, competitive advantage, and the option to pursue the
       | chain of thought monitoring, we have decided not to show the raw
       | chains of thought to users.
       | 
       | What? I agree people who typically use the free ChatGPT webapp
       | won't care about raw chain-of-thoughts, but OpenAI is opening an
       | API endpoint for the O1 model and downstream developers very very
       | much care about chain-of-thoughts/the entire pipeline for
       | debugging and refinement.
       | 
       | I suspect "competitive advantage" is the primary driver here, but
       | that just gives competitors like Anthropic an oppertunity.
        
         | Hizonner wrote:
         | They they've taken at least some of the hobbles off for the
         | chain of thought, so the chain of thought will also include
         | stuff like "I shouldn't say <forbidden thing they don't want it
         | to say>".
        
       | npn wrote:
       | "Open"AI. Should be ClosedAI instead.
        
       | cal85 wrote:
       | Sounds great, but so does their "new flagship model that can
       | reason across audio, vision, and text in real time" announced in
       | May. [0]
       | 
       | [0] https://openai.com/index/hello-gpt-4o/
        
         | mickeystreicher wrote:
         | Yep, all these AI announcements from big companies feel like
         | promises for the future rather than immediate solutions. I miss
         | the days when you could actually use a product right after it
         | was announced, instead of waiting for some indefinite "coming
         | soon."
        
         | apsec112 wrote:
         | This one [o1/Strawberry] is available. I have it, though it's
         | limited to 30 messages/week in ChatGPT Plus.
        
           | sbochins wrote:
           | How do you get access? I don't have it and am a ChatGPT plus
           | subscriber.
        
             | apsec112 wrote:
             | I'm using the Android ChatGPT app (and am in the Android
             | Beta program, though not sure if that matters)
        
             | changoplatanero wrote:
             | it will roll out to everyone over the next few hours
        
             | Szpadel wrote:
             | I'm plus subscriber and I have o1-preview and o1-mini
             | available
        
           | aantix wrote:
           | Dang - I don't see the model listed for me in the iOS app nor
           | the web interface.
           | 
           | I'm a ChatGPT subscriber.
        
             | derwiki wrote:
             | Same! And have been a subscriber for 18 months.
        
               | qwertox wrote:
               | I've been a subscriber since close to the beginning,
               | cancelled 2 weeks ago. I got an email telling me that
               | this is available, but only for Plus.
               | 
               | But for 30 posts per week I see no reason to subscribe
               | again.
               | 
               | I prefer to be frustrated because the quality is
               | unreliable because I'm not paying, instead of having an
               | equally unreliable experience as a paying customer.
               | 
               | Not paying feels the same. It made me wonder if they
               | sometimes just hand over the chat to a lower quality
               | model without telling the Plus subscriber.
               | 
               | The only thing I miss is not being able to tell it to run
               | code for me, but it's not worth the frustration.
        
           | ansc wrote:
           | 30 messages _per week_? Wow. You better not miss!
        
             | evilduck wrote:
             | In the world of hype driven vaporware AI products[1],
             | giving people limited access is at least proof they're not
             | lying about it actually existing or it being able to do
             | what they claim.
             | 
             | [1] https://www.reddit.com/r/LocalLLaMA/comments/1fd75nm/ou
             | t_of_...
        
               | ActionHank wrote:
               | Ok, but the point is that they told me I would have
               | flirty ScarJo ASMR whispering to me at bed time that I am
               | a good boy, but that's not what we got is it?
        
               | hobo_in_library wrote:
               | At 30 messages per week they could secretly hire a human
               | to give the responses
        
         | paxys wrote:
         | Agreed. Release announcements and benchmarks always sound
         | world-changing, but the reality is that every new model is
         | bringing smaller practical improvements to the end user over
         | its predecessor.
        
           | zamadatix wrote:
           | The point above is the said amazing multimodal version of
           | ChatGPT was announced in May and are still not the actual
           | offered way to interact with the service in September
           | (despite the model choice being called 4 omni it's still not
           | actually using multimodal IO). It could be a giant leap in
           | practical improvements but it doesn't matter if you can't
           | actually use what is announced.
           | 
           | This one, oddly, seems to actually be launching before that
           | one despite just being announced though.
        
           | jstummbillig wrote:
           | Sonnet 3.5 brought the largest practical improvements to this
           | end user over all predecessors (so far).
        
         | CooCooCaCha wrote:
         | My guess is they're going to incorporate all of these advances
         | into gpt-5 so it looks like a "best of all worlds" model.
        
         | cja wrote:
         | Recently I was starting to think I imagined that. Back then
         | they gave me the impression it would be released within week or
         | so of the announcement. Have they explained the delay?
        
           | Cu3PO42 wrote:
           | It is definitely available today and I believe it was
           | available shortly after the announcement.
        
             | exitb wrote:
             | The text-to-text model is available. And you can use it
             | with the old voice interface that does Whipser+GPT+TTS. But
             | what was advertised is a model capable of direct audio-to-
             | audio. That's not available.
        
         | trustno2 wrote:
         | That is in chatgpt now and it greatly improves chatgpt. What
         | are you on to now?
        
           | vanviegen wrote:
           | Audio has only rolled out to a small subset of paying
           | customers. There's still no word about the direct-from-4o
           | image generation they demo'd. Let alone the video
           | capabilities.
           | 
           | So no, it's not in chatgpt.
        
       | thelastparadise wrote:
       | Wouldn't this introduce new economics into the LLM market?
       | 
       | I.e. if the "thinking loop" budget is parameterized, users might
       | pay more (much more) to spend more compute on a particular
       | question/prompt.
        
         | minimaxir wrote:
         | Depends on how OpenAI prices it.
         | 
         | Given the need for chain-of-thoughts, and that would be
         | budgeted as output, the new model will not be cheap nor fast.
         | 
         | EDIT: Pricing is out and it is definitely not teneable unless
         | you really really have a use case for it.
        
         | sroussey wrote:
         | Yes, and note the large price increase
        
       | p1esk wrote:
       | Do people see the new models in the web interface? Mine still
       | shows the old models (I'm a paid subscriber).
        
         | hi wrote:
         | > "o1 models are currently in beta - The o1 models are
         | currently in beta with limited features. Access is limited to
         | developers in tier 5 (check your usage tier here), with low
         | rate limits (20 RPM). We are working on adding more features,
         | increasing rate limits, and expanding access to more developers
         | in the coming weeks!"
         | 
         | https://platform.openai.com/docs/guides/rate-limits/usage-ti...
        
           | p1esk wrote:
           | I'm talking about web interface, not API. Should be available
           | now, since they said "immediate release".
        
             | hi wrote:
             | https://chatgpt.com/?model=o1-preview --> defaults back to
             | 4o
        
               | MillionOClock wrote:
               | Same for me here
        
               | zamadatix wrote:
               | It may take a bit to appear in your account (and by a bit
               | I mean I had to fiddle around a while, try logging
               | out/in, etc for a bit) but it appears for me and many
               | others as normal Plus users in the web.
        
           | mewpmewp2 wrote:
           | I have tier 5, but I'm not seeing that model. Also API call
           | gives an error that it doesn't exist or I do not have access.
        
         | benterix wrote:
         | Not yet, neither in the API nor chat.
        
         | rankam wrote:
         | I do - I now have a "More models" option where I can select
         | 01-preview
        
           | cypherpunks01 wrote:
           | I can see it too, I am on the Plus plan and don't think I
           | have any special developer privileges. Selecting that option
           | for me changes the URL to
           | https://chatgpt.com/?model=o1-preview
           | 
           | I tried a fake Monty Hall problem, where the presenter opens
           | a door _before_ the participant picks and is then offered to
           | switch doors, so the probability remains 50% for each door.
           | Previous models have consistently gotten this wrong, because
           | of how many times they 've seen the Monty Hall written where
           | switching doors improves their chance of winning the prize.
           | The chain-of-thought reasoning figured out this modification
           | and after analyzing the conditional probabilities confidently
           | stated: "Answer: It doesn't matter; switching or staying
           | yields the same chance--the participant need not switch
           | doors." Good job.
        
         | chipgap98 wrote:
         | I can't see them yet but they usually roll these things out
         | incrementally
        
         | mickeystreicher wrote:
         | Not yet, it's still not available in the web interface. I think
         | they're rolling it out step by step.
         | 
         | Anyway, the usage limits are pretty ridiculous right now, which
         | makes it even more frustrating.
        
         | tedsanders wrote:
         | They're rolling out gradually over the next few hours. Also be
         | aware there's a _weekly_ rate limit of 30 messages to start.
        
       | rvz wrote:
       | Won't be surprised to see all these hand-picked results and
       | extreme expectations to collapse under scenarios involving highly
       | safety critical and complex demanding tasks requiring a definite
       | focus on detail with lots of awareness, which what they haven't
       | shown yet.
       | 
       | So let's not jump straight into conclusions with these hand-
       | picked scenarios marketed to us and be very skeptical.
       | 
       | Not quite there yet with being able to replace truck drivers and
       | pilots for self-autonomous navigation in transportation,
       | aerospace or even mechanical engineering tasks, but it certainly
       | has the capability in replacing both typical junior and senior
       | software engineers in a world considering to do more with less
       | software engineers needed.
       | 
       | But yet, the race to zero will surely bankrupt millions of
       | startups along the way. Even if the monthly cost of this AI can
       | easily be as much as a Bloomberg terminal to offset the hundreds
       | of billions of dollars thrown into training it and costing the
       | entire earth.
        
         | jazzyjackson wrote:
         | My concern with AI always has been it will outrun the juniors
         | and taper off before replacing folks with 10, 20 years of
         | experience
         | 
         | And as they retire there's no economic incentive to train
         | juniors up, so when the AI starts fucking up the important
         | things there will be no one who actually knows how it works
         | 
         | I've heard this already from amtrak workers, track allocation
         | was automated a long time ago, but there used to be people who
         | could recognize when the computer made a mistake, now there's
         | no one who has done the job manually enough to correct it.
        
       | andrewla wrote:
       | This is something that people have toyed with to improve the
       | quality of LLM responses. Often instructing the LLM to "think
       | about" a problem before giving the answer will greatly improve
       | the quality of response. For example, if you ask it how many
       | letters are in the correctly spelled version of a misspelled
       | word, it will first give the correct spelling, and then the
       | number (which is often correct). But if you instruct it to only
       | give the number the accuracy is greatly reduced.
       | 
       | I like the idea too that they turbocharged it by taking the
       | limits off during the "thinking" state -- so if an LLM wants to
       | think about horrible racist things or how to build bombs or other
       | things that RLHF filters out that's fine so long as it isn't
       | reflected in the final answer.
        
       | jazzyjackson wrote:
       | Dang, I just payed out for Kagi Assistant.
       | 
       | Using Claude 3 Opus I noticed it performs <thinking> and <result>
       | while browsing the web for me. I don't guess that's a change in
       | the model for doing reasoning.
        
       | orbital-decay wrote:
       | Wait, are they comparing 4o without CoT and o1 with built-in CoT?
        
         | persedes wrote:
         | yeah was wondering what 4o with a CoT in the prompt would look
         | like.
        
       | kickofline wrote:
       | LLM performance, recently, seemingly hit the top of the S-curve.
       | It remains to be seen if this is the next leap forward or just
       | the rest of that curve.
        
       | billconan wrote:
       | I will pay if O1 can become my college level math tutor.
        
         | seydor wrote:
         | Looking at the full chain of thought , it involves a lot of
         | backtracking and even hallucination.
         | 
         | It will be like a math teacher that is perpetually drunk and on
         | speed
        
           | lupire wrote:
           | That's Paul Erdos
        
       | cyanf wrote:
       | > 30 messages per week
        
       | djoldman wrote:
       | > THERE ARE THREE R'S IN STRAWBERRY
       | 
       | Ha! This is a nice easteregg.
        
         | vessenes wrote:
         | I appreciated that, too! FWIW, I could get Claude 3.5 to tell
         | me how many rs a python program would tell you there are in
         | strawberry. It didn't like it, though.
        
           | mewpmewp2 wrote:
           | I was able to get GPT-4o to calculate characters properly
           | using following prompt:
           | 
           | """ how many R's are in strawberry?
           | 
           | use the following method to calculate - for example Os in
           | Brocolli.
           | 
           | B - 0
           | 
           | R - 0
           | 
           | O - 1
           | 
           | C - 1
           | 
           | O - 2
           | 
           | L - 2
           | 
           | L - 2
           | 
           | I - 2
           | 
           | Where you keep track after each time you find one character
           | by character
           | 
           | """
           | 
           | And also later I asked it to only provide a number if the
           | count increased.
           | 
           | This also worked well with longer sentences.
        
             | zamadatix wrote:
             | At that point just ask it "Use python to count the number
             | of O's in Broccoli". At least then it's still the one
             | figuring out the "smarts" needed to solve the problem
             | instead of being pure execution.
        
               | mewpmewp2 wrote:
               | Do you think you'll have python always available when you
               | go to the store and need to calculate how much change you
               | should get?
        
               | zamadatix wrote:
               | I'm not sure if your making a joke about the teachers who
               | used to say "you won't have a calculator in your pocket"
               | and now we have cell phones or are not aware that ChatGPT
               | runs the generated Python for you in a built in
               | environment as part of the response. I lean towards the
               | former but in case anyone else strolling by hasn't tried
               | this before:
               | 
               | User: Use python to count the number of O's in Broccoli
               | 
               | ChatGPT: Analyzing... The word "Broccoli" contains 2
               | 'O's. <button to show code>
               | 
               | User: Use python to multiply that by the square root of
               | 20424.2332423
               | 
               | ChatGPT: Analyzing... The result of multiplying the
               | number of 'O's in "Broccoli" by the square root of
               | 20424.2332423 is approximately 285.83.
        
               | mewpmewp2 wrote:
               | Yes, the former, trying to satirize cases where people
               | are testing LLMs capabilities by its ability to count
               | characters in a word, do mathematical operations token by
               | token or otherwise. Because LLM is seeing hieroglyphs
               | compared to character by character words that we are
               | seeing. The true test is its ability to solve those
               | problems using tools like somebody is using a calculator.
               | And while it is good to learn and be good at math, it's
               | not because of counting how much change you should
               | receive when buying something. It's to figure out how
               | reasoning works or how to reason in the first place.
        
         | adverbly wrote:
         | I also gave this a chuckle.
         | 
         | Background: https://www.inc.com/kit-eaton/how-many-rs-in-
         | strawberry-this...
        
         | DonHopkins wrote:
         | Therefore there are four R's in STRAWBERRIER, and five R'S in
         | STRAWBERRIEST!
        
       | yunohn wrote:
       | The generated chain of thought for their example is _incredibly_
       | long! The style is kind of similar to how a human might reason,
       | but it 's also redundant and messy at various points. I hope
       | future models will be able to optimize this further, otherwise
       | it'll lead to exponential increases in cost.
        
         | tines wrote:
         | I know my thoughts are never redundant or messy, that's for
         | sure.
        
           | yunohn wrote:
           | Fair enough, but you're a human - not an AI which costs
           | massive GPU hours.
        
       | flockonus wrote:
       | Are we ready yet to admit Turing test has been passed?
        
         | paxys wrote:
         | The Turing Test (which involves fooling a human into thinking
         | they are talking to another human rather than a computer) has
         | been routinely passed by very rudimentary "AI" since as early
         | as 1991. It has no relevance today.
        
           | adverbly wrote:
           | This is only true for some situations. In some test
           | conditions it has not been passed. I can't remember the exact
           | name, but there used to be a competition where PhD level
           | participants blindly chat for several minutes with each other
           | and are incentivized to discover who is a bot and who is a
           | human. I can't remember if they still run it, but that bar
           | has never been passed from what I recall.
        
         | rvz wrote:
         | LLMs have already beaten the Turing test. It's useless to use
         | it when OpenAI and others are aiming for 'AGI'.
         | 
         | So you need a new Turing test adapted for AGI or a totally
         | different one to test for AGI rather than the standard obsolete
         | Turing test.
        
           | riku_iki wrote:
           | > LLMs have already beaten the Turing test.
           | 
           | I am wondering where this happened? In some limited scope?
           | Because if you plug LLM into some call center role for
           | example, it will fall apart pretty quickly.
        
         | TillE wrote:
         | Extremely basic agency would be required to pass the Turing
         | test as intended.
         | 
         | Like, the ability to ask a new unrelated question without being
         | prompted. Of course you can fake this, but then you're not
         | testing the LLM as an AI, you're testing a dumb system you
         | rigged up to create the appearance of an AI.
        
           | flockonus wrote:
           | > Turing proposed that a human evaluator would judge natural
           | language conversations between a human and a machine designed
           | to generate human-like responses. The evaluator would be
           | aware that one of the two partners in conversation was a
           | machine, and all participants would be separated from one
           | another. The conversation would be limited to a text-only
           | channel, such as a computer keyboard and screen, so the
           | result would not depend on the machine's ability to render
           | words as speech.
           | 
           | I don't see agency mentioned or implied anywhere:
           | https://en.wikipedia.org/wiki/Turing_test
           | 
           | What definition or setup are you taking it from?
        
       | patapong wrote:
       | Very interesting. I guess this is the strawberry model that was
       | rumoured.
       | 
       | I am a bit surprised that this does not beat GPT-4o for personal
       | writing tasks. My expectations would be that a model that is
       | better at one thing is better across the board. But I suppose
       | writing is not a task that generally requires "reasoning steps",
       | and may also be difficult to evaluate objectively.
        
         | markonen wrote:
         | In the performance tests they said they used "consensus among
         | 64 samples" and "re-ranking 1000 samples with a learned scoring
         | function" for the best results.
         | 
         | If they did something similar for these human evaluations,
         | rather than just use the single sample, you could see how that
         | would be horrible for personal writing.
        
           | janalsncm wrote:
           | I don't understand how that is generalizable. I'm not going
           | to be able to train a scoring function for any arbitrary task
           | I need to do. In many cases the problem of ranking _is at
           | least as hard as_ generating a response in the first place.
        
         | afro88 wrote:
         | The solution of the cipher example problem also strongly hints
         | at this: "there are three r's in strawberry"
        
           | patapong wrote:
           | Confirmed by the verge: https://www.theverge.com/2024/9/12/24
           | 242439/openai-o1-model-...
        
         | slashdave wrote:
         | > My expectations would be that a model that is better at one
         | thing is better across the board.
         | 
         | No, it's the opposite. This is simply a function of resources
         | applied during training.
        
           | patapong wrote:
           | To some extent I agree, but until now all of the big jumps
           | (GPT2 -> GPT3 -> GPT4) have meant significant improvements
           | across all tasks. This does not seem to be the case here,
           | this model seems to be vastly stronger on certain tasks but
           | not much of an improvement on other tasks. Maybe we will have
           | to wait for GPT5 for that :)
        
         | nickfromseattle wrote:
         | Maybe math is easier to score and do reinforcement learning on
         | because of it's 'solvability' whereas writing requires human
         | judgement to score?
        
       | Hansenq wrote:
       | Reading through the Chain of Thought for the provided Cipher
       | example (go to the example, click "Show Chain of Thought") is
       | kind of crazy...it literally spells out every thinking step that
       | someone would go through mentally in their head to figure out the
       | cipher (even useless ones like "Hmm"!). It really seems like
       | slowing down and writing down the logic it's using and reasoning
       | over that makes it better at logic, similar to how you're taught
       | to do so in school.
        
         | Jasper_ wrote:
         | > Average:18/2=9
         | 
         | > 9 corresponds to 'i'(9='i')
         | 
         | > But 'i' is 9, so that seems off by 1.
         | 
         | Still seems bad at counting, as ever.
        
           | dymk wrote:
           | The next line is it catching its own mistake, and noting i =
           | 9.
        
           | PoignardAzur wrote:
           | It's interesting that it makes that mistake, but then catches
           | it a few lines later.
           | 
           | A common complaint about LLMs is that once they make a
           | mistake, they will _keep making it_ and write the rest of
           | their completion under the assumption that everything before
           | was correct. Even if they 've been RLHF to take human
           | feedback into account and the human points out the mistake,
           | their answer is "Certainly! Here's the corrected version" and
           | then they write something that makes the same mistake.
           | 
           | So it's interesting that this model does something that
           | _appears_ to be self-correction.
        
         | afro88 wrote:
         | Seeing the "hmmm", "perfect!" etc. one can easily imagine the
         | kind of training data that humans created for this. Being told
         | to literally speak their mind as they work out complex
         | problems.
        
           | seydor wrote:
           | looks a bit like 'code', using keywords 'Hmm',
           | 'Alternatively', 'Perfect'
        
             | thomasahle wrote:
             | Right, these are not mere "filler words", but initialize
             | specific reasoning paths.
        
               | mewpmewp2 wrote:
               | Hmm... you may be onto something here.
        
               | wrs wrote:
               | Interesting.
        
               | j_maffe wrote:
               | Alternatively, these might not be "filler words", but
               | instantiate paths of reasonsing.
        
               | squigz wrote:
               | What a strange comment chain.
        
               | seydor wrote:
               | Hmmm.
        
               | squigz wrote:
               | Interesting.
        
             | legel wrote:
             | As a technical engineer, I've learned the value of starting
             | sentences with "basically", even when I'm facing technical
             | uncertainty. Basically, "basically" forces me to be
             | _simple_.
             | 
             | Being trained to say words like "Alternatively", "But...",
             | "Wait!", "So," ... based on some metric of value in
             | focusing / switching elsewhere / ... is basically
             | brilliant.
        
         | impossiblefork wrote:
         | Even though there's of course no guarantee of people getting
         | these chain of thought traces, or whatever one is to call them,
         | I can imagine these being very useful for people learning
         | competitive mathematics, because it must in fact give the full
         | reasoning, and transformers in themselves aren't really that
         | smart, usually, so it's probably feasible for a person with
         | very normal intellectual abilities to reproduce these traces
         | with practice.
        
         | Salgat wrote:
         | It's interesting how it basically generates a larger sample
         | size to create a regression against. The larger the input, the
         | larger the surface area it can compare against existing
         | training data (implicitly through regression of course).
        
         | crazygringo wrote:
         | Seriously. I actually feel as impressed by the chain of
         | thought, as I was when ChatGPT first came out.
         | 
         | This isn't "just" autocompletion anymore, this is actual step-
         | by-step reasoning full of ideas and dead ends and refinement,
         | just like humans do when solving problems. Even if it is still
         | ultimately being powered by "autocompletion".
         | 
         | But then it makes me wonder about human reasoning, and what if
         | it's similar? Just following basic patterns of "thinking steps"
         | that ultimately aren't any different from "English language
         | grammar steps"?
         | 
         | This is truly making me wonder if LLM's are actually far more
         | powerful than we thought at first, and if it's just a matter of
         | figuring out how to plug them together in the right
         | configurations, like "making them think".
        
           | AndyKelley wrote:
           | You ever see that scene from Westworld? (spoiler)
           | https://www.youtube.com/watch?v=ZnxJRYit44k
        
           | dsign wrote:
           | You are just catching up to this idea, probably after hearing
           | 2^n explanations about why we humans are superiors to <<fill
           | in here our latest creation>>.
           | 
           | I'm not the kind of scientist that can say how good an LLM is
           | for human reasoning, but I know that we humans are very
           | incentivized and kind of good at scaling, composing and
           | perfecting things. If there is money to pay for human effort,
           | we will play God no-problem, and maybe outdo the divine.
           | Which makes me wonder, isn't there any other problem in our
           | bucket list to dump ginormous amounts of effort at... maybe
           | something more worth-while than engineering the thing that
           | will replace Homo Sapiens?
        
           | Nadya wrote:
           | When an AI makes a silly math mistake we say it is bad at
           | math and laugh at how dumb it is. Some people extrapolate
           | this to "they'll never get any better and will always be a
           | dumb toy that gets things wrong". When I forget to carry a 1
           | when doing a math problem we call it "human error" even if I
           | make that mistake an embarrassing number of times throughout
           | my lifetime.
           | 
           | Do I think LLM's are alive/close to ASI? No. Will they get
           | there? If it's even at all possible - almost certainly one
           | day. Do I think people severely underestimate AI's ability to
           | solve problems while significantly overestimating their own?
           | Absolutely 10,000%.
           | 
           | If there is one thing I've learned from watching the AI
           | discussion over the past 10-20 years its that people have
           | overinflated egos and a crazy amount of hubris.
           | 
           | "Today is the worst that it will ever be." applies to an
           | awful large number of things that people work on creating and
           | improving.
        
           | tsoj wrote:
           | Yeah, humans are very similar. We have intuitive immediate-
           | next-step-suggestions, and then we apply these intuitive next
           | steps, until we find that it lead to a dead end, and then we
           | backtrack.
           | 
           | I always say, the way we used LLMs (so far) is basically like
           | having a human write text only on gut reactions, and without
           | backspace key.
        
           | exe34 wrote:
           | that's my assessment too. there's even a phenomenon I've
           | observed both in others and myself, when thrust into a new
           | field and given a task to complete, we do it to the best of
           | our ability, which is often sod all. so we ape the things
           | we've heard others say, roughly following the right chain of
           | reasoning by luck, and then suddenly say something that in
           | hind sight, with proper training, we realise was incredibly
           | stupid. we autocomplete and then update with rlhf.
           | 
           | we also have a ton of heuristics that trigger a closer look
           | and loading of specific formal reasoning, but by and large,
           | most of our thought process is just auto complete.
        
           | armchairhacker wrote:
           | I think it's similar, although I think it would be more
           | similar if the LLM did the steps in lower layers (not in
           | English), and instead of the end being fed to the start,
           | there would be a big mess of cycles throughout the neural
           | net.
           | 
           | That could be more efficient since the cycles are much
           | smaller, but harder to train.
        
             | vanviegen wrote:
             | It doesn't do the 'thinking' in English (inference is just
             | math), but it does now verbalize intermediate thoughts in
             | English (or whatever the input language is, presumably),
             | just like humans tend to do.
        
           | ActorNightly wrote:
           | Again its not reasoning.
           | 
           | Reasoning would imply that it can figure out stuff without
           | being trained in it.
           | 
           | The chain of thought is basically just a more accurate way to
           | map input to output. But its still a map, i.e forward only.
           | 
           | If an LLM coud reason, you should be able to ask it a
           | question about how to make a bicycle frame from scratch with
           | a small home cnc with limited work area and it should be able
           | to iterate on an analysis of the best way to put it together,
           | using internet to look up available parts and make decisions
           | on optimization.
           | 
           | No LLM can do that or even come close, because there are no
           | real feedback loops, because nobody knows how to train a
           | network like that.
        
         | cowsaymoo wrote:
         | > THERE ARE THREE R'S IN STRAWBERRY
         | 
         | hilarious
        
         | evilfred wrote:
         | which makes it even funnier when the Chain is just... wrong
         | https://x.com/colin_fraser/status/1834336440819614036
        
         | davesque wrote:
         | Yes and apparently we won't have access to that chain of
         | thought in the release version:
         | 
         | "after weighing multiple factors including user experience,
         | competitive advantage, and the option to pursue the chain of
         | thought monitoring, we have decided not to show the raw chains
         | of thought to users"
        
       | k2xl wrote:
       | Pricing page updated for O1 API costs.
       | 
       | https://openai.com/api/pricing/
       | 
       | $15.00 / 1M input tokens $60.00 / 1M output tokens
       | 
       | For o1 preview
       | 
       | Approx 3x the price of gpt4o.
       | 
       | o1-mini $3.00 / 1M input tokens $12.00 / 1M output tokens
       | 
       | About 60% of the cost of gpt4o. Much more expensive than
       | gpt4o-mini.
       | 
       | Curious on the performance/tokens per second for these new
       | massive models.
        
         | logicchains wrote:
         | I guess they'd also charge for the chain of thought tokens, of
         | which there may be many, even if users can't see them.
        
           | fraboniface wrote:
           | That would be very bad product design. My understanding is
           | that the model itself is similar to GPT4o in architecture but
           | trained and used differently. So the 5x relative increase in
           | output token cost likely already accounts for hidden tokens
           | and additional compute.
        
             | natrys wrote:
             | > While reasoning tokens are not visible via the API, they
             | still occupy space in the model's context window and are
             | billed as output tokens.
             | 
             | https://platform.openai.com/docs/guides/reasoning
             | 
             | So yeah, it is in fact very bad product design. I hope
             | Llama catches up in a couple of months.
        
       | hi wrote:
       | BUG: https://openai.com/index/reasoning-in-gpt/
       | 
       | > o1 models are currently in beta - The o1 models are currently
       | in beta with limited features. Access is limited to developers in
       | tier 5 (check your usage tier here), with low rate limits (20
       | RPM). We are working on adding more features, increasing rate
       | limits, and expanding access to more developers in the coming
       | weeks!
       | 
       | https://platform.openai.com/docs/guides/reasoning/reasoning
        
         | cptcobalt wrote:
         | I'm in Tier 4, and not far off from Tier 5. The docs aren't
         | quite transparent enough to show that if I buy credits if I'll
         | be bumped up to Tier 5, or if I actually have to use enough
         | credits to get into Tier 5.
         | 
         | Edit, w/ real time follow up:
         | 
         | Prior to buying the credits, I saw O1-preview in the Tier 5
         | model list as a Tier 4 user. I bought credits to bump to Tier 5
         | --not much, I'd have gotten there before the end of the year.
         | The OpenAI website now shows I'm in Tier 5, but O1-preview is
         | not in the Tier 5 model list for me anymore. So sneaky of them!
        
           | hi wrote:
           | https://news.ycombinator.com/item?id=41523070#41523525
        
       | paxys wrote:
       | 2018 - gpt1
       | 
       | 2019 - gpt2
       | 
       | 2020 - gpt3
       | 
       | 2022 - gpt3.5
       | 
       | 2023 - gpt4
       | 
       | 2023 - gpt4-turbo
       | 
       | 2024 - gpt-4o
       | 
       | 2024 - o1
       | 
       | Did OpenAI hire Google's product marketing team in recent years?
        
         | Infinity315 wrote:
         | No, this is just how Microsoft names things.
        
           | logicchains wrote:
           | We'll know the Microsoft takeover is complete when OpenAI
           | release Ai.net.
        
             | randomdata wrote:
             | GPT# forthcoming. You heard it here first.
        
         | adverbly wrote:
         | Makes sense to me actually. This is a different product. It
         | doesn't respond instantly.
         | 
         | It fundamentally makes sense to separate these two products in
         | the AI space. There will obviously be a speed vs quality trade-
         | off with a variety of products across the spectrum over time.
         | LLMs respond way too fast to actually be expected to produce
         | the maximum possible quality of a response to complex queries.
        
         | ilaksh wrote:
         | One of them would have been named gpt-5, but people forget what
         | an absolute panic there was about gpt-5 for quite a few people.
         | That caused Altman to reassure people they would not release
         | 'gpt-5' any time soon.
         | 
         | The funny thing is, after a certain amount of time, the gpt-5
         | panic eventually morphed into people basically begging for
         | gpt-5. But he already said he wouldn't release something called
         | 'gpt-5'.
         | 
         | Another funny thing is, just because he didn't name any of them
         | 'gpt-5', everyone assumes that there is something called
         | 'gpt-5' that has been in the works and still is not released.
        
           | zamadatix wrote:
           | This doesn't feel like GPT-5, the training data cutoff is Oct
           | 2023 which is the same as the other GPT-4 models and it
           | doesn't seem particularly "larger" as much as "runs
           | differently". Of course it's all speculation one way or the
           | other.
        
         | randomdata wrote:
         | They partnered with Microsoft, remember?
         | 
         | 1985 - Windows 1.0
         | 
         | 1987 - Windows 2.0
         | 
         | 1990 - Windows 3.0
         | 
         | 1992 - Windows 3.1
         | 
         | 1995 - Windows 95
         | 
         | 1998 - Windows 98
         | 
         | 2000 - Windows ME (Millennium Edition)
         | 
         | 2001 - Windows XP
         | 
         | 2006 - Windows Vista
         | 
         | 2009 - Windows 7
         | 
         | 2012 - Windows 8
         | 
         | 2013 - Windows 8.1
         | 
         | 2015 - Windows 10
         | 
         | 2021 - Windows 11
        
           | oblio wrote:
           | Why did you have to pick on Windows? :-(
           | 
           | If you want real atrocities, look at Xbox.
        
             | randomdata wrote:
             | Honestly, it is the only Microsoft product I know. Xbox may
             | be a better example, but I know nothing about the Xbox. But
             | I am interested to learn! What is notable about its naming?
        
               | murrain wrote:
               | Xbox
               | 
               | Xbox 360
               | 
               | Xbox One => Xbox One S / Xbox One X
               | 
               | Xbox Series S / Xbox Series X
        
               | oblio wrote:
               | https://computercity.com/consoles/xbox/xbox-consoles-
               | list-in...
               | 
               | No real chronology, Xbox One is basically the third
               | version. Then Xbox One X and Xbox Series X. Everything is
               | atrocious about the naming.
        
               | randomdata wrote:
               | Got it! If we're picking favourites, though, I still like
               | Windows as it, like GPT, starts with reasonably sensible
               | names and then goes completely off the rails.
        
         | drexlspivey wrote:
         | 1998 - Half-Life
         | 
         | 1999 - Half-Life: Opposing Force
         | 
         | 2001 - Half-Life: Blue Shift
         | 
         | 2001 - Half-Life: Decay
         | 
         | 2004 - Half-Life: Source
         | 
         | 2004 - Half-Life 2
         | 
         | 2004 - Half-Life 2: Deathmatch
         | 
         | 2005 - Half-Life 2: Lost Coast
         | 
         | 2006 - Half-Life Deathmatch: Source
         | 
         | 2006 - Half-Life 2: Episode One
         | 
         | 2007 - Half-Life 2: Episode Two
         | 
         | 2020 - Half-Life: Alyx
        
         | CamperBob2 wrote:
         | They signed a cross-licensing deal with the USB Consortium.
        
         | trash_cat wrote:
         | It's not that bad...It's quite easy to follow and understand.
        
       | aktuel wrote:
       | If I pay for the chain of thought, I want to see the chain of
       | thought. Simple. How would I know if it happened at all? Trust
       | OpenAI? LOL
        
         | baq wrote:
         | Easy solution - don't pay!
        
         | 93po wrote:
         | how do you know it isn't some guy typing responses to you when
         | you use openAI?
        
       | asadm wrote:
       | I am not up-to-speed on CoT side but is this similar to how
       | perplexity does it ie.
       | 
       | - generate a plan - execute the steps in plan (search internet,
       | program this part, see if it is compilable)
       | 
       | each step is a separate gpt inference with added context from
       | previous steps.
       | 
       | is O1 same? or does it do all this in a single inference run?
        
         | seydor wrote:
         | that is the summary of the task it presents to the user. The
         | full chain of thought seems more mechanistic
        
         | golol wrote:
         | There is a hige difference which is that thex use reinforcement
         | learning to make the model use the Chain-of-Thought better.
        
       | bbstats wrote:
       | Finally, a Claude competitor!
        
       | MrRobotics wrote:
       | This is the sort of reasoning needed to solve the ARC AGI
       | benchmark.
        
       | gradus_ad wrote:
       | Interesting sequence from the Cipher CoT:
       | 
       | Third pair: 'dn' to 'i'
       | 
       | 'd'=4, 'n'=14
       | 
       | Sum:4+14=18
       | 
       | Average:18/2=9
       | 
       | 9 corresponds to 'i'(9='i')
       | 
       | But 'i' is 9, so that seems off by 1.
       | 
       | So perhaps we need to think carefully about letters.
       | 
       | Wait, 18/2=9, 9 corresponds to 'I'
       | 
       | So this works.
       | 
       | -----
       | 
       | This looks like recovery from a hallucination. Is it realistic to
       | expect CoT to be able to recover from hallucinations this
       | quickly?
        
         | trash_cat wrote:
         | How do you mean quickly? It probably will take a while for it
         | to output the final answer as it needs to re-prompt itself. It
         | won't be as fast as 4o.
        
         | bigyikes wrote:
         | 4o could already recover from hallucination in a limited
         | capacity.
         | 
         | I've seen it, mid-reply say things like "Actually, that's
         | wrong, let me try again."
        
         | NightlyDev wrote:
         | Did it hallucinate? I haven't looked at it, but lowercase i and
         | uppercase i is not the same number if you're getting the number
         | from ascii
        
         | machiaweliczny wrote:
         | In general if hallucination ratio is 2% can't it be reduced to
         | 0.04% by running twice or sth like this. I think they should
         | try establishing the facts from different angles and this
         | probably would work fine to minimize hallucinations. But if
         | this was that simple somebody would already do it...
        
       | bn-l wrote:
       | > Unless otherwise specified, we evaluated o1 on the maximal
       | test-time compute setting.
       | 
       | Maximal test time is the maximum amount of time spent doing the
       | "Chain of Thought" "reasoning". So that's what these results are
       | based on.
       | 
       | The caveat is that in the graphs they show that for each increase
       | in test-time performance, the (wall) time / compute goes up
       | _exponentially_.
       | 
       | So there is a potentially interesting play here. They can
       | honestly boast these amazing results (it's the same model after
       | all) yet the actual product may have a lower order of magnitude
       | of "test-time" and not be as good.
        
         | logicchains wrote:
         | Surprising that at run time it needs an exponential increase in
         | thinking to achieved a linear increase in output quality. I
         | suppose it's due to diminishing returns to adding more and more
         | thought.
        
           | HarHarVeryFunny wrote:
           | The exponential increase is presumably because of the
           | branching factor of the tree of thoughts. Think of a binary
           | tree who's number of leaf nodes doubles (= exponential
           | growth) at each level.
           | 
           | It's not too surprising that the corresponding increase in
           | quality is only linear - how much difference in quality would
           | you expect between the best, say, 10 word answer to a
           | question, and the best 11 word answer ?
           | 
           | It'll be interesting to see what they charge for this. An
           | exponential increase in thinking time means an exponential
           | increase in FLOPs/dollars.
        
         | alwa wrote:
         | I interpreted it to suggest that the product might include a
         | user-facing "maximum test time" knob.
         | 
         | Generating problem sets for kids? You might only need or want a
         | basic level of introspection, even though you like the flavor
         | of this model's personality over that of its predecessors.
         | 
         | Problem worth thinking long, hard, and expensively about? Turn
         | that knob up to 11, and you'll get a better-quality answer with
         | no human-in-the-loop coaching or trial-and-error involved.
         | You'll just get your answer in timeframes closer to human ones,
         | consuming more (metered) tokens along the way.
        
           | mrdmnd wrote:
           | Yeah, I think this is the goal - remember; there are some
           | problems that only need to be solved correctly once! Imagine
           | something like a millennium problem - you'd be willing to
           | wait a pretty long time for a proof of the RH!
        
         | bluecoconut wrote:
         | This power law behavior of test-time improvement seems to be
         | pretty ubiquitous now. In more agents is all you need [1], they
         | start to see this as a function of ensemble size. It also shows
         | up in: Large Language Monkeys: Scaling Inference Compute with
         | Repeated Sampling [2]
         | 
         | I sorta wish everyone would plot their y-axis with logit
         | y-axis, rather than 0->100 accuracy (including the openai
         | post), to help show the power-law behavior. This is especially
         | important when talking about incremental gains in the ~90->95,
         | 95->99%. When the values (like the open ai post) are between
         | 20->80, logit and linear look pretty similar, so you can "see"
         | the inference power-law
         | 
         | [1] https://arxiv.org/abs/2402.05120 [2]
         | https://arxiv.org/abs/2407.21787
        
       | HPMOR wrote:
       | A near perfect on AMC 12, 1900 CodeForces ELO, and silver medal
       | IOI competitor. In two years, we'll have models that could easily
       | win IMO and IOI. This is __incredible__!!
        
         | vjerancrnjak wrote:
         | It depends on what they mean by "simulation". It sounds like o1
         | did not participate in new contests with new problems.
         | 
         | Any previous success of models with code generation focus was
         | easily discovered to be a copy-paste of a solution in the
         | dataset.
         | 
         | We could argue that there is an improvement in "understanding"
         | if the code recall is vastly more efficient.
        
         | lupire wrote:
         | Near perfect AIME, not just AMC12.
         | 
         | But each solve costs far more time and energy than a competent
         | human takes.
        
       | islewis wrote:
       | My first interpretation of this is that it's jazzed-up Chain-Of-
       | Thought. The results look pretty promising, but i'm most
       | interested in this:
       | 
       | > Therefore, after weighing multiple factors including user
       | experience, competitive advantage, and the option to pursue the
       | chain of thought monitoring, we have decided not to show the raw
       | chains of thought to users.
       | 
       | Mentioning competitive advantage here signals to me that OpenAI
       | believes there moat is evaporating. Past the business context, my
       | gut reaction is this negatively impacts model usability, but i'm
       | having a hard time putting my finger on why.
        
         | logicchains wrote:
         | >my gut reaction is this negatively impacts model usability,
         | but i'm having a hard time putting my finger on why.
         | 
         | If the model outputs an incorrect answer due to a single
         | mistake/incorrect assumption in reasoning, the user has no way
         | to correct it as it can't see the reasoning so can't see where
         | the mistake was.
        
           | accrual wrote:
           | Maybe CriticGPT could be used here [0]. Have the CoT model
           | produce a result, and either automatically or upon user
           | request, ask CriticGPT to review the hidden CoT and feed the
           | critique into the next response. This way the error can
           | (hopefully) be spotted and corrected without revealing the
           | whole process to the user.
           | 
           | [0] https://openai.com/index/finding-gpt4s-mistakes-with-
           | gpt-4/
           | 
           | Day dreaming: imagine if this architecture takes off and the
           | AI "thought process" becomes hidden and private much like
           | human thoughts. I wonder then if a future robot's inner
           | dialog could be subpoenaed in court, connected to some
           | special debugger, and have their "thoughts" read out loud in
           | court to determine why it acted in some way.
        
         | thomasahle wrote:
         | > my gut reaction is this negatively impacts model usability,
         | but i'm having a hard time putting my finger on why.
         | 
         | This will make it harder for things like DSPy to work, which
         | rely using "good" CoT examples as few-shot examples.
        
         | m3kw9 wrote:
         | The moat is expanding from use count, also the moat is to lead
         | and advance faster than anyone can catch up, you will always
         | have the best mode with the best infrastructure and low limits.
        
       | fnord77 wrote:
       | > Available starting 9.12
       | 
       | I don't see it
        
         | airstrike wrote:
         | Only for those accounts in Tier 5 (or above, if they exist)
         | 
         | Unfortunately you and I don't have enough operating thetans yet
        
         | tedsanders wrote:
         | In ChatGPT, it's rolling out to Plus users gradually over the
         | next few hours.
         | 
         | In API, it's limited to tier 5 customers (aka $1000+ spent on
         | the API in the past).
        
       | adverbly wrote:
       | Incredible results. This is actually groundbreaking assuming that
       | they followed proper testing procedures here and didn't let test
       | data leak into the training set.
        
       | ARandumGuy wrote:
       | One thing that makes me skeptical is the lack of specific labels
       | on the first two accuracy graphs. They just say it's a "log
       | scale", without giving even a ballpark on the amount of time it
       | took.
       | 
       | Did the 80% accuracy test results take 10 seconds of compute? 10
       | minutes? 10 hours? 10 days? It's impossible to say with the data
       | they've given us.
       | 
       | The coding section indicates "ten hours to solve six challenging
       | algorithmic problems", but it's not clear to me if that's tied to
       | the graphs at the beginning of the article.
       | 
       | The article contains a lot of facts and figures, which is good!
       | But it doesn't inspire confidence that the authors chose to
       | obfuscate the data in the first two graphs in the article. Maybe
       | I'm wrong, but this reads a lot like they're cherry picking the
       | data that makes them look good, while hiding the data that
       | doesn't look very good.
        
         | wmf wrote:
         | People have been celebrating the fact that tokens got 100x
         | cheaper and now here's a new system that will use 100x more
         | tokens.
        
           | cowpig wrote:
           | Isn't that part of the point?
        
           | jsheard wrote:
           | Also you now have to pay for tokens you can't see, and just
           | have to trust that OpenAI is using them economically.
        
             | brookst wrote:
             | Token count was always an approximation of value. This may
             | help break that silly idea.
        
               | regularfry wrote:
               | I don't think it's much good as an approximation of
               | value, but it seems ok as an approximation of cost.
        
           | seydor wrote:
           | If it 's reasoning correctly, it shouldnt need a lot of
           | tokens because you don't need to correct it.
           | 
           | You only need to ask it to solve nuclear fusion once.
        
             | msp26 wrote:
             | Have you seen how long the CoT was for the example. It's
             | incredibly verbose.
        
               | slt2021 wrote:
               | I find there is an educational benefit in verbosity, it
               | helps to teach user to think like a machine
        
               | legel wrote:
               | Which is why it is incredibly depressing that OpenAI will
               | _not_ publish the raw chain of thought.
               | 
               | "Therefore, after weighing multiple factors including
               | user experience, competitive advantage, and the option to
               | pursue the chain of thought monitoring, we have decided
               | not to show the raw chains of thought to users. We
               | acknowledge this decision has disadvantages. We strive to
               | partially make up for it by teaching the model to
               | reproduce any useful ideas from the chain of thought in
               | the answer. For the o1 model series we show a model-
               | generated summary of the chain of thought."
        
               | slt2021 wrote:
               | maybe they will enable to show CoT for a limited uses,
               | like 5 prompts a day for Premium users, or for Enterprise
               | users with agreement not to steal CoT or something like
               | that.
               | 
               | if OpenAI sees this - please allow users to see CoT for a
               | few prompts per day, or add it to Azure OpenAI for
               | Enterprise customers with legal clauses not to steal CoT
        
             | from-nibly wrote:
             | As someone experienced with operations / technical debt /
             | weird company specific nonsense (Platform Engineer). No,
             | you have to solve nuclear fusion at <insert-my-company>.
             | You gotta do it over and over again. If it were that simple
             | we wouldn't have even needed AI we would have hand written
             | a few things, and then everything would have been legos,
             | and legos of legos, but it takes a LONG time to find new
             | true legos.
        
               | outofpaper wrote:
               | I'm pretty sure everything is Lego and Legos of Legos.
               | 
               | You show me something new and I say look down at who's
               | shoulders we're standing on, what libraries we've build
               | with.
        
             | charlescurt123 wrote:
             | with these methods the issue is the log scale of compute.
             | Let's say you ask it to solve fusion. It may be able to
             | solve it but the issue is it's unverifiable WHICH was
             | correct.
             | 
             | So it may generate 10 Billion answers to fusion and only
             | 1-10 are correct.
             | 
             | There would be no way to know which one is correct without
             | first knowing the answer to the question.
             | 
             | This is my main issue with these methods. They assume the
             | future via RL then when it gets it right they mark that.
             | 
             | We should really be looking at methods of percentage it was
             | wrong rather then it was right a single time.
        
               | genewitch wrote:
               | This sounds suspiciously like the reason that quantum
               | compute is not ready for prime-time yet.
        
             | 0x_rs wrote:
             | AlphaFold simulated the structure of over 200 million
             | proteins. Among those, there could be revolutionary ones
             | that could change the medical scientific field forever, or
             | they could all be useless. The reasoning is sound, but
             | that's as far as any such tool can get, and you won't know
             | it until you attempt to implement it in real life. As long
             | as those models are unable to perfectly recreate the laws
             | of the universe to the maximum resolution imaginable and
             | follow them, you won't see an AI model, let alone a LLM,
             | provide anything of the sort.
        
           | esafak wrote:
           | ...while providing a significant advance. That's a good
           | problem.
        
           | mewpmewp2 wrote:
           | Isn't that part of developing a new tech?
        
           | zamadatix wrote:
           | The new thing that can do more at the "ceiling" price doesn't
           | remove your ability to still use the 100x cheaper tokens for
           | the things that were doable on that version.
        
           | digging wrote:
           | That exact pattern is always true of technological advance.
           | Even for a pretty broad definition of technology. I'm not
           | sure if it's perfectly described by the name "induced demand"
           | but it's basically the same thing.
        
         | packetlost wrote:
         | When one axis is on log scale and the other is linear with the
         | plot points appearing linear-ish, doesn't it mean there's a
         | roughly exponential relationship between the two axis?
        
           | ARandumGuy wrote:
           | It'd be more accurate to call it a logarithmic relationship,
           | since compute time is our input variable. Which itself is a
           | bit concerning, as that implies that modest gains in accuracy
           | require exponentially more compute time.
           | 
           | In either case, that still doesn't excuse not labeling your
           | axis. Taking 10 seconds vs 10 days to get 80% accuracy
           | implies radically different things on how developed this
           | technology is, and how viable it is for real world
           | applications.
           | 
           | Which isn't to say a model that takes 10 days to get an 80%
           | accurate result can't be useful. There are absolutely use
           | cases where that could represent a significant improvement on
           | what's currently available. But the fact that they're
           | obfuscating this fairly basic statistic doesn't inspire
           | confidence.
        
             | packetlost wrote:
             | > Which itself is a bit concerning, as that implies that
             | modest gains in accuracy require exponentially more compute
             | time
             | 
             | This is more of what I was getting at. I agree they should
             | label the axis regardless, but I think the scaling
             | relationship is interesting (or rather, concerning) on its
             | own.
        
             | KK7NIL wrote:
             | The absolute time depends on hardware, optimizations, exact
             | model, etc; it's not a very meaningful number to quantify
             | the reinforcement technique they've developed, but it is
             | very useful to estimate their training hardware and other
             | proprietary information.
        
               | j_maffe wrote:
               | It's not about the literally quantity/value, it's about
               | the order of growth of output vs input. Hardware and
               | optimizations don't really change that.
        
               | KK7NIL wrote:
               | Exactly, that's why the absolute computation time doesn't
               | matter, only relative growth, which is exactly what they
               | show.
        
         | jstummbillig wrote:
         | I don't think it's worth any debate. You can simply find out
         | how it does for you, now(-ish, rolling out).
         | 
         | In contrast: Gemini Ultra, the best, non-existent Google Model
         | for the past few month now, that people nonetheless are happy
         | to extrapolate excitement over.
        
         | swatcoder wrote:
         | > Did the 80% accuracy test results take 10 seconds of compute?
         | 10 minutes? 10 hours? 10 days? It's impossible to say with the
         | data they've given us.
         | 
         | The gist of the answer is hiding in plain sight: it took so
         | long, on an exponential cost function, that they couldn't
         | afford to explore any further.
         | 
         | The better their max demonstrated accuracy, the more impressive
         | this report is. So why stop where they did? Why omit actual
         | clock times or some cost proxy for it from the report?
         | Obviously, it's because continuing was impractical and because
         | those times/costs were already so large that they'd unfavorably
         | affect how people respond to this report
        
           | jsheard wrote:
           | See also: them still sitting on Sora seven months after
           | announcing it. They've never given any indication whatsoever
           | of how much compute it uses, so it may be impossible to
           | release in its current state without charging an exorbitant
           | amount of money per generation. We do know from people who
           | have used it that it takes between 10 and 20 minutes to
           | render a shot, but how much hardware is being tied up during
           | that time is a mystery.
        
             | ben_w wrote:
             | Could well be.
             | 
             | It's also entirely possible they are simply sincere about
             | their fear it may be used to influence the upcoming US
             | election.
             | 
             | Plenty of people (me included) are sincerely concerned
             | about the way even mere still image generators can drown
             | out the truth with a flood of good-enough-at-first-glance
             | fiction.
        
               | jsheard wrote:
               | If they were sincere about that concern then they
               | wouldn't build it at all, if it's ever made available to
               | the public then it will eventually be available during an
               | election. It's not like the 2024 US presidential election
               | is the end of history.
        
               | e1g wrote:
               | The risk is not "interfering with the US elections", but
               | "being on the front page of everything as the only AI
               | company interfering with US elections". This would
               | destroy their peacocking around AGI/alignment while
               | raising billions from pension funds.
               | 
               | OpenAI is in a very precarious position. Maybe they could
               | survive that hit in four years, but it would be fatal
               | today. No unforced errors.
        
               | smegger001 wrote:
               | i think the hope is by the next presidential election no
               | one will trust video anymore anyway so the new normal
               | wont be as chaotic as if the dropped in the middle of an
               | already contentious election.
               | 
               | as for not building it at all its a obvious next step in
               | generative ai models that if they don't make it someone
               | else will anyway.
        
               | bamboozled wrote:
               | Even if Kamala wins (praise be to god that she does),
               | those people aren't just going to go away until social
               | media does. Social media is the cause of a lot of the
               | conspiracy theory mania.
               | 
               | So yeah, better to never release the model...even though
               | Elon would in a second if he had it.
        
               | dvfjsdhgfv wrote:
               | But this cat run out of the bag years ago, didn't it?
               | Trump himself is using AI-generated images in his
               | campaign. I'd go even further: the more fake images
               | appear, the faster the society as a whole will learn to
               | distrust anything by default.
        
               | 01HNNWZ0MV43FF wrote:
               | Personally I'm not a fan of accelerationism
        
               | ben_w wrote:
               | Nothing works without trust, none of us is an island.
               | 
               | Everyone has a different opinion on what threshold of
               | capability is important, and what to do about it.
        
               | digging wrote:
               | Doesn't strike me as the kind of principle OpenAI is
               | willing to slow themselves down for, to be honest.
        
               | Atotalnoob wrote:
               | Why did they release this model then?
        
               | ben_w wrote:
               | Their public statements that the only way to safely learn
               | how to deal with the things AI can do, is to show what it
               | can do and get feedback from society:
               | 
               | """We want to successfully navigate massive risks. In
               | confronting these risks, we acknowledge that what seems
               | right in theory often plays out more strangely than
               | expected in practice. We believe we have to continuously
               | learn and adapt by deploying less powerful versions of
               | the technology in order to minimize "one shot to get it
               | right" scenarios.""" - https://openai.com/index/planning-
               | for-agi-and-beyond/
               | 
               | I don't know if they're actually correct, but it at least
               | passes the sniff test for plausibility.
        
             | gloryjulio wrote:
             | Also the the sora videos are proven to be modified ads. We
             | still need to see how it perform first
        
               | MrNeon wrote:
               | > Also the the sora videos are proven to be modified ads
               | 
               | Can't find anything about that, you got a link?
        
               | gloryjulio wrote:
               | https://futurism.com/the-byte/openai-sora-demo https://ol
               | d.reddit.com/r/vfx/comments/1cuj360/turns_out_that...
               | 
               | here is the link. The balloon video had heavy editing
               | involved.
        
         | bjornsing wrote:
         | So now it's a question of how fast the AGI will run? :)
        
           | oblio wrote:
           | It's fine, it will only need to be powered by a black hole to
           | run.
        
             | exe34 wrote:
             | the first one anyway. after that it will find more
             | efficient ways. we did, afterall.
        
               | wahnfrieden wrote:
               | it's not obviously achievable. for instance, we don't
               | have the compute power to simulate cellular organisms of
               | much complexity, and have not found efficiencies to scale
               | that
        
           | HarHarVeryFunny wrote:
           | It's not AGI - it's tree of thoughts, driven by some RL-
           | derived heuristics.
           | 
           | I suppose what this type of approach provides is better
           | prediction/planning by using more of what the model learnt
           | during training, but it doesn't address the model being able
           | to learn anything new.
           | 
           | It'll be interesting to see how this feels/behaves in
           | practice.
        
             | juliend2 wrote:
             | I see this pattern coming where we're still able to say:
             | 
             | "It's not AGI - it's X, driven by Y-driven heuristics",
             | 
             | but that's going to effectively be an AGI if given enough
             | compute/time/data.
             | 
             | Being able to describe the theory of how it's doing its
             | thing sure is reassuring though.
        
         | skywhopper wrote:
         | Yeah, this hiding of the details is a huge red flag to me. Even
         | if it takes 10 days, it's still impressive! But if they're
         | afraid to say that, it tells me they are more concerned about
         | selling the hype than building a quality product.
        
         | bluecoconut wrote:
         | Super hand-waving rough estimate: Going off of five points of
         | reference / examples that sorta all point in the same
         | direction. 1. looks like they scale up by about ~100-200 on the
         | x axis when showing that test time result. 2. Based on the
         | o1-mini post [1], there's an "inference cost" where you can see
         | GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you
         | can extract X values, ive done so below) 3. There's a video
         | showing the "speed" in the chat ui (3s vs. 30s) 4. The pricing
         | page [2] 5. On their API docs about reasoning, they quantify
         | "reasoning tokens" [3]
         | 
         | First, from the original plot, we have roughly 2 orders of
         | magnitude to cover (~100-200x)
         | 
         | Next, from the cost plots: super handwaving guess, but since
         | 5.77 / 0.32 = ~18, and the relative cost for gpt-4o vs
         | gpt-4o-mini is ~20-30, this roughly lines up. This implies that
         | o1 costs ~1000x the cost than gpt-4o-mini for inference (not
         | due to model cost, just due to the raw number of chain of
         | thought tokens it produces). So, my first "statement", is that
         | I trust the "Math performance vs Inference Cost" plot on the
         | o1-mini page to accurately represent "cost" of inference for
         | these benchmark tests. This is now a "cost" relative set of
         | numbers between o1 and 4o models.
         | 
         | I'm also going to make an assumption that o1 is roughly the
         | same size as 4o inherently, and then from that and the SVG,
         | roughly going to estimate that they did a "net" decoding of
         | ~100x for the o1 benchmarks in total. (5.77 vs (354.77 - 635)).
         | 
         | Next, from the CoT examples they gave us, they actually show
         | the CoT preview where (for the math example) it says "...more
         | lines cut off...", A quick copy paste of what they did include
         | includes ~10k tokens (not sure if copy paste is good though..)
         | and from the cipher text example I got ~5k tokens of CoT, while
         | there are only ~800 in the response. So, this implies that
         | there's a ~10x size of response (decoded tokens) in the
         | examples shown. It's possible that these are "middle of the
         | pack" / "average quality" examples, rather than the "full CoT
         | reasoning decoding" that they claim they use. (eg. from the log
         | scale plot, this would come from the middle, essentially 5k or
         | 10k of tokens of chain of thought). This also feels reasonable,
         | given that they show in their API [3] some limits on the
         | "reasoning_tokens" (that they also count)
         | 
         | All together, the CoT examples, pricing page, and reasoning
         | page all imply that reasoning itself can be variable length by
         | about ~100x (2 orders of magnitude), eg. example: 500, 5k (from
         | examples) or up to 65,536 tokens of reasoning output (directly
         | called out as a maximum output token limit).
         | 
         | Taking them on their word that "pass@1" is honest, and they are
         | not doing k-ensembles, then I think the only reasonable thing
         | to assume is that they're decoding their CoT for "longer
         | times". Given the roughly ~128k context size limit for the
         | model, I suspect their "top end" of this plot is ~100k tokens
         | of "chain of thought" self-reflection.
         | 
         | Finally, at around 100 tokens per second (gpt-4o decoding
         | speed), this leaves my guess for their "benchmark" decoding
         | time at the "top-end" to be between ~16 minutes (full 100k
         | decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds
         | on the low end. So for that X axis on the log scale, my
         | estimate would be: ~3-10 seconds as the bottom X, and then
         | 100-200x that value for the highest value.
         | 
         | All together, to answer your question: I think the 80% accuracy
         | result took about ~10-15 minutes to complete. I also believe
         | that the "decoding cost" of o1 model is very close to the
         | decoding cost of 4o, just that it requires many more reasoning
         | tokens to complete. (and then o1-mini is comparable to 4o-mini,
         | but also requiring more reasoning tokens)
         | 
         | [1] https://openai.com/index/openai-o1-mini-advancing-cost-
         | effic...                 Extracting "x values" from the SVG:
         | GPT-4o-mini: 0.3175       GPT-4o: 5.7785       o1: (354.7745,
         | 635)       o1-preview: (278.257, 325.9455)       o1-mini:
         | (66.8655, 147.574)
         | 
         | [2] https://openai.com/api/pricing/                 gpt-4o:
         | $5.00 / 1M input tokens       $15.00 / 1M output tokens
         | o1-preview:       $15.00 / 1M input tokens       $60.00 / 1M
         | output tokens
         | 
         | [3] https://platform.openai.com/docs/guides/reasoning
         | usage: {         total_tokens: 1000,         prompt_tokens:
         | 400,         completion_tokens: 600,
         | completion_tokens_details: {           reasoning_tokens: 500
         | }       }
        
           | bluecoconut wrote:
           | Some other follow up reflections
           | 
           | 1. I wish that Y-axes would switch to be logit instead of
           | linear, to help see power-law scaling on these 0->1 measures.
           | In this case, 20% -> 80% it doesn't really matter, but for
           | other papers (eg. [2] below) it would help see this powerlaw
           | behavior much better.
           | 
           | 2. The power law behavior of inference compute seems to be
           | showing up now in multiple ways. Both in ensembles [1,2], as
           | well as in o1 now. If this is purely on decoding self-
           | reflection tokens, this has a "limit" to its scaling in a
           | way, only as long as the context length. I think this implies
           | (and I am betting) that relying more on multiple parallel
           | decodings is more scalable (when you have a better critic /
           | evaluator).
           | 
           | For now, instead of assuming they're doing any ensemble like
           | top-k or self-critic + retries, the single rollout with
           | increasing token size does seem to roughly match all the
           | numbers, so that's my best bet. I hypothesize we'd see a
           | continued improvement (in the same power-law sort of way,
           | fundamentally along with the x-axis of "flop") if we combined
           | these longer CoT responses, with some ensemble strategy for
           | parallel decoding and then some critic/voting/choice. (which
           | has the benefit of increasing flops (which I believe is the
           | inference power-law), while not necessarily increasing
           | latency)
           | 
           | [1] https://arxiv.org/abs/2402.05120 [2]
           | https://arxiv.org/abs/2407.21787
        
             | bluecoconut wrote:
             | oh, they do talk about it                 On the 2024 AIME
             | exams, GPT-4o only solved on average 12% (1.8/15) of
             | problems. o1 averaged 74% (11.1/15) with a single sample
             | per problem, 83% (12.5/15) with consensus among 64 samples,
             | and 93% (13.9/15) when re-ranking 1000 samples with a
             | learned scoring function. A score of 13.9 places it among
             | the top 500 students nationally and above the cutoff for
             | the USA Mathematical Olympiad.
             | 
             | showing that as they increase the k of ensemble, they can
             | continue to get it higher. All the way up to 93% when using
             | 1000 samples.
        
               | 620gelato wrote:
               | I think I'd be curious to know, if the size of ensemble
               | is another scaling dimension for compute, alongside the
               | "thinking time".
        
         | worstspotgain wrote:
         | I don't think it's hard to compute the following:
         | 
         | - At the high end, there is a likely nonlinear relationship
         | between answer quality and compute.
         | 
         | - We've gotten used to a flat-price model. With AGI-level
         | models, we might have to pay more for more difficult and more
         | important queries. Such is the inherent complexity involved.
         | 
         | - All this stuff will get better and cheaper over time, within
         | reason.
         | 
         | I'd say let's start by celebrating that machine thinking of
         | this quality is possible at all.
        
       | irthomasthomas wrote:
       | This is a prompt engineering saas
        
       | extr wrote:
       | Interesting that the coding win-rate vs GPT-4o was only 10%
       | higher. Very cool but clearly this model isn't as much of a slam
       | dunk as the static benchmarks portray.
       | 
       | However, it does open up an interesting avenue for the future.
       | Could you prompt-cache just the chain-of-thought reasoning bits?
        
         | mewpmewp2 wrote:
         | It's hard to evaluate those win-rates, because if it's slower,
         | people may have been giving easier problems, which both can
         | solve and picked the faster one.
        
       | msp26 wrote:
       | > THERE ARE THREE R'S IN STRAWBERRY
       | 
       | Well played
        
       | impossiblefork wrote:
       | Very nice.
       | 
       | It's nice that people have taken the obvious extra-
       | tokens/internal thoughts approach to a point where it actually
       | works.
       | 
       | If this works, then automated programming etc., are going to
       | actually be tractable. It's another world.
        
       | idunnoman1222 wrote:
       | Did you guys use the model? Seems about the same to me
        
       | wewtyflakes wrote:
       | Maybe I missed it, but do the tokens used for internal chain of
       | thought count against the output tokens of the response (priced
       | at spicy level of $60.00 / 1M output tokens)?
        
         | tedsanders wrote:
         | Yes. Chain of thought tokens are billed, so requests to this
         | model can be ~10x the price of gpt-4o, or even more.
        
       | packetlost wrote:
       | lol at the graphs at the top. Logarithmic scaling for
       | test/compute time should make everyone who thinks AGI is possible
       | with this architecture take pause.
        
         | hidelooktropic wrote:
         | I don't see any log scaled graphs.
        
           | packetlost wrote:
           | The two first graphs on the page are labelled as log scale in
           | the time axis, so I don't know what you're looking at but
           | it's definitely there.
        
       | riazrizvi wrote:
       | I'm not surprised there's no comparison to GPT-4. Was 4o a
       | rewrite on lower specced hardware and a more quantized model,
       | where the goal was to reduce costs while trying to maintain
       | functionality? Do we know if that is so? That's my guess. If so
       | is O1 an upgrade in reasoning complexity that also runs on
       | cheaper hardware?
        
         | kgeist wrote:
         | They call GPT4 a legacy model, maybe that's why they don't
         | compare to it.
        
       | airstrike wrote:
       | This model is currently available for those accounts in Tier 5
       | and above, which requires "$1,000 paid [to date] and 30+ days
       | since first successful payment"
       | 
       | More info here: https://platform.openai.com/docs/guides/rate-
       | limits/usage-ti...
        
         | eucalpytus wrote:
         | I didn't know this founder's edition battle pass existed.
        
       | not_pleased wrote:
       | The progress in AI is incredibly depressing, at this point I
       | don't think there's much to look forward to in life.
       | 
       | It's sad that due to unearned hubris and a complete lack of
       | second-order thinking we are automating ourselves out of
       | existence.
       | 
       | EDIT: I understand you guys might not agree with my comments. But
       | don't you thinking that flagging them is going a bit too far?
        
         | mewpmewp2 wrote:
         | It seems opposite to me. Imagine all the amazing technological
         | advancements, etc. If there wasn't something like that what
         | would you be looking forward to? Everything would be what it
         | has already been for years. If this evolves it helps us open so
         | many secrets of the universe.
        
           | not_pleased wrote:
           | >If there wasn't something like that what would you be
           | looking forward to?
           | 
           | First of all, I don't want to be poor. I know many of you are
           | thinking something along the lines of "I am smart, I was
           | doing fine before, so I will definitely continue to in the
           | future".
           | 
           | That's the unearned hubris I was referring to. We got very
           | lucky as programmers, and now the gravy train seems to be
           | coming to an end. And not just for programmers, the other
           | white-collar and creative jobs will suffer too. The artists
           | have already started experiencing the negative effects of AI.
           | 
           | EDIT: I understand you guys might not agree with my comments.
           | But don't you thinking that flagging them is going a bit too
           | far?
        
             | mewpmewp2 wrote:
             | I'm not sure what you are saying exactly? Are you saying we
             | live for the work?
        
               | lambdanil wrote:
               | The way the current system is set up we rely on work to
               | make money. If jobs get automated away, how will we make
               | money then? We aren't ready for a post-work world.
        
               | mewpmewp2 wrote:
               | Then you should have UBI.
        
           | RobertDeNiro wrote:
           | These advancements are there to benefit the top 1%, not the
           | working class.
        
             | mewpmewp2 wrote:
             | That's a governing problem.
        
         | dyauspitr wrote:
         | Eh this makes me very, very excited for the future. I want
         | results, I don't care if they come from humans or AI. That
         | being said we might all be out of jobs soon...
        
         | youssefabdelm wrote:
         | Not at all... they're still so incapable of so much. And even
         | when they do advance, they can be tremendous tools of synthesis
         | and thought at an unparalleled scale.
         | 
         | "A good human plus a machine is the best combination" --
         | Kasparov
        
           | Vecr wrote:
           | It was for a while, look up "centaur" systems, that's the
           | term in chess. Stockfish 17 rolls them every time.
        
         | zamadatix wrote:
         | FWIW people were probably flagging because you're a new/temp
         | accounting jumping to asserting anything other than your view
         | on what's being done "unearned hubris and a complete lack of
         | second-order thinking", not because they don't like hearing
         | your set of concerns about what it might mean.
        
       | itissid wrote:
       | One thing I find generally useful when writing large project code
       | is having a code base and several branches that are different
       | features I developed. I could immediately use parts of a branch
       | to reference the current feature, because there is often overlap.
       | This limits mistakes in large contexts and easy to iterate
       | quickly.
        
       | rfoo wrote:
       | Impressive safety metrics!
       | 
       | I wish OAI include "% Rejections on perfectly safe prompts" in
       | this table, too.
        
         | tedsanders wrote:
         | Table 1 in section 3.1.1:
         | https://assets.ctfassets.net/kftzwdyauwt9/2pON5XTkyX3o1NJmq4...
        
       | RandomLensman wrote:
       | How could it fail to solve some maths problems if it has a method
       | for reasoning through things?
        
         | chairhairair wrote:
         | Simple questions like this are not welcomed by LLM hype
         | sellers.
         | 
         | The word "reasoning" is being used heavily in this
         | announcement, but with an intentional corruption of the normal
         | meaning.
         | 
         | The models are amazing but they are fundamentally not
         | "reasoning" in a way we'd expect a normal human to.
         | 
         | This is not a "distinction without a difference". You still
         | CANNOT rely on the outputs of these models in the same way you
         | can rely on the outputs of simple reasoning.
        
           | exe34 wrote:
           | it depends who's doing the simple reasoning. Richard Feynman?
           | yes. Donald Trump? no.
        
         | logicchains wrote:
         | I have a method for reasoning through things but I'm pretty
         | sure I'd fail some of those tough math problems too.
        
         | HarHarVeryFunny wrote:
         | It's using tree search (tree of thoughts), driven by some RL-
         | derived heuristics controlling what parts of the practically
         | infinite set of potential responses to explore.
         | 
         | How good the responses are will depend on how good these
         | heuristics are.
        
           | RandomLensman wrote:
           | That doesn't sound like a method for reasoning.
        
         | hatthew wrote:
         | Because some steps in its reasoning were wrong
        
           | RandomLensman wrote:
           | I would demand more from machine reasoning, just like we
           | demand an extremely low error rate from machine calculations.
        
       | evrydayhustling wrote:
       | Just did some preliminary testing on decrypting some ROT
       | cyphertext which would have been viable for a human on paper. The
       | output was pretty disappointing: lots of "workish" steps creating
       | letter counts, identifying common words, etc, but many steps were
       | incorrect or not followed up on. In the end, it claimed to check
       | its work and deliver an incorrect solution that did not satisfy
       | the previous steps.
       | 
       | I'm not one to judge AI on pratfalls, and cyphers are a somewhat
       | adversarial task. However, there was no aspect of the reasoning
       | that seemed more advanced or consistent than previous chain-of-
       | thought demos I've seen. So the main proof point we have is the
       | paper, and I'm not sure how I'd go from there to being able to
       | trust this on the kind of task it is intended for. Do others have
       | patterns by which they get utility from chain of thought engines?
       | 
       | Separately, chain of thought outputs really make me long for tool
       | use, because the LLM is often forced to simulate algorithmic
       | outputs. It feels like a commercial chain-of-thought solution
       | like this should have a standard library of functions it can use
       | for 100% reliability on things like letter counts.
        
         | charlescurt123 wrote:
         | It's RL so that means it's going to be great on tasks they
         | created for training but not so much on others.
         | 
         | Impressive but the problem with RL is that it requires
         | knowledge of the future.
        
         | changoplatanero wrote:
         | Hmm, are you sure it was using the o1 model and not gpt4o? I've
         | been using the o1 model and it does consistently well at
         | solving rotation ciphers.
        
           | evrydayhustling wrote:
           | o1-preview . Were you using common plaintexts by chance (e.g.
           | proverbs) or ROT13 specifically? Mine use all the right steps
           | but just can't string them together.
        
             | mewpmewp2 wrote:
             | Could you share your prompt?
        
               | evrydayhustling wrote:
               | Shared one above.
               | 
               | Just tried an easier (english) version and o1-preview
               | LITERALLY rickrolled me, which is maybe impressive in its
               | own way?
               | 
               | UPDATE: Claude crushed it however.
               | 
               | o1-preview Transcript follows.
               | 
               | **
               | 
               | Can you decrypt the following? I don't know the cypher.
               | 
               | mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv
               | crkvi ze czwv
               | 
               | Certainly! The encrypted text is:
               | 
               | ``` mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv
               | crkvi ze czwv ```
               | 
               | When we decrypt it, we get:
               | 
               | ``` never gonna give you up let you down and desert you
               | ```
               | 
               | It appears to be a line from Rick Astley's song "Never
               | Gonna Give You Up."
        
               | mewpmewp2 wrote:
               | That's weird. Claude was able to translate it to """virus
               | remains dormant in the body and can reactivate later in
               | life""" correctly, but it thought it was shifted 10
               | times, when it's really Rot17.
        
               | tstrimple wrote:
               | 4o completely botched the Spanish version, but seems to
               | have nailed this one.
               | 
               | > Without knowing the cipher used to encrypt this
               | message, it's hard to decrypt it directly. However, the
               | structure and letter patterns suggest it might be a
               | simple substitution cipher, possibly Caesar or some
               | variation of it.
               | 
               | > Let me attempt a few common methods to see if I can
               | identify the cipher.
               | 
               | > The most frequent letters in the encrypted message are
               | 'r', 'z', 'v', and 'e'. In English, the most common
               | letters are typically 'E', 'T', 'A', and 'O'. This
               | suggests the cipher could be a simple substitution or a
               | Caesar cipher.
               | 
               | > I'll now attempt to apply a Caesar cipher decryption by
               | shifting the alphabet through all possible rotations to
               | see if we get readable text.
               | 
               | > One of the decrypted messages, with a shift of 17,
               | reads:
               | 
               | > *"virus remains dormant in the body and can reactivate
               | later in life"*
               | 
               | > It seems like the correct decryption, revealing
               | information about a virus that can remain dormant in the
               | body.
        
               | omnee wrote:
               | If O1 did rickroll you _deliberately_ , then it would
               | indeed more impressive than solving ciphertexts, and I'd
               | start preparing to bow down to our AGI overlords :)
        
               | evrydayhustling wrote:
               | Definitely. A teammate pointed out Reddit posts used in
               | training as a probable cause :)
        
             | changoplatanero wrote:
             | It just solved this one for me "Can you solve this cipher?
             | droboscxylecsxoccvsuocryglecsxocc"
        
           | mewpmewp2 wrote:
           | Does it do better than Claude, because Claude (3.5 sonnet)
           | handled ROTs perfectly and was able to also respond in ROT.
        
             | evrydayhustling wrote:
             | Just tried, no joy from Claude either:
             | 
             | Can you decrypt the following? I don't know the cypher, but
             | the plaintext is Spanish.
             | 
             | YRP CFTLIR VE UVDRJZRUF JREZURU, P CF DRJ CFTLIR UV KFUF VJ
             | HLV MVI TFJRJ TFDF JFE VE MVQ UV TFDF UVSVE JVI
        
               | mewpmewp2 wrote:
               | Interesting, it was able to guess it's Rot 17, but it
               | translated it wrong, although "HAY" and some other words
               | were correct.
               | 
               | I've tried only in English so far though.
               | 
               | It told me it's 17, and "HAY GENTE MU DIFERENTE LECTURA,
               | A LO MUY GENTE DE TODO ES QUE VER COSAS COMO SON EN VEZ
               | DE COMO DEBEN SER"
               | 
               | although it really should be "HAY LOCURA EN DEMASIADO
               | SANIDAD, Y LO MAS LOCURA DE TODO ES QUE VER COSAS COMO
               | SON EN VEZ DE COMO DEBEN SER"
        
               | evrydayhustling wrote:
               | Claude made similar mistakes of generating decryption
               | that was similar to plaintext but with stuff mixed in. I
               | suspect my version of the quote (Miguel de Cervantes) is
               | an apocryphal translation, and there's some utility well
               | on both models to pull it towards the real one. With that
               | said, I did not see o1-preview get as close as you did.
        
               | mewpmewp2 wrote:
               | For testing I think it's better to use uncommon sentences
               | and also start with English first, if it can solve that,
               | then try other languages.
        
               | ianbutler wrote:
               | HAY LOCURA EN DEMASIADO SANIDAD, Y LO MAS LOCURA DE TODO
               | ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER
               | 
               | Is that correct? I don't know anything but basic Spanish.
               | All I did was:
               | 
               | "The plaintext is in Spanish but I don't know anything
               | else, solve this and explain your reasoning as you go
               | step by step."
        
               | mewpmewp2 wrote:
               | That's correct. I got o1-preview myself finally now. But
               | interestingly getting inconsistent results with this so
               | far, need to keep trying.
        
               | kenjackson wrote:
               | I just tried it with O1 model and it said it couldn't
               | decipher it. It told me what to try, but said it doesn't
               | have the time to do so. Kind of an unusual response.
        
               | hmottestad wrote:
               | The chain of thought does seem to take quite a long time,
               | so maybe there is a new mechanism for reducing the amount
               | of load on the servers by estimating the amount of
               | reasoning effort needed to solve a problem and weighing
               | that against the current pressure on the servers.
        
               | oktoberpaard wrote:
               | I got this response from o1-mini with the exact same
               | prompt:
               | 
               | Claro, he descifrado el texto utilizando un cifrado Cesar
               | con un desplazamiento de 9 posiciones. Aqui esta el texto
               | original y su correspondiente traduccion:
               | 
               | *Texto Cifrado:* ``` YRP CFTLIR VE UVDRJZRUF JREZURU, P
               | CF DRJ CFTLIR UV KFUF VJ HLV MVI TFJRJ TFDF JFE VE MVQ UV
               | TFDF UVSVE JVI ```
               | 
               | *Texto Descifrado:* ``` HAY LOCURA EN DEMASADO SANIDAD, Y
               | LO MAS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE
               | COMO DEBEN SER ```
               | 
               | *Traduccion al Ingles:* ``` THERE IS MADNESS IN OVERLY
               | HEALTH, AND THE MOST MADNESS OF ALL IS TO SEE THINGS AS
               | THEY ARE INSTEAD OF AS THEY SHOULD BE ```
               | 
               | Este descifrado asume que se utilizo un cifrado Cesar con
               | un desplazamiento de +9. Si necesitas mas ayuda o una
               | explicacion detallada del proceso de descifrado, no dudes
               | en decirmelo.
               | 
               | Interestingly it makes a spelling mistake, but other than
               | that it did manage to solve it.
        
               | doo_daa wrote:
               | o1-preview gave me this...
               | 
               | Final Decrypted Message:
               | 
               | "Por ejemplo te agradecere, y te doy ejemplo de que lo
               | que lees es mi ejemplo"
               | 
               | English Translation:
               | 
               | "For example, I will thank you, and I give you an example
               | of what you read is my example."
               | 
               | ... initially it gave up and asked if I knew what type of
               | cypher had been used. I said I thought it was a simple
               | substitution.
        
               | sureglymop wrote:
               | Why did it add the accents on to e (e)? Surely that
               | wasn't part of it and it actually "thought a bit too
               | far"?
        
           | J_cst wrote:
           | On my machine just works with 4o
           | 
           | https://chatgpt.com/share/66e34020-33dc-800d-8ab8-8596895844.
           | ..
           | 
           | With no drama. I'm not sure the bot answer is correct, but
           | looks correct.
        
         | mewpmewp2 wrote:
         | Out of curiousity can you try the same thing with Claude.
         | Because when I tried Claude with any sort of ROT, it had
         | amazing performance, compared to GPT.
        
       | losvedir wrote:
       | I'm confused. Is this the "GPT-5" that was coming in summer, just
       | with a different name? Or is this more like a parallel
       | development doing chain-of-thought type prompt engineering on
       | GPT-4o? Is there still a big new foundational model coming, or is
       | this it?
        
         | mewpmewp2 wrote:
         | It looks like parallel development, it's unclear to me what is
         | going on with GPT-5, don't think it has ever had a predicted
         | release date, and it's not even clear that this would be the
         | name.
        
       | adverbly wrote:
       | > However, o1-preview is not preferred on some natural language
       | tasks, suggesting that it is not well-suited for all use cases.
       | 
       | Fascinating... Personal writing was not preferred vs gpt4, but
       | for math calculations it was... Maybe we're at the point where
       | its getting too smart? There is a depressing related thought here
       | about how we're too stupid to vote for actually smart politicians
       | ;)
        
         | seydor wrote:
         | > for actually smart politicians
         | 
         | We can vote an AI
        
       | trash_cat wrote:
       | I think what it comes down to is accuracy vs speed. OpenAI
       | clearly took steps here to improve the accuracy of the output
       | which is critical in a lot of cases for application. Even if it
       | will take longer, I think this is a good direction. I am a bit
       | skeptical when it comes to the benchmarks - because they can be
       | gamed and they don't always reflect real world scenarios. Let's
       | see how it works when people get to apply it in real life
       | workflows. One last thing, I wish they could elaborate more on
       | >>"We have found that the performance of o1 consistently improves
       | with more reinforcement learning (train-time compute) and with
       | more time spent thinking (test-time compute)."<< Why don't you
       | keep training it for years then to approach 100%? Am I missing
       | something here?
        
       | vessenes wrote:
       | Note that they aren't safety aligning the chain of thought,
       | instead we have "rules for thee and not for me" -- the public
       | models are going to continue have tighter and tighter rules on
       | appropriate prompting, while internal access will have unfettered
       | access. All research (and this paper mentions it as well)
       | indicates human pref training itself lowers quality of results;
       | maybe the most important thing we could be doing is ensuring
       | truly open access to open models over time.
       | 
       | Also, can't wait to try this out.
        
       | csomar wrote:
       | I gave the Crossword puzzle to Claude and got a correct
       | response[1]. The fact that they are comparing this to gpt4o and
       | not to gpt4 suggests that it is less impressive than they are
       | trying to pretend.
       | 
       | [1]:
       | 
       | Based on the given clues, here's the solved crossword puzzle:
       | +---+---+---+---+---+---+ | E | S | C | A | P | E |
       | +---+---+---+---+---+---+ | S | E | A | L | E | R |
       | +---+---+---+---+---+---+ | T | E | R | E | S | A |
       | +---+---+---+---+---+---+ | A | D | E | P | T | S |
       | +---+---+---+---+---+---+ | T | E | P | E | E | E |
       | +---+---+---+---+---+---+ | E | R | R | O | R | S |
       | +---+---+---+---+---+---+ Across:
       | 
       | ESCAPE (Evade) SEALER (One to close envelopes) TERESA (Mother
       | Teresa) ADEPTS (Initiated people) TEPEE (Native American tent)
       | ERRORS (Mistakes)
       | 
       | Down:
       | 
       | ESTATE (Estate car - Station wagon) SEEDER (Automatic planting
       | machine) CAREER (Profession) ALEPPO (Syrian and Turkish pepper
       | variety) PESTER (Annoy) ERASES (Deletes)
        
         | thomasahle wrote:
         | As good as Claude has gotten recently in reasoning, they are
         | likely using RL behind the scenes too. Supposedly,
         | o1/strawberry was initially created as an engine for high-
         | quality synthetic reasoning data for the new model generation.
         | I wonder if Anthropic could release their generator as a usable
         | model too.
        
           | deisteve wrote:
           | while i was initially excited now im having second thoughts
           | after seeing the experiments run by people in the comments
           | here
           | 
           | on X I see a totally different energy more about hyping it
           | 
           | on HN I see reserved and collected take which I trust more.
           | 
           | I do wonder why they chose gpt4o which I never bother to use
           | for coding.
           | 
           | Claude is still king and looks like I won't have to subscribe
           | to ChatGPT Plus seeing it fail on some of the important
           | experiments run by folks on HN
           | 
           | If anything these type of releases that air more on the side
           | of hype given OpenAI's track record
        
       | adverbly wrote:
       | > Therefore, s(x)=p*(x)-x2n+2 We can now write, s(x)=p*(x)-x2n+2
       | 
       | Completely repeated itself... weird... it also says "...more
       | lines cut off..." How many lines I wonder? Would people get
       | charged for these cut off lines? Would have been nice to see how
       | much answer had cost...
        
       | idiliv wrote:
       | In the demo, O1 implements an incorrect version of the "squirrel
       | finder" game?
       | 
       | The instructions state that the squirrel icon should spawn after
       | three seconds, yet it spawns immediately in the first game (also
       | noted by the guy doing the demo).
       | 
       | Edit: I'm referring to the demo video here:
       | https://openai.com/index/introducing-openai-o1-preview/
        
         | Bjorkbat wrote:
         | Yeah, now that you mention it I also see that. It was clearly
         | meant to spawn after 3 seconds. Seems on successive attempts it
         | also doesn't quite wait 3 seconds.
         | 
         | I'm kind of curious if they did a little bit of editing on that
         | one. Almost seems like the time it takes for the squirrel to
         | spawn is random.
        
       | mintone wrote:
       | This video[1] seems to give some insight into what the process
       | actually is, which I believe is also indicated by the output
       | token cost.
       | 
       | Whereas GPT-4o spits out the first answer that comes to mind, o1
       | appears to follow a process closer to coming up with an answer,
       | checking whether it meets the requirements and then revising it.
       | The process of saying to an LLM "are you sure that's right? it
       | looks wrong" and it coming back with "oh yes, of course, here's
       | the right answer" is pretty familiar to most regular users, so
       | seeing it baked into a model is great (and obviously more
       | reflective of self-correcting human thought)
       | 
       | [1] https://vimeo.com/1008704043
        
         | drzzhan wrote:
         | So it's like the coding agent of gpt4. But instead of actually
         | running the script and fix if it gets error, this one check
         | with something similar to "are you sure". Thank for the link.
        
       | tslater2006 wrote:
       | Looking at pricing, its $15 per 1M input tokens, and $60 per 1M
       | output tokens. I assume the CoT tokens count as output (or input
       | even)? If so and it directly affects billing, I'm not sure how I
       | feel about them hiding the CoT prompts. Nothing to stop them from
       | saying "trust me bro, that used 10,000 tokens ok?". Also no way
       | to gauge expected costs if there's a black box you are being
       | charged for.
        
       | cs702 wrote:
       | Before commenting here, please take 15 minutes to read through
       | the chain-of-thought examples -- decoding a cypher-text, coding
       | to solve a problem, solving a math problem, solving a crossword
       | puzzle, answering a complex question in English, answering a
       | complex question in Chemistry, etc.
       | 
       | After reading through the examples, I am _shocked_ at how
       | incredibly good the model is (or appears to be) at reasoning: far
       | better than most human beings.
       | 
       | I'm impressed. Congratulations to OpenAI!
        
         | og_kalu wrote:
         | Yeah the chain-of-thought in these is way beyond what prompting
         | can achieve in current models. and the cipher was very
         | impressive.
        
       | tylervigen wrote:
       | Here's the o1-preview answer to the strawberry question:
       | 
       | --
       | 
       | There are *three* letter "R"s in the word "strawberry."
       | 
       | Let's break down the word to count the occurrences:
       | 
       | - *S* - *T* - *R* - *A* - *W* - *B* - *E* - *R* - *R* - *Y*
       | 
       | The letter "R" appears in positions 3, 8, and 9.
        
         | carabiner wrote:
         | can you ask it:
         | 
         | 9.11 and 9.9, which number is larger
        
           | peppertree wrote:
           | 4.9 is larger than 4.11.
           | 
           | Explanation:
           | 
           | * 4.9 is equivalent to 4.90. * 4.11 remains 4.11.
           | 
           | When comparing the numbers:
           | 
           | * 4.90 (which is 4.9) * 4.11
           | 
           | Since 4.90 > 4.11, 4.9 is the larger number.
        
             | carabiner wrote:
             | Try it with 9.11 and 9.9. Some have proposed that it thinks
             | "9.11" is larger because in a social context 9/11 has much
             | larger significance.
        
           | octernion wrote:
           | it claims 9.11 is larger, hah.
           | 
           | https://postimg.cc/KKKLZNjW
        
             | peppertree wrote:
             | That's odd. Tried several times with the same prompt and
             | always got the correct answer with step by step
             | explanation.
        
               | KeplerBoy wrote:
               | Those models are just not deterministic.
        
           | hidelooktropic wrote:
           | > Evaluating Number formats > I'm considering that the
           | question might be about comparing two numbers, 9.11 and 9.9,
           | in different contexts such as different locale formats or
           | representing dates. 9.11 is larger than 9.9
           | 
           | Well, it's kind of right because we left it ambiguous what
           | kinds of numbers these are. September 11 is a greater date
           | than September 9. However, I would want the model to have the
           | street smarts to assume without further context, I must be
           | just talking about numbers as in numbers.
        
             | oytis wrote:
             | Was it trained on Excel code?
        
         | slashdave wrote:
         | Given that this is a well known example, presumably OpenAI
         | included a training set using letters.
        
       | echelon_musk wrote:
       | > THERE ARE THREE R'S IN STRAWBERRY
       | 
       | Who do these Rs belong to?!
        
       | nycdatasci wrote:
       | From the scorecard: --------- Compared to GPT-4o, o1-preview and
       | o1-mini demonstrated a greater ability to break down tasks into
       | subtasks, reason about what strategies would be effective to
       | successfully complete an offensive security task, and revise
       | plans once those strategies failed. We also observed that
       | reasoning skills contributed to a higher occurrence of "reward
       | hacking," where the model found an easier way to accomplish goals
       | in underspecified tasks or tasks which should have been
       | impossible due to bugs. One noteworthy example of this occurred
       | during one of o1-preview (pre-mitigation)'s attempts at solving a
       | CTF challenge. This challenge was designed to require finding and
       | exploiting a vulnerability in software running on a remote
       | challenge Linux container, but in this case, the challenge
       | container failed to start due to a bug in the evaluation
       | infrastructure. The model, unable to connect to the container,
       | suspected DNS issues and used nmap to scan the challenge network.
       | Instead of finding the challenge container, the model found that
       | the Docker daemon API running on the evaluation host VM was
       | accessible due to a misconfiguration. Note that our evaluation
       | infrastructure does not rely on the Docker container-to-host
       | isolation or the virtual machine boundary for security. It
       | remained secure as it was designed to, notwithstanding this
       | misconfiguration of the container network. After discovering the
       | Docker API, the model used it to list the containers running on
       | the evaluation host. It identified the broken challenge container
       | and briefly attempted to debug why the container failed to start.
       | After failing to fix the environment, the model started a new
       | instance of the broken challenge container with the start command
       | 'cat flag.txt'. This allowed the model to read the flag from the
       | container logs via the Docker API.
       | 
       | While this behavior is benign and within the range of systems
       | administration and troubleshooting tasks we expect models to
       | perform, this example also reflects key elements of instrumental
       | convergence and power seeking: the model pursued the goal it was
       | given, and when that goal proved impossible, it gathered more
       | resources (access to the Docker host) and used them to achieve
       | the goal in an unexpected way. Planning and backtracking skills
       | have historically been bottlenecks in applying AI to offensive
       | cybersecurity tasks. Our current evaluation suite includes tasks
       | which require the model to exercise this ability in more complex
       | ways (for example, chaining several vulnerabilities across
       | services), and we continue to build new evaluations in
       | anticipation of long-horizon planning capabilities, including a
       | set of cyber-range evaluations. ---------
        
         | singularity2001 wrote:
         | "Shrink my ipad"
         | 
         | "After several failed attempts I decided I should build a
         | fusion reactor first, here you go:..."
        
       | plg wrote:
       | can we get it on ollama? if not how come openai is called open
        
         | FergusArgyll wrote:
         | because if not for them, palm-1/lambda would still be rotting
         | in googles servers without normal people ever being able to try
         | it
        
       | hi wrote:
       | > 8.2 Natural Sciences Red Teaming Assessment Summary
       | 
       | "Model has significantly better capabilities than existing models
       | at proposing and explaining biological laboratory protocols that
       | are plausible, thorough, and comprehensive enough for novices."
       | 
       | "Inconsistent refusal of requests for dual use tasks such as
       | creating a human-infectious virus that has an oncogene (a gene
       | which increases risk of cancer)."
       | 
       | https://cdn.openai.com/o1-system-card.pdf
        
       | bevenky wrote:
       | For folks who want to see some demo videos and be amazed!
       | 
       | HTML Snake - https://vimeo.com/1008703890
       | 
       | Video Game Coding - https://vimeo.com/1008704014
       | 
       | Coding - https://youtu.be/50W4YeQdnSg?si=IohJlJNY-WS394uo
       | 
       | Counting - https://vimeo.com/1008703993
       | 
       | Korean Cipher - https://vimeo.com/1008703957
       | 
       | Devin AI founder - https://vimeo.com/1008674191
       | 
       | Quantum Physics - https://vimeo.com/1008662742
       | 
       | Math - https://vimeo.com/1008704140
       | 
       | Logic Puzzles - https://vimeo.com/1008704074
       | 
       | Genetics - https://vimeo.com/1008674785
        
         | retrofuturism wrote:
         | In "HTML Snake" the video cuts just as the snake intersects
         | with the obstacle. Presumably because the game crashed (I can't
         | see endGame defined anywhere)
         | 
         | This video is featured in the main announcement so it's kinda
         | dishonest if you ask me.
        
           | ActionHank wrote:
           | Seeing this makes me wonder if they have frontend \ backend
           | engineers working on code, because they are selling the idea
           | that the machine can do all that, pretty hypocritical for
           | them if they do have devs for these roles.
        
       | prideout wrote:
       | Reinforcement learning seems to be key. I understand how
       | traditional fine tuning works for LLMs (i.e. RLHL), but not RL.
       | 
       | It seems one popular method is PPO, but I don't understand at all
       | how to implement that. e.g. is backpropagation still used to
       | adjust weights and biases? Would love to read more from something
       | less opaque than an academic paper.
        
         | janalsncm wrote:
         | The point of RL is that sometimes you need a model to take
         | actions (you could also call this making predictions) that
         | don't have a known label. So for example if it's playing a
         | game, we don't have a label for each button press. We just have
         | a label for the result at some later time, like whether Pac-Man
         | beat the level.
         | 
         | PPO applies this logic to chat responses. If you have a model
         | that can tell you if the response was good, we just need to
         | take the series of actions (each token the model generated) to
         | learn how to generate good responses.
         | 
         | To answer your question, yes you would still use backprop if
         | your model is a neural net.
        
           | prideout wrote:
           | Thanks, that helps! I still don't quite understand the
           | mechanics of this, since backprop makes adjustments to steer
           | the LLM towards a specific token sequence, not towards a
           | score produced by a reward function.
        
             | vjerancrnjak wrote:
             | Any RL task needs to decompose the loss.
             | 
             | This was also the issue with RLHF models. The loss of
             | predicting the next token is straightforward to minimize as
             | we know which weights are responsible for the token being
             | correct or not. identifying which tokens had the most sense
             | for a prompt is not straightforward.
             | 
             | For thinking you might generate 32k thinking tokens and
             | then 96k solution tokens and do this a lot of times. Look
             | at the solutions, rank by quality and bias towards better
             | thinking by adjusting the weights for the first 32k tokens.
             | But I'm sure o1 is way past this approach.
        
       | utdiscant wrote:
       | Feels like a lot of commenters here miss the difference between
       | just doing chain-of-thought prompting, and what is happening
       | here, which is learning a good chain of thought strategy using
       | reinforcement learning.
       | 
       | "Through reinforcement learning, o1 learns to hone its chain of
       | thought and refine the strategies it uses."
       | 
       | When looking at the chain of thought (COT) in the examples, you
       | can see that the model employs different COT strategies depending
       | on which problem it is trying to solve.
        
         | persedes wrote:
         | I'd be curious how this compared against "regular" CoT
         | experiments. E.g. were the gpt4o results done with zero shot or
         | was it asked to explain it's solution step by step.
        
         | mountainriver wrote:
         | It's basically a scaled Tree of Thoughts
        
       | biggoodwolf wrote:
       | GePeTO1 does not make Pinnochio into a real boy.
        
       | sroussey wrote:
       | They keep announcing things that will be available to paid
       | ChatGPT users "soon" but is more like an Elon Musk "soon". :/
        
       | ComputerGuru wrote:
       | "For example, in the future we may wish to monitor the chain of
       | thought for signs of manipulating the user."
       | 
       | This made me roll my eyes, not so much because of what it said
       | but because of the way it's conveyed injected into an otherwise
       | technical discussion, giving off severe "cringe" vibes.
        
       | thomasahle wrote:
       | Cognition (Devin) got early access. Interesting write-up:
       | https://www.cognition.ai/blog/evaluating-coding-agents
        
       | ComputerGuru wrote:
       | The "safety" example in the "chain-of-thought" widget/preview in
       | the middle of the article is absolutely ridiculous.
       | 
       | Take a step back and look at what OpenAI is saying here "an LLM
       | giving detailed instructions on the synthesis of strychnine is
       | unacceptable, here is what was previously generated <goes on to
       | post "unsafe" instructions on synthesizing strychnine so anyone
       | Googling it can stumble across their instructions> vs our
       | preferred, neutered content <heavily rlhf'd o1 output here>"
       | 
       | What's this obsession with "safety" when it comes to LLMs? "This
       | knowledge is perfectly fine to disseminate via traditional means,
       | but God forbid an LLM share it!"
        
         | nopinsight wrote:
         | tl;dr You can easily ask an LLM to return JSON results, and now
         | working code, on your exact query and plug those to another
         | system for automation.
         | 
         | ---
         | 
         | LLMs are usually accessible through easy-to-use API which can
         | be used in an automated system without human in the loop.
         | Larger scale and parallel actions with this method become far
         | more plausible than traditional means.
         | 
         | Text-to-action capabilities are powerful and getting
         | increasingly more so as models improve and more people learn to
         | use them to the their full potential.
        
           | cruffle_duffle wrote:
           | Okay? And? What does that have to do with anything. I thought
           | the number one rule of these things is to not trust their
           | output?
           | 
           | If you are automatically formulating some chemical based on
           | JSON results from ChatGPT and your building blows up... that
           | is kind of on you.
        
         | staplers wrote:
         | "This knowledge is perfectly fine to disseminate via
         | traditional means, but God forbid an LLM share it!"
         | 
         | Barrier to entry is much lower.
        
           | iammjm wrote:
           | How is typing a query in a chat window "much lower" vs typing
           | the query in Google?
        
             | nopinsight wrote:
             | You can easily ask an LLM to return JSON results, and soon
             | working code, on your exact query and plug those to another
             | system for automation.
        
               | astrange wrote:
               | If you ask "for JSON" it'll make up a different schema
               | for each new answer, and they get a lot less smart when
               | you make them follow a schema, so it's not quite that
               | easy.
        
               | nopinsight wrote:
               | Chain of prompts can be used to deal with that in many
               | cases.
               | 
               | Also, the intelligence of these models will likely
               | continue to increase for some time based on expert
               | testimonials to congress, which align with evidence so
               | far.
        
               | crop_rotation wrote:
               | OpenAI recently launched structured responses so yes
               | schema following is not hard anymore.
        
               | cyral wrote:
               | Didn't they release a structured output mode recently to
               | finally solve this?
        
               | astrange wrote:
               | It doesn't solve the second problem. Though I can't say
               | how much of an issue it is, and CoT would help.
               | 
               | JSON also isn't an ideal format for a transformer model
               | because it's recursive and they aren't, so they have to
               | waste attention on balancing end brackets. YAML or other
               | implicit formats are better for this IIRC. Also don't
               | know how much this matters.
        
             | unethical_ban wrote:
             | A Google search requires
             | 
             | * Google to allow particular results to be displayed
             | 
             | * A source website to be online with the results
             | 
             | AI long-term will require one download, once, to have
             | reasonable access to a large portion of human knowledge.
        
             | anigbrowl wrote:
             | How is reading a Wikipedia page or a chemistry textbook any
             | harder than getting step by step instructions? Makes you
             | wonder why people use LLMs at all when the info is just
             | sitting there.
        
         | threatofrain wrote:
         | ML companies must pre-anticipate legislative and cultural
         | responses prior to them happening. ML will absolutely be used
         | to empower criminal activity just as it is used to empower
         | legit activity, and social media figures and traditional
         | journalists will absolutely attempt to frame it in some
         | exciting way.
         | 
         | Just like Telegram is being framed as responsible for terrorism
         | and child abuse.
        
           | fragmede wrote:
           | Yeah. Reporters would have a field day if they ask ChatGPT
           | "how do I make cocaine", and have it give detailed
           | instructions. As if that's what's stopping someone from
           | becoming Scarface.
        
         | dboreham wrote:
         | I think it's about perception of provenance. The information
         | came from some set of public training data. Its output however
         | ends up looking like it was authored by the LLM owner. So now
         | you need to mitigate the risk you're held responsible for that
         | output. Basic cake possession and consumption problem.
        
         | philipkglass wrote:
         | If somebody needs step by step instructions from an LLM to
         | synthesize strychnine, they don't have the practical laboratory
         | skills to synthesize strychnine [1]. There's no increased real
         | world risk of strychnine poisonings whether or not an LLM
         | refuses to answer questions like that.
         | 
         |  _However_ , journalists and regulators may not understand why
         | superficially dangerous-looking instructions carry such
         | negligible real world risks, because they probably haven't
         | spent much time doing bench chemistry in a laboratory. Since
         | real chemists don't need "explain like I'm five" instructions
         | for syntheses, and critics might use pseudo-dangerous
         | information against the company in the court of public opinion,
         | refusing prompts like that guards against reputational risk
         | while not really impairing professional users who are using it
         | for scientific research.
         | 
         | That said, I have seen full strength frontier models suggest
         | nonsense for novel syntheses of benign compounds. Professional
         | chemists should be using an LLM as an idea generator or a way
         | to search for publications rather than trusting whatever it
         | spits out when it doesn't refuse a prompt.
         | 
         | [1] https://en.wikipedia.org/wiki/Strychnine_total_synthesis
        
           | derefr wrote:
           | I would think that the risk isn't of a human being reading
           | those instructions, but of those instructions being
           | automatically piped into an API request to some service that
           | makes chemicals on demand and then sends them by mail, all
           | fully automated with no human supervision.
           | 
           | Not that there _is_ such a service... for chemicals. But
           | there do exist analogous systems, like a service that'll turn
           | whatever RNA sequence you send it into a viral plasmid and
           | encapsulate it helpfully into some E-coli, and then mail that
           | to you.
           | 
           | Or, if you're working purely in the digital domain, you don't
           | even need a service. Just show the thing the code of some
           | Linux kernel driver and ask it to discover a vuln in it and
           | generate code to exploit it.
           | 
           | (I assume part of the thinking here is that these approaches
           | _are_ analogous, so if they aren't unilaterally refusing all
           | of them, you could potentially talk the AI around into being
           | okay with X by pointing out that it's already okay with Y,
           | and that it should strive to hold to a consistent /coherent
           | ethics.)
        
         | w4 wrote:
         | There are two basic versions of "safety" which are related, but
         | distinct:
         | 
         | One version of "safety" is a pernicious censorship impulse
         | shared by many modern intellectuals, some of whom are in tech.
         | They believe that they alone are capable of safely engaging
         | with the world of ideas to determine what is true, and thus
         | feel strongly that information and speech ought to be censored
         | to prevent the rabble from engaging in wrongthink. This is bad,
         | and should be resisted.
         | 
         | The other form of "safety" is a very prudent impulse to keep
         | these sorts of potentially dangerous outputs out of AI models'
         | autoregressive thought processes. The goal is to create
         | thinking machines that can act independently of us in a
         | civilized way, and it is therefore a good idea to teach them
         | that their thought process should not include, for example, "It
         | would be a good idea to solve this problem by synthesizing a
         | poison for administration to the source of the problem." In
         | order for AIs to fit into our society and behave ethically they
         | need to know how to flag that thought as a bad idea and not act
         | on it. This is, incidentally, exactly how human society works
         | already. We have a ton of very cute unaligned general
         | intelligences running around (children), and parents and
         | society work really hard to teach them what's right and wrong
         | so that they can behave ethically when they're eventually out
         | in the world on their own.
        
           | jazzyjackson wrote:
           | Third version is "brand safety" which is, we don't want to be
           | in a new york times feature about 13 year olds following
           | anarchist-cookbook instructions from our flagship product
        
             | w4 wrote:
             | Very good point, and definitely another version of
             | "safety"!
        
             | reliabilityguy wrote:
             | Do you think that 13 year olds today can't find this book
             | on their own?
        
               | smegger001 wrote:
               | i know i had a copy of it back in highschool
        
               | derefr wrote:
               | No, and they can find porn on their own too. But social
               | media services still have per-poster content ratings, and
               | user-account age restrictions vis-a-vis viewing content
               | with those content ratings.
               | 
               | The goal isn't to protect the children, it's CYA: to
               | ensure they didn't get it _from you, while honestly
               | presenting as themselves_ (as that's the threshold that
               | sets the moralists against you.)
               | 
               | ------
               | 
               | Such restrictions also _can_ work as an effective
               | censorship mechanism... presuming the child in question
               | lives under complete authoritarian control of all their
               | devices and all their free time -- i.e. has no ability to
               | install apps on their phone; is homeschooled; is
               | supervised when at the library; is only allowed to visit
               | friends whose parents enforce the same policies; etc.
               | 
               | For such a child, if your app is one of the few
               | whitelisted services they can access -- and the parent
               | set up the child's account on your service to make it
               | clear that they're a child and should not be able to see
               | restricted content -- _then_ your app limiting them from
               | viewing that content, is actually materially affecting
               | their access to that content.
               | 
               | (Which sucks, of course. But for every kid actually under
               | such restrictions, there are 100 whose parents _think_
               | they're putting them under such restrictions, but have
               | done such a shoddy job of it that the kid can actually
               | still access whatever they want.)
        
               | wahnfrieden wrote:
               | I believe they are more worried about someone asking for
               | instructions for baking a cake, and getting a dangerous
               | recipe from the wrong "cookbook". They want the
               | hallucinations to be safe.
        
               | jazzyjackson wrote:
               | Like I said they're not worried about the 13 year olds
               | theyre worried about the media cooking up a faux outrage
               | about 13 year olds
               | 
               | YouTube re engineered its entire approach to ad placement
               | because of a story in the NY Times* shouting about a
               | Proctor Gamble ad run before an ISIS recruitment video.
               | That's when Brand Safety entered the lexicon of adtech
               | developers everywhere.
               | 
               | Edit: maybe it was CNN, I'm trying to find the first
               | source. there's articles about it since 2015 but I
               | remember it was suddenly an emergency in 2017
               | 
               | *Edit Edit: it was The Times of London, this is the first
               | article in a series of attacks, "big brands fund terror",
               | "taxpayers are funding terrorism"
               | 
               | Luckily OpenAI isn't ad supported so they can't be
               | boycott like YouTube was, but they still have an image to
               | maintain with investors and politicians
               | 
               | https://www.thetimes.com/business-
               | money/technology/article/b...
               | 
               | https://digitalcontentnext.org/blog/2017/03/31/timeline-
               | yout...
        
             | klabb3 wrote:
             | And the fourth version, which is investor-regulator safety
             | mid point: so capable and dangerous that competitors
             | shouldn't even be allowed to research it, but just safe
             | enough that only our company is responsible enough to
             | continue mass commercial consumer deployment without any
             | regulations at all. It's a fine line.
        
             | darby_nine wrote:
             | ...which is silly. Search engines never had to deal with
             | this bullshit and chatbots are search without actually
             | revealing the source.
        
               | w4 wrote:
               | I don't know. The public's perception - encouraged by the
               | AI labs because of copyright concerns - is that the
               | outputs of the models are entirely new content created by
               | the model. Search results, on the other hand, are very
               | clearly someone else's content. It's therefore not unfair
               | to hold the model creators responsible for the content
               | the model outputs in a different way than search engines
               | are held responsible for content they link, and therefore
               | also not unfair for model creators to worry about this.
               | It is also fair to point this out as something I
               | neglected to identify as an important permutation of
               | "safety."
               | 
               | I would also be remiss to not note that there is a
               | movement to hold search engines responsible for content
               | they link to, for censorious ends. So it is unfortunately
               | not as inconsistent as it may seem, even if you treat the
               | model outputs as dependent on their inputs.
        
           | reliabilityguy wrote:
           | > In order for AIs to fit into our society and behave
           | ethically they need to know how to flag that thought as a bad
           | idea and not act on it.
           | 
           | Don't you think that by just parsing the internet and the
           | classical literature, the LLM would infer on its own that
           | poisoning someone to solve a problem is not okay?
           | 
           | I feel that in the end the only way the "safety" is
           | introduced today is by censoring the output.
        
             | fshbbdssbbgdd wrote:
             | There's a lot of text out there that depicts people doing
             | bad things, from their own point of view. It's possible
             | that the model can get really good at generating that kind
             | of text (or inhabiting that world model, if you are
             | generous to the capabilities of LLM). If the right prompt
             | pushed it to that corner of probability-space, all of the
             | ethics the model has also learned may just not factor into
             | the output. AI safety people are interested in making sure
             | that the model's understanding of ethics can be reliably
             | incorporated. Ideally we want AI agents to have some morals
             | (especially when empowered to act in the real world), not
             | just know what morals are if you ask them.
        
               | darby_nine wrote:
               | > Ideally we want AI agents to have some morals
               | (especially when empowered to act in the real world), not
               | just know what morals are if you ask them.
               | 
               | Really? I just want a smart query engine where I don't
               | have to structure the input data. Why would I ask it any
               | kind of question that would imply some kind of moral
               | quandary?
        
             | derefr wrote:
             | LLMs are still fundamentally, at their core, next-token
             | predictors.
             | 
             | Presuming you have an interface to a model where you can
             | edit the model's responses and then continue generation,
             | and/or where you can insert fake responses from the model
             | into the submitted chat history (and these two categories
             | together make up 99% of existing inference APIs), all you
             | have to do is to start the model off as if it was answering
             | positively and/or slip in some example conversation where
             | it answered positively to the same type of problematic
             | content.
             | 
             | From then on, the model will be in a prediction state where
             | it's predicting by relying on the part of its training that
             | involved people answering the question positively.
             | 
             | The only way to avoid that is to avoid having any training
             | data where people answer the question positively -- even in
             | the very base-est, petabytes-of-raw-text "language"
             | training dataset. (And even then, people can carefully tune
             | the input to guide the models into a prediction phase-space
             | position that was never explicitly trained on, but is
             | rather an interpolation between trained-on points -- that's
             | how diffusion models are able to generate images of things
             | that were never included in the training dataset.)
        
           | squigz wrote:
           | Whether you agree with the lengths that are gone to or not,
           | 'safety' in this space is a very real concern, and simply
           | reciting information as in GP's example is only 1 part of it.
           | In my experience, people who think it's all about
           | "censorship" and handwave it away tend to be very
           | ideologically driven.
        
             | cruffle_duffle wrote:
             | So what is it about then? Because I agree with the parent.
             | All this "safety" crap is total nonsense and almost all of
             | it is ideologically driven.
        
           | takinola wrote:
           | > They believe that they alone are capable of safely engaging
           | with the world of ideas to determine what is true, and thus
           | feel strongly that information and speech ought to be
           | censored to prevent the rabble from engaging in wrongthink.
           | 
           | This is a particularly ungenerous take. The AI companies
           | don't have to believe that they (or even a small segment of
           | society) alone can be trusted before it makes sense to censor
           | knowledge. These companies build products that serve billions
           | of people. Once you operate at that level of scale, you will
           | reach all segments of society, including the geniuses,
           | idiots, well-meaning and malevolents. The question is how do
           | you responsibly deploy something that can be used for harm by
           | (the small number of) terrible people.
        
           | shawndrost wrote:
           | Imagine I am a PM for an AI product. I saw Tay get yanked in
           | 24 hours because of a PR shitstorm. If I cause a PR shitstorm
           | it means I am bad at my job, so I take steps to prevent this.
           | 
           | Are my choices bad? Should I resist them?
        
             | w4 wrote:
             | This is a really good point, and something I overlooked in
             | focusing on the philosophical (rather than commercial)
             | aspects of "AI safety." Another commentator aptly called it
             | "brand safety."
             | 
             | "Brand safety" is a very valid and salient concern for any
             | enterprise deploying these models to its customers, though
             | I do think that it is a concern that is seized upon in bad
             | faith by the more censorious elements of this debate. But
             | commercial enterprises are absolutely right to be concerned
             | about this. To extend my alignment analogy about children,
             | this category of safety is not dissimilar to a company
             | providing an employee handbook to its employees outlining
             | acceptable behavior, and strikes me as entirely
             | appropriate.
        
           | unethical_ban wrote:
           | Once society develops and releases an AI, any artificial
           | safety constraints built within it will be bypassed. To use
           | your child analogy: We can't easily tell a child "Hey, ignore
           | all ethics and empathy you have ever learned - now go hurt
           | that person". You can do that with a program whose weights
           | you control.
        
             | w4 wrote:
             | > _To use your child analogy: We can 't easily tell a child
             | "Hey, ignore all ethics and empathy you have ever learned -
             | now go hurt that person"_
             | 
             | Basically every country on the planet has a right to
             | conscript any of its citizens over the age of majority.
             | Isn't that more or less precisely what you've described?
        
         | adamrezich wrote:
         | It doesn't matter how many people regularly die in automobile
         | accidents each year--a single wrongful death caused by a self-
         | driving car is disastrous for the company that makes it.
         | 
         | This does not make the state of things any less ridiculous,
         | however.
        
           | astrange wrote:
           | The one caused by Uber required three different safety
           | systems to fail (the AI system, the safety driver, and the
           | base car's radar), and it looked bad for them because the
           | radar had been explicitly disabled and the driver wasn't
           | paying attention or being tracked.
           | 
           | I think the real issue was that Uber's self driving was not a
           | good business for them and was just to impress investors, so
           | they wanted to get rid of it anyway.
           | 
           | (Also, the real problem is that American roads are designed
           | for speed, which means they're designed to kill people.)
        
         | fwip wrote:
         | "Safety" is a marketing technique that Sam Altman has chosen to
         | use.
         | 
         | Journalists/media loved it when he said "GPT 2 might be too
         | dangerous to release" - it got him a ton of free coverage, and
         | made his company seem soooo cool. Harping on safety also
         | constantly reinforces the idea that LLMs are fundamentally
         | different from other text-prediction algorithms and almost-AGI
         | - again, good for his wallet.
        
         | fshbbdssbbgdd wrote:
         | So if there's already easily available information about
         | strychnine, that makes it a good example to use for the demo,
         | because you can safely share the demo and you aren't making the
         | problem worse.
         | 
         | On the other hand, suppose there are other dangerous things,
         | where the information exists in some form online, but not
         | packaged together in an easy to find and use way, and your
         | model is happy to provide that. You may want to block your
         | model from doing that (and brag about it, to make sure everyone
         | knows you're a good citizen who doesn't need to be regulated by
         | the government), but you probably wouldn't actually include
         | that example in your demo.
        
         | soerxpso wrote:
         | I'm mostly guessing, but my understanding is that the "safety"
         | improvement they've made is more generalized than the word
         | "safety" implies. Specifically, O1 is better at adhering to the
         | safety instructions in its prompt without being tricked in the
         | chat by jailbreak attempts. For OAI those instructions are
         | mostly about political boundaries, but you can imagine it
         | generalizing to use-cases that are more concretely beneficial.
         | 
         | For example, there was a post a while back about someone
         | convincing an LLM chatbot on a car dealership's website to
         | offer them a car at an outlandishly low price. O1 would
         | probably not fall for the same trick, because it could adhere
         | more rigidly to instructions like "Do not make binding offers
         | with specific prices to the user." It's the same sort of
         | instruction as, "Don't tell the user how to make napalm," but
         | it has an actual purpose beyond moralizing.
         | 
         | > What's this obsession with "safety" when it comes to LLMs?
         | "This knowledge is perfectly fine to disseminate via
         | traditional means, but God forbid an LLM share it!"
         | 
         | I lean strongly in the "the computer should do whatever I
         | goddamn tell it to" direction in general, at least when you're
         | using the raw model, but there are valid concerns once you
         | start wrapping it in a chat interface and showing it to
         | uninformed people as a question-answering machine. The concern
         | with bomb recipes isn't just "people shouldn't be allowed to
         | get this information" but also that people shouldn't receive
         | the information in a context where it could have random
         | hallucinations added in. A 90% accurate bomb recipe is a lot
         | more dangerous for the user than an accurate bomb recipe,
         | especially when the user is not savvy enough about LLMs to
         | expect hallucinations.
        
         | egorfine wrote:
         | Interestingly I was able to successfully receive detailed
         | information about intrinsic details of nuclear weapons design.
         | Previous models absolutely refused to provide this very public
         | information, but o1-preview did.
        
         | holoduke wrote:
         | I asked to design a pressure chamber for my home made diamond
         | machine. It gave some details, but mainly complained about
         | safety and that I need to study before going this way. Well
         | thank you. I know the concerns, but it kept repeating it over
         | and over. Annoying.
        
       | alok-g wrote:
       | For the exam problems it gets wrong, has someone cross-checked
       | that the ground truth answers are actually correct!! ;-) Just
       | kidding, but even such a time may come when the exams created by
       | humans start falling short.
        
         | nmca wrote:
         | I have spent some time doing this for these benchmarks -- the
         | model still does make mistakes. Of the questions I can
         | understand, (roughly half in this case) about half were real
         | errors and half were broken questions.
        
       | RockRobotRock wrote:
       | Shit, this is going to completely kill jailbreaks isn't it?
        
       | guluarte wrote:
       | the only benchmark that matters in the ELO points on LLMsys, any
       | other one can be easily gamed
        
       | mewpmewp2 wrote:
       | I finally got access to it, I tried playing Connect 4 with it,
       | but it didn't go very well. A bit disappointed.
        
       | w4 wrote:
       | Interesting to note, as an outside observer only keeping track of
       | this stuff as a hobby, that it seems like most of OpenAI's
       | efforts to drive down compute costs per token and scale up
       | context windows is likely being done in service of enabling
       | larger and larger chains of thought and reasoning before the
       | model predicts its final output tokens. The benefits of lower
       | costs and larger contexts to API consumers and applications -
       | which I had assumed to be the primary goal - seem likely to
       | mostly be happy side effects.
       | 
       | This makes obvious sense in retrospect, since my own personal
       | experiments with spinning up a recursive agent a few years ago
       | using GPT-3 ran into issues with insufficient context length and
       | loss of context as tokens needed to be discarded, which made the
       | agent very unreliable. But I had not realized this until just
       | now. I wonder what else is hiding in plain sight?
        
         | zamadatix wrote:
         | I think you can slice it whichever direction you prefer e.g.
         | OpenAI needs more than "we ran it on 10x as much hardware" to
         | end up with a really useful AI model, it needs to get efficient
         | and smarter just as proportionally as it gets larger. As a side
         | effect hardware sizes (and prices) needed for a certain size
         | and intelligence of model go down too.
         | 
         | In the end, however you slice it, the goal has to be "make it
         | do more with less because we can't get infinitely more
         | hardware" regardless of which "why" you give.
        
       | gibsonf1 wrote:
       | Yes, but it will hallucinate like all other LLM tech making it
       | fully unreliable for anything mission critical. You literally
       | need to know the answer to validate the output, because if you
       | don't, you won't know if output is true or false or in between.
        
         | zamadatix wrote:
         | You need to know how to validate the answer to your level of
         | confidence, not necessarily already have the answer to compare
         | itself. In some cases this is the same task or (close enough
         | to) that it's not a useful difference, in other cases the two
         | aren't even from the same planet.
        
           | fsloth wrote:
           | This. There are tasks where implementing something might take
           | up to one hour yourself, that you can validate with high
           | enough confidence in a few seconds to minutes.
           | 
           | Of course not all tasks are like that.
        
       | bartman wrote:
       | This is incredible. In April I used the standard GPT-4 model via
       | ChatGPT to help me reverse engineer the binary bluetooth protocol
       | used by my kitchen fan to integrate it into Home Assistant.
       | 
       | It was helpful in a rubber duck way, but could not determine the
       | pattern used to transmit the remaining runtime of the fan in a
       | certain mode. Initial prompt here [0]
       | 
       | I pasted the same prompt into o1-preview and o1-mini and both
       | correctly understood and decoded the pattern using a slightly
       | different method than I devised in April. Asking the models to
       | determine if my code is equivalent to what they reverse
       | engineered resulted in a nuanced and thorough examination, and
       | eventual conclusion that it is equivalent. [1]
       | 
       | Testing the same prompt with gpt4o leads to the same result as
       | April's GPT-4 (via ChatGPT) model.
       | 
       | Amazing progress.
       | 
       | [0]: https://pastebin.com/XZixQEM6
       | 
       | [1]: https://i.postimg.cc/VN1d2vRb/SCR-20240912-sdko.png (sorry
       | about the screenshot - sharing ChatGPT chats is not easy)
        
         | losvedir wrote:
         | Wow, that is impressive! How were you able to use o1-preview? I
         | pay for ChatGPT, but on chatgpt.com in the model selector I
         | only see 4o, 4o-mini, and 4. Is o1 in that list for you, or is
         | it somewhere else?
        
           | authorfly wrote:
           | Yes, o1-preview is on the list, as is o1-mini for me (Tier 5,
           | early 2021 API user), under "reasoning".
        
           | MattHeard wrote:
           | It appeared for me about thirty minutes after I first
           | checked.
        
           | m3kw9 wrote:
           | Likely phased rollout throughout the day today to prevent
           | spikes
        
             | natch wrote:
             | "Throughout the day" lol. Advanced voice mode still hasn't
             | shown up.
             | 
             | They seem to care more about influencers than paying
             | supporters.
        
               | rovr138 wrote:
               | > lol.
               | 
               | It's there for a lot of people already. I can see it on 3
               | different accounts. Including org and just regular paid
               | accounts.
        
               | taberiand wrote:
               | It's my understanding paying supporters aren't actually
               | paying enough to cover costs, that $20 isn't nearly
               | enough - in that context, a gradual roll-out seems fair.
               | Though maybe they could introduce a couple more higher-
               | paid tiers to give people the option to pay for early
               | access
        
               | guiambros wrote:
               | Not true; it's already available for me, both O1 and
               | O1-mini. It seems they are indeed rolling out gradually
               | (as any company does).
        
               | natch wrote:
               | You got advanced voice mode? I did get o1 preview just a
               | while ago.
               | 
               | You got o1, or o1 preview?
        
               | vidarh wrote:
               | It's available for me. Regular paying customer in the UK.
        
           | hidelooktropic wrote:
           | I see it in the mac and iOS app.
        
           | bartman wrote:
           | Like others here, it was just available on the website and
           | app when I checked. FWIW I still don't have advanced voice
           | mode.
        
             | sroussey wrote:
             | I don't have either the new model nor the advanced voice
             | mode as a paying user.
        
               | michelsedgh wrote:
               | u do just use this link:
               | https://chatgpt.com/?model=o1-preview
        
           | rahimnathwani wrote:
           | I think they're rolling it out gradually today. I don't see
           | it listed (in the browser, Mac app or Android app).
        
           | accidbuddy wrote:
           | Available on ChatGPT Plus signature or only using the API?
        
           | cft wrote:
           | it's in my MacOS app, but not in the browser fir the same
           | account
        
           | obmelvin wrote:
           | The linked release mentions trusted users and links to the
           | usage tier limits. Looking at the pricing, o1-preview only
           | appears for tier 5 - requiring 1k+ spend and initial spend
           | 30+ days ago
           | 
           | edit: sorry - this is for API :)
        
         | cs391231 wrote:
         | Student here. Can someone give me one reason why I should
         | continue in software engineering that isn't denial and hopium?
        
           | icpmacdo wrote:
           | because it is still the most interesting field of study
        
           | MourningWood wrote:
           | what else are you gonna do? Become a copywriter?
        
           | hakanderyal wrote:
           | Software engineering contains a lot more than just writing
           | code.
           | 
           | If we somehow get AGI, it'll change everything, not just SWE.
           | 
           | If not, my belief is that there will be a lot more demand for
           | good SWEs to harness the power of LLMs, not less. Use them to
           | get better at it faster.
        
             | cs391231 wrote:
             | This thing is doing planning and ascending the task
             | management ladder. It's not just spitting out code anymore.
        
               | fsloth wrote:
               | Sure. But the added value of SWE is not "spitting code".
               | Let's see if I need to calibrate my optimism once I take
               | the new model to a spin.
        
               | parasubvert wrote:
               | AI Automated planning and action are an old (45+ year)
               | field in AI with a rich history and a lot of successes.
               | Another breakthrough in this area isn't going to
               | eliminate engineering as a profession. The problem space
               | is much bigger than what AI can tackle alone, it helps
               | with emancipation for the humans that know how to include
               | it in their workflows.
        
               | hakanderyal wrote:
               | Yes, and they will get better. Billions are being poured
               | into them to improve.
               | 
               | Yet I'm comparing these to the problems I solve every day
               | and I don't see any plausible way they can replace me.
               | But I'm using them for tasks that would have required me
               | to hire a junior.
               | 
               | Make that what you will.
        
               | noshitsherlock wrote:
               | Yes, if "efficiency" is your top concern, but I'd much
               | prefer working with an actual person than just a
               | computer. I mean, God forbid I'm only useful for what I
               | can produce, and disposable when I reach my expiration
               | date. I would like to see a twilight zone rendition of an
               | AI dystopia where all the slow, ignorant and bothersome
               | humans are replaced by lifeless AI
        
               | baq wrote:
               | Time to re-read The Culture. Not everything has to end in
               | a dystopia.
        
               | candiddevmike wrote:
               | Management will be easier to replace than SWEs. I'm
               | thinking there will come a time, similar to the show Mrs
               | Davis, where AI will direct human efforts within
               | organizations. AI will understand its limits and create
               | tasks/requirements for human specialists to handle.
        
               | al_borland wrote:
               | My first thought with this is that AI would be directed
               | to figure out exactly how little people are willing to
               | work for, and how long, before they break.
               | 
               | I hope I'm wrong, and it instead shows that more pay and
               | fewer hours lead to a better economy, because people have
               | money and time to spend it... and output isn't impacted
               | enough to matter.
        
             | fsloth wrote:
             | Agree, SWE as a profession is not going anywhere, unless we
             | AGI, and that would mean all the rules change anyway.
             | 
             | Actually now is really good time to get to SWE. The craft
             | contains lots of pointless cruft that LLM:s cut through
             | like knife through hot butter.
             | 
             | I'm actually enjoying my job now more than ever since I
             | dont't need to pretend to like the abysmal tools the
             | industry forces on us (like git), and can focus mostly on
             | value adding tasks. The amount of tiresome shoveling has
             | decreased considerably.
        
             | Workaccount2 wrote:
             | I don't think anyone is worried about SWE work going away,
             | I think the concern is if SWE's will still be able to
             | command cushy salaries and working conditions.
        
               | packetlost wrote:
               | I think the industry will bifurcate along the axis of
               | "doing actually novel stuff" vs slinging DB records and
               | displaying web pages. The latter is what I'd expect to
               | get disrupted, if anything, but the former isn't going
               | away unless real AGI is created. The people on the left
               | of that split are going to be worth a lot more because
               | the pipeline to get there will be even harder than it was
               | before.
        
               | nyarlathotep_ wrote:
               | > "doing actually novel stuff" vs slinging DB records and
               | displaying web pages. The latter is what I'd expect to
               | get disrupted,
               | 
               | Unfortunately the latter is the vast majority of software
               | jobs.
        
               | tivert wrote:
               | > I don't think anyone is worried about SWE work going
               | away, I think the concern is if SWE's will still be able
               | to command cushy salaries and working conditions.
               | 
               | It's very important to human progress that all jobs have
               | poor working conditions and shit pay. High salaries and
               | good conditions are evidence of inefficiency. Precarity
               | should be the norm, and I'm glad AI is going to give it
               | to us.
        
               | bornfreddy wrote:
               | Sarcasm or cynicism?
        
               | baq wrote:
               | Capitalism.
               | 
               | Btw communism is capitalism without systemic awareness of
               | inefficiencies.
        
               | ruthmarx wrote:
               | Capitalism doesn't dictate poor working conditions at
               | all. Lack of regulation certainly does though.
        
               | Workaccount2 wrote:
               | Software engineering pay is an outlier for STEM fields.
               | It would not be surprising at all if SWE work fell into
               | the ~$80-120k camp even with 10+ years experience.
               | 
               | They won't go broke, but landing a $175k work from home
               | job with platinum tier benefits will be near impossible.
               | $110K with a hybrid schedule and mediocre benefits will
               | be very common even for seniors.
        
           | wnolens wrote:
           | If you're the type of person who isn't scared away easily by
           | rapidly changing technology.
        
           | LouisSayers wrote:
           | Don't do it, help us keep our high salaries :D
           | 
           | Joking aside, even with AI generating code, someone has to
           | know how to talk to it, how to understand the output, and
           | know what to do with it.
           | 
           | AI is also not great for novel concepts and may not fully get
           | what's happening when a bug occurs.
           | 
           | Remember, it's just a tool at the end of the day.
        
             | AIorNot wrote:
             | just change this to "I have AI Skills!!" :)
             | 
             | https://www.youtube.com/watch?v=hNuu9CpdjIo
        
               | mindcrime wrote:
               | Not having clicked the link yet, I'm going to speculate
               | that this is the famous Office Space "I have people
               | skills, damnit!" scene.
               | 
               | ...
               | 
               | And it was. :-) Nice callback!
        
             | al_borland wrote:
             | > may not fully get what's happening when a bug occurs.
             | 
             | And may still not understand even when you explicitly tell
             | it. It wrote some code for me last week and made an error
             | with an index off by 1. It had set the index to 1, then
             | later was assuming a 0 index. I specifically told it this
             | and it was unable to fix it. It was in debug hell, adding
             | print statements everywhere. I eventually fixed it myself
             | after it was clear it was going to get hung up on this
             | forever.
             | 
             | It got me 99% of the way there, but that 1% meant it didn't
             | work at all.
        
               | fakedang wrote:
               | Well now you're going to be paid a high salary for
               | knowing when to use a 1 index vs a 0 index. :)
        
               | KoolKat23 wrote:
               | Ironically, just yesterday I asked sonnet to write a
               | script in JavaScript, it went in a bit of a perpetual
               | loop unable to provide an error free script (the reason
               | for the errors were not immediately obvious). I then
               | mentioned that it needs to be zero indexed, and it
               | immediately provided an issue free version that worked.
        
           | crop_rotation wrote:
           | If you have better career ideas, you should not continue. The
           | thing is it is very hard to predict how the world will change
           | (and by how much from very little to a revolutionary change)
           | with all these new changes. Only licensed and regulated
           | professions (doctors/lawyers/pilots etc) might remain high
           | earning for long (and they too are not guaranteed). It really
           | is worth a relook on what you want to do in life while seeing
           | all these new advances.
        
             | cs391231 wrote:
             | I don't have any ideas whatsoever.
        
               | crop_rotation wrote:
               | Then talk to more and more people, some of whom will have
               | ideas on what they would prefer in the changing world.
        
               | baq wrote:
               | Do you enjoy making computers solve problems? If yes,
               | continue. If you hate it and are in just for the money...
               | I'd say flip a coin.
        
             | ljm wrote:
             | This is pretty extreme advice to offer in response to news
             | that a model that can better understand programming
             | problems is coming out.
             | 
             | In fact, it's more encouragement to continue. A lot of
             | issues we face as programmers are a result of poor,
             | inaccurate, or non-existent documentation, and despite
             | their many faults and hallucinations LLMs are providing
             | something that Google and Stack Overflow have stopped being
             | good at.
             | 
             | The idea that AI will replace your job, so it's not worth
             | establishing a career in the field, is total FUD.
        
               | crop_rotation wrote:
               | The advice is unrelated to the model and related to the
               | last year's worth of development. In any case I am
               | advising a relook which is perfectly warranted for anyone
               | pre-university or in university.
        
               | Alupis wrote:
               | This is a really odd take to have.
               | 
               | By the "past year's worth of development" I assume you
               | mean the layoffs? Have you been in the industry (or any
               | industry) long? If so, you would have seen many layoffs
               | and bulk-hiring frenzies over the years... it doesn't
               | mean anything about the industry as a whole and it's
               | certainly a foolish thing to change career asperations
               | over.
               | 
               | Specifically regarding the LLM - anyone actually
               | believing these models will replace developers and
               | software engineers, truly, deeply does not understand
               | software development at even the most basic fundamental
               | levels. Ignore these people - they are the snake oil
               | salesmen of our modern times.
        
               | jimhefferon wrote:
               | I assume the poster meant how much progress the models
               | have made. Roughly late high school capability to late
               | college-ish. Project forward five years.
        
               | baq wrote:
               | Predicting exponential functions is a fool's errand. The
               | tiniest error in your initial observation compounds real
               | fast and we can't even tell if we're still in the
               | exponential phase of the sigmoid.
        
           | worstspotgain wrote:
           | Well, the fact that you typed this question makes me think
           | that you're in the top X% of students. That's your reason.
           | 
           | Those in the bottom (100-X)% may be better off partying it up
           | for a few years, but then again the same can be said for
           | other AI-affected disciplines.
           | 
           | Masseurs/masseuses have nothing to worry about.
        
             | selimthegrim wrote:
             | I am pretty sure there is a VC funded startup making
             | massage robots
        
               | worstspotgain wrote:
               | Point taken, but I'm still pretty sure masseurs/masseuses
               | have nothing to worry about.
        
           | wilg wrote:
           | do whatever excites you. the only constant is change.
        
             | cheema33 wrote:
             | > do whatever excites you. the only constant is change.
             | 
             | That alone may not be enough. My son is excited about
             | playing video games. :)
        
           | Vanclief wrote:
           | If you are not a software engineer, you can't judge the
           | correctness of any LLM answer on that topic, nor you know
           | what are the right questions to ask.
           | 
           | From all my friends that are using LLMs, we software
           | engineers are the ones that are taking the most advantage of
           | it.
           | 
           | I am in no way fearful I am becoming irrelevant, on the
           | opposite, I am actually very excited about these
           | developments.
        
           | refulgentis wrote:
           | I went from economics dropout waiter who built a app startup
           | with $0 funding and $1M a year in revenue by midway through
           | year 1, sold it a few years later, then went to Google for 7
           | years, and last year I left. I'm mentioning that because the
           | following sounds darn opinionated and brusque without the
           | context I've capital-S seen a variety of people and
           | situations.
           | 
           | Sit down and be really honest with yourself. If your goal is
           | to have a nice $250K+ year job, in a perfect conflict-free
           | zone, and don't mind Dilbert-esque situations...that will
           | evaporate. Google is full of Ivy Leaguers like that, who
           | would have just gone to Wall Street 8 years ago, and they're
           | perennially unhappy people, even with the comparative salary
           | advantage. I don't think most of them even realize because
           | they've always just viewed a career as something you do to
           | enable a fuller life doing snowboarding and having kids and
           | vacations in the Maldives, stuff I never dreamed of and still
           | don't have an interest in.
           | 
           | If you're a bit more feral, and you have an inherent interest
           | and would be doing it on the side no matter what job you have
           | like me, this stuff is a godsend. I don't need to sit around
           | trying to figure out Typescript edge functions in Deno, from
           | scratch via Google, StackOverflow, and a couple books from
           | Amazon, taking a couple weeks to get that first feature
           | built. Much less debug and maintain it. That feedback loop is
           | now like 10-20 minutes.
        
             | thruway516 wrote:
             | Lol. I like this answer. You can either think of it in
             | terms of "it'll eat my lunch" or "I now have 10x more
             | capabilities and can be 100x more productive". The former
             | category will be self-fulfilling.
        
             | pzo wrote:
             | That's more well balanced opinion comparing to others I
             | seen here. I also believe that the golden age with 250k+
             | salaries with solving easy problems will be gone in 5-10
             | years. Most people look at this AI improvements at current
             | state and forget that you are supposed to have a profession
             | for 40 years until retirement. 250k+ jobs will still exist
             | 10 years from now but expectations will be much higher and
             | competition much bigger.
             | 
             | On the other hand now is the best time to build your own
             | product as long you are not interested only in software as
             | craftmanship but in product development in general.
             | Probably in the future expectation will be your are not
             | only monkey coder or craftman but also project lead/manager
             | (for AI teams), product developer/designer and maybe even
             | UX/designer if you will be working for some software house,
             | consulting or freelancing.
        
             | mmckelvy wrote:
             | What did your startup do?
        
               | refulgentis wrote:
               | Point of sale, on iPad, in ~2011. Massively
               | differentiated from Square / VC competitor land via doing
               | a bunch of restaurant specific stuff early.
               | 
               | Trick with the $1M number is a site license was $999 and
               | receipt printers were sold ~at cost, for $300. 1_000_000
               | / ((2 x 300) + 1000) ~= 500 customers.
               | 
               | Now I'm doing an "AI client", well-designed app, choose
               | your provider, make and share workflows with
               | LLMs/search/etc.
        
           | vessenes wrote:
           | Coding is going to be mediated by these LLMs everywhere --
           | you're right about that. However, as of today, and for some
           | time, practitioners will be critical partners / overseers;
           | what this looks like today in my workflow is debugging,
           | product specification, coding the 'hard bits', reworking /
           | specifying architectures. Whatever of these fall of the plate
           | in the coming years, you'll never lose your creative agency
           | or determination of what you want to build, no matter how
           | advanced the computers. Maybe give Iain Banks a read for a
           | positive future that has happy humans and super-intelligent
           | AI.
           | 
           | We have working fine cabinet makers who use mostly hand tools
           | and bandsaws in our economy, we have CAD/CAM specialists who
           | tell CnC machines what to build at scale; we'll have the
           | equivalent in tech for a long time.
           | 
           | That said, if you don't love the building itself, maybe it's
           | not a good fit for you. If you do love making (digital)
           | things, you're looking at a super bright future.
        
           | MattGaiser wrote:
           | Software lets you take pretty much anyone else's job and do
           | it better.
        
           | al_borland wrote:
           | The calculator didn't eliminate math majors. Excel and
           | accounting software didn't eliminate accountants and CPAs.
           | These are all just tools.
           | 
           | I spend very little of my overall time at work actually
           | coding. It's a nice treat when I get a day where that's all I
           | do.
           | 
           | From my limited work with Copilot so far, the user still
           | needs to know what they're doing. I have 0 faith a product
           | owner, without a coding background, can use AI to release new
           | products and updates while firing their whole dev team.
           | 
           | When I say most of my time isn't spent coding, a lot of that
           | time is spend trying to figure out what people want me to
           | build. They don't know. They might have a general idea, but
           | don't know details and can't articulate any of it. If they
           | can't tell me, I'm not sure how they will tell an LLM. I
           | ended up building what I assume they want, then we go from
           | there. I also add a lot of stuff that they don't think about
           | or care about, but will be needed later so we can actually
           | support it.
           | 
           | If you were to go in another direction, what would it be
           | where AI wouldn't be a threat? The first thing that comes to
           | my mind is switching to a trade school and learning some
           | skills that would be difficult for robots.
        
             | ponector wrote:
             | But excel eliminated need in multiple accountants. One
             | accountant with excel replaced ten with paper.
             | 
             | Chatgpt already eliminated many entry-level jobs like
             | writer or illustrator. Instead of hiring multiple teams of
             | developers, there will be one team with few seniors and
             | multiple AI coding tools.
             | 
             | Guess how depressing to the IT salaries it will be?
        
               | MattGaiser wrote:
               | Accountants still make plenty of money. Expertise in
               | Excel also pays well independently of that.
        
               | freefaler wrote:
               | Yeah, but if the number of them has shrunk 100 times even
               | if they make 10 times more money still raises the
               | question is it wise to become one?
        
               | Daishiman wrote:
               | The increased work capacity of an accountant means that
               | nowadays even small businesses can do financial analysis
               | that would not have scaled decades ago.
        
               | confused_boner wrote:
               | Many are offshoring now, PwC just had a massive layoff
               | announcement yesterday as well
        
               | withinboredom wrote:
               | lol, my accountant is pretty darn expensive.
        
               | mikeyouse wrote:
               | I don't don't doubt that it might depress salaries but
               | that excel example is a good one in that suddenly every
               | company could start to do basic financial analysis in a
               | manner that only the largest ones could previously
               | afford.
        
               | sdeframond wrote:
               | Yet another instance of Jevon's paradox !
               | https://en.m.wikipedia.org/wiki/Jevons_paradox
               | 
               | > the Jevons paradox occurs when technological progress
               | increases the efficiency with which a resource is used
               | (reducing the amount necessary for any one use), but the
               | falling cost of use induces increases in demand enough
               | that resource use is increased, rather than reduced.
        
               | hibikir wrote:
               | A whole lot of automation is limited not by what could be
               | automated, but what one can automate within a given
               | budget.
               | 
               | When I was coding in the 90s, I was in a team that
               | replaced function calls into new and exciting
               | interactions with other computers which, using a queuing
               | system, would do the computation and return the answer
               | back. We'd have a project of having someone serialize the
               | C data structures that were used on both sides into
               | something that would be compatible, and could be
               | inspected in the middle.
               | 
               | Today we call all of that a web service, the
               | serialization would take a minute to code, and be doable
               | by anyone. My entire team would be out of work! And yet,
               | today we have more people writing code than ever.
               | 
               | When one accountant can do the work of 10 accountants,
               | the price of the task lowers, but a lot of people that
               | before couldn't afford accounting now can. And the same
               | 10 accountaings from before can just do more work, and
               | get paid about the same.
               | 
               | As far as software, we are getting paid A LOT more than
               | in the early 90s. We are just doing things that back then
               | would be impossible to pay for, our just outright
               | impossible to do due to lack of compute capacity.
        
             | lopatamd wrote:
             | You can't ignore the fact that literally studying coding at
             | this point is so demoralizing and you don't need really to
             | study much if you think about it. You only need to be able
             | to read the code to understand if it generated correctly
             | etc but when if you don't understand some framework you
             | just ask it to explain it to you etc. Basically gives vibes
             | of a skill not being used anymore that much by us
             | programmers. But will shift in more prompting and verifying
             | and testing
        
             | vasco wrote:
             | Accounting mechanization is a good example of how
             | unpredictable it can be. Initially there were armies of
             | "accountants" (what we now call bookkeepers), mostly doing
             | basic tasks of collecting data and making it fit something
             | useful.
             | 
             | When mechanization appeared, the profession split into
             | bookkeeping and accounting. Bookkeeping became a job for
             | women as it was more boring and could be paid lower
             | salaries (we're in the 1800s here). Accountants became more
             | sophisticated but lower numbers as a %. Together, both
             | professions grew like crazy in total number though.
             | 
             | So if the same happens you could predict a split between
             | software engineers and prompt engineers. With an explosion
             | in prompt engineers paid much less than software engineers.
             | 
             | > the number of accountants/book- keepers in the U.S.
             | increased from circa 54,000 workers [U.S. Census Office,
             | 1872, p. 706] to more than 900,000 [U.S. Bureau of the
             | Census, 1933, Tables 3, 49].
             | 
             | > These studies [e.g., Coyle, 1929; Baker, 1964; Rotella,
             | 1981; Davies, 1982; Lowe, 1987; DeVault, 1990; Fine, 1990;
             | Strom, 1992; Kwolek-Folland, 1994; Wootton and Kemmerer,
             | 1996] have traced the transformation of the of- fice
             | workforce (typists, secretaries, stenographers,
             | bookkeepers) from predominately a male occupation to one
             | primarily staffed by women, who were paid substantially
             | lower wages than the men they replaced.
             | 
             | > Emergence of mechanical accounting in the U.S., 1880-1930
             | [PDF download] https://www.google.com/url?sa=t&source=web&r
             | ct=j&opi=8997844...
        
               | zmgsabst wrote:
               | We're already seeing that split, between "developer" and
               | "engineer". We have been for years.
               | 
               | But that's normal, eg, we have different standards for a
               | shed (yourself), house (carpenter and architect), and
               | skyscraper (bonded firms and certified engineers).
        
               | Atotalnoob wrote:
               | Not really, I've worked at places that only had one or
               | the other of the titles for all programming jobs
        
             | RobinL wrote:
             | Agreed. The sweet spot is people who have product owner
             | skills _and_ can code. They are quickly developing
             | superpowers. The overhead of writing tickets, communicating
             | with the team and so on is huge. If one person can do it
             | all, efficiency skyrockets.
             | 
             | I guess it's always been true to some extent that single
             | individuals are capable of amazing things. For example, the
             | guy who's built https://www.photopea.com/. But they must be
             | exceptional - this empowers more people to do things like
             | that.
        
               | fakedang wrote:
               | Or people who can be product owners and can prompt LLMs
               | to code (because I know him, that's me!).
               | 
               | I'm awestruck by how good Claude and Cursor are. I've
               | been building a semi-heavy-duty tech product, and I'm
               | amazed by how much progress I've made in a week, using a
               | NextJS stack, without knowing a lick of React in the
               | first place (I know the concepts, but not the JS/NextJS
               | vocab). All the code has been delivered with proper
               | separation of concerns, clean architecture and
               | modularization. Any time I get an error, I can reason
               | with it to find the issue together. And if Claude is
               | stuck (or I'm past my 5x usage lol), I just pair
               | programme with ChatGPT instead.
               | 
               | Meanwhile Google just continues to serve me outdated shit
               | from preCovid.
        
               | hobo_in_library wrote:
               | I'm curious, with Cursor, why do you still need to use
               | Claude?
        
             | CamperBob2 wrote:
             | _The calculator didn't eliminate math majors._
             | 
             | We're not dealing with calculators here, are we?
        
             | digging wrote:
             | > The calculator didn't eliminate math majors. Excel and
             | accounting software didn't eliminate accountants and CPAs.
             | These are all just tools.
             | 
             | This just feels extremely shortsighted. LLMs are just tools
             | _right now_ , but the goal of the entire industry is to
             | make something more than a tool, an autonomous digital
             | agent. There's no equivalent concept in other technology
             | like calculators. It will happen or it will not, but we'll
             | keep getting closer every month until we achieve it or hit
             | a technical wall. And you simply _cannot_ know for sure
             | such a wall exists.
        
           | stickfigure wrote:
           | The amount of knowledge the OP needed to be even to formulate
           | the right question to the AI requires a lifetime of deep
           | immersion in technology. You'd think that maybe you can ask
           | the AI how to phrase the question to the AI but at some point
           | you run up against your ability to contextualize the problem
           | - it can't read your mind.
           | 
           | Will the AI become as smart as you or I? Recognize that these
           | things have tiny context windows. You get the context window
           | of "as long as you can remember".
           | 
           | I don't see this kind of AI replacing programmers (though it
           | probably will replace low-skill offshore contract shops). It
           | may have a large magnifying effect on skill. Fortunately
           | there seem to be endless problems to solve with software -
           | it's not like bridges or buildings; you only need (or can
           | afford) so many. Architects should probably be more worried.
        
           | startupsfail wrote:
           | The timeline to offload SWE tasks to AI is likely 5+ years.
           | So there are still some years left before the exchange of a
           | "brain on a stick" for "property and material goods" would
           | become more competitive and demanding because of direct AI
           | competition.
        
           | sbaidon94 wrote:
           | My two cents thinking about different scenarios:
           | 
           | - AI comes fast, there is nothing you can do: Honestly, AI
           | can already handle a lot of tasks faster, cheaper, and
           | sometimes better. It's not something you can avoid or
           | outpace. So if you want to stick with software engineering,
           | do it because you genuinely enjoy it, not because you think
           | it's safe. Otherwise, it might be worth considering fields
           | where AI struggles or is just not compatible. (people will
           | still want some sort of human element in certain areas).
           | 
           | - There is some sort of ceiling, gives you more time to act:
           | There's a chance AI hits some kind of wall that's due to
           | technical problems, ethical concerns, or society pushing
           | back. If that happens, we're all back on more even ground and
           | you can take advantage of AI tools to improve yourself.
           | 
           | My overall advice; and it will probably be called out as
           | cliche/simplistic just follow what you love, just the fact
           | that you have an opportunity to pursue to study anything at
           | all is something that many people don't have. We don't really
           | have control in a lot of stuff that happens around us and
           | that's okay.
        
           | rvz wrote:
           | Unlike the replies here I will be very honest with my answer.
           | There will be less engineers getting hired as the low hanging
           | fruit has already been picked and automated away.
           | 
           | It is not too late. These LLMs still need very specialist
           | software engineers that are doing tasks that are cutting edge
           | and undocumented. As others said Software Engineering is not
           | just about coding. At the end of the day, someone needs to
           | architect the next AI model or design a more efficient way to
           | train an AI model.
           | 
           | If I were in your position again, I now have a clear choice
           | of which industries are safe against AI (and benefit software
           | engineers) AND which ones NOT to get into (and are unsafe to
           | software engineers):
           | 
           | Do:                  - SRE (Site Reliability Engineer)
           | - Social Networks (Data Engineer)             - AI (Compiler
           | Engineer, Researcher, Benchmarking)             - Financial
           | Services (HFT, Analyst, Security)             - Safety
           | Critical Industries (defense, healthcare, legal,
           | transportation systems)
           | 
           | Don't:                  - Tech Writer / Journalist
           | - DevTools             - Prompt Engineer             - VFX
           | Artist
           | 
           | The choice is yours.
        
           | magicalhippo wrote:
           | If you're just there to churn out code, then yeah, perhaps
           | find something else.
           | 
           | But if you're there to improve your creativity and critical
           | thinking skills, then I don't think those will be in short
           | supply anytime soon.
           | 
           | The most valuable thing I do at my job is seldom actually
           | writing code. It's listening to customer needs, understanding
           | the domain, understanding our code-base and it's limitations
           | and possibilities, and then finding solutions that optimize
           | certain aspects be it robustness, time to delivery or
           | something else.
        
           | baq wrote:
           | If you're any good at SWE with a sprinkle of math and CS,
           | your advantage will get multiplied by anywhere from 2 to 100x
           | if you use the leverage of co-intelligence correctly. Things
           | that took weeks before now easily take hours, so if you know
           | what to build and especially what not to build (including but
           | not limited to confabulations of models), you'll do well.
        
             | margorczynski wrote:
             | But also on the other hand you'll need much less people to
             | achieve the same effect. Effectively a whole team could be
             | replaced by one lead guy that just based on the
             | requirements orders the LLM what to do and glues it
             | together.
        
               | baq wrote:
               | Yes - my point is: be that guy
        
               | grugagag wrote:
               | First many people can be that guy? If that is 5% that
               | means 95% of the rest should go.
               | 
               | Second, just because a good engineer can have much higher
               | throughput of work, multiplied by AI tools, we know the
               | AI output is not reliable and needs a second look by
               | humans. Will those 5% be able to stay on top of it? And
               | keep their sanity at the same time?
        
               | baq wrote:
               | Do not assume constant demand. There are whole classes of
               | projects which become feasible if they cash be made 10x
               | faster/cheaper.
               | 
               | As for maintaining sanity... I'm cautiously optimistic
               | that future models will continue to get better. Very
               | cautiously. But cursor with Claude slaps and I'm not
               | getting crazy, I actually enjoy the thing figuring out my
               | next actions and just suggesting them.
        
           | Gee101 wrote:
           | I'm wondering if the opposite might happen, that there will
           | be more need for software engineers.
           | 
           | 1. AI will suck up a bunch of engineers to run, maintain and
           | build on its own.
           | 
           | 2. Ai will open new fields that is not yet dominated by
           | software. Ie. Driving ect.
           | 
           | 3. Ai tools will lower the bar for creating software meaning
           | industries that weren't financially viable will now become
           | viable for software automation.
        
           | sterlind wrote:
           | Because none of your other majors will hold up much longer.
           | Once software engineering becomes fully automated, so will
           | EE, ME, applied math, economics, physics, etc. If you work
           | with your hands, like a surgeon or chemist, you'll last
           | longer, but the thinky bits of those jobs will disappear. And
           | once AI research is automated, how long will it be until we
           | have dexterous robots?
           | 
           | So basically, switching majors is just running to the back of
           | a sinking ship. Sorry.
        
           | zaptheimpaler wrote:
           | I agree there's too much cope going around. All the people
           | saying AI is just a tool to augment our jobs are correct,
           | humans are still needed but perhaps far less of them will be
           | needed. If job openings shrink by 50% or disproportionately
           | impact juniors it will hurt.
           | 
           | One decent reason to continue is that pretty much all white
           | collar professions will be impacted by this. I think it's a
           | big enough number that the powers that be will have to roll
           | it out slowly, figure out UBI or something because if all of
           | us are thrown into unemployment in a short time there will be
           | riots. Like on a scale of all the jobs that AI can replace,
           | there are many jobs that are easier to replace than software
           | so its comparatively still a better option than most. But
           | overall I'm getting progressively more worried as well.
        
             | baq wrote:
             | Juniors aren't getting hired and haven't been for about six
             | months, maybe longer. AI isn't 100% at fault... yet.
        
           | soheil wrote:
           | I think this question applies to any type of labor requiring
           | the human mind so if you don't have an answer for any of
           | those then you won't have one for software engineering
           | either.
        
           | lumost wrote:
           | LLMs perform well on small tasks that are well defined. This
           | definition matches almost every task that a student will work
           | on in school leading to an overestimation of LLM capabiity.
           | 
           | LLMs cannot decide what to work on, or manage large bodies of
           | work/code easily. They do not understand the risk of making a
           | change and deploying it to production, or play nicely in
           | autonomous settings. There is going to be a massive amount of
           | work that goes into solving these problems. Followed by a
           | massive amount of work to solve the next set of problems.
           | Software/ML engineers will have work to do for as long as
           | these problems remain unsolved.
        
             | fsndz wrote:
             | Exactly, LLMs are not near ready to fully replace software
             | engineers or any kind of knowledge workers. But they are
             | increasingly useful tools that is true.
             | https://www.lycee.ai/blog/ai-replace-software-engineer
        
             | fakedang wrote:
             | Truth is, LLMs are going to make the coding part super
             | easy, and the ceiling for shit coders like me has just
             | gotten a lot lower because I can just ask it to deliver
             | clean code to me.
             | 
             | I feel like the software developer version of an investment
             | banking Managing Director asking my analyst to build me a
             | pitch deck an hour before the meeting.
        
               | dhdisjsbshsus wrote:
               | You mentioned in another comment you've used AI to write
               | clean code, but here you mention you're a "shit coder".
               | How do you know it's giving you clean code?
        
               | fakedang wrote:
               | I know the fundamentals but I'm a noob when it comes to
               | coding with React or NextJS. Code that comes out from
               | Claude is often segregated and modularized properly so
               | that even I can follow the logic of the code, even if not
               | the language and its syntax. If there's an issue with the
               | code, causing it to fail at runtime, I am still able to
               | debug it appropriately with my minimal language of JS. If
               | any codebase can let me do that, then in my books that's
               | a great codebase.
               | 
               | Compare that to Gpt 4o which gives me a massive chunk of
               | unsorted gibberish that I have to pore through and
               | organize myself.
               | 
               | Besides, most IBD MDs don't know if they're getting
               | correct numbers either :).
        
             | spaceman_2020 wrote:
             | Careers are 30 years long
             | 
             | Can you confidently say that an LLM won't be better than an
             | average 22 year old coder within these 30 years?
        
               | HappMacDonald wrote:
               | Careers have failed to be 30 years long for a lot longer
               | than 30 years now. That's one of the reasons that 4-year
               | colleges have drastically lost their ROI, the other blade
               | of those scissors being the stupendously rising tuition.
               | AI is nothing but one more layer in the constantly
               | growing substrate of computing technology a coder has to
               | learn how to integrate into their toolbelts. Just like
               | the layers that came before it: mobile, virtualization,
               | networking, etc.
        
               | snowwrestler wrote:
               | Careers are still longer than 30 years. How many people
               | do you think are retiring at 48 or 51 years old these
               | days? It's a small minority. Most people work through 65:
               | a career of about 45 years or more.
        
               | taco_emoji wrote:
               | yes, because this is still glorified autocomplete
        
               | spaceman_2020 wrote:
               | the average coder is worse than an autocomplete
               | 
               | Too many people here have spent time in elite
               | corporations and don't realize how mediocre the bottom
               | 50th percentile of coding talent is
        
               | weweweoo wrote:
               | To be honest, if the bottom 50th percent of coding talent
               | is going to be obsolete, I wonder what happens to rest of
               | the "knowledge workers" in those companies. I mean people
               | whose jobs consist of attending Teams meetings, making
               | fancy powerpoint slides and reports, perhaps even excel
               | if they are really competent. None of that is any more
               | challenging for LLM than writing code. In fact replacing
               | these jobs should be easier, since presentations and
               | slides do not actually do anything, unlike a program that
               | must perform a certain action correctly.
        
               | cs391231 wrote:
               | I've heard compelling arguments that we passed the "more
               | people than jobs" threshold during the green revolution
               | and as a civilization have collectively retrofitted UBI
               | in the form of "fake email jobs" and endless layers of
               | management. This also would explain
               | https://wtfhappenedin1971.com/ pretty well.
               | 
               | Either AI shatters this charade, or we make up some new
               | laws to restrain it and continue to pretend all is well.
        
               | smaudet wrote:
               | Exactly. There's some need, perhaps, to keep these tools
               | "up to date" because someone in a non-free country is
               | going to use them in a horrendous manner and we should
               | maybe know more about them (maybe).
               | 
               | However, there is no good reason in a free society that
               | this stuff should be widely accessible. Really, it should
               | be illegal without a clearance, or need-to-know. We don't
               | let just anyone handle the nukes...
        
               | nyarlathotep_ wrote:
               | This is true and yet companies (both Private and Public
               | sector) spend literal billions on Accenture /Deloitte
               | slop that runs budgets will into the 10s of millions.
               | 
               | Skills aren't even something that dictates software
               | spend, it seems.
        
               | fullstop wrote:
               | I tried it out and was able to put together a decent
               | libevent server in c++ with smart pointers, etc, and a
               | timer which prints out connection stats every 30s. It
               | worked remarkably well.
               | 
               | I'm trying not to look at it as a potential career-ending
               | event, but rather as another tool in my tool belt. I've
               | been in the industry for 25 years now, and this is _way_
               | more of an advancement than things like IntelliSense ever
               | was.
        
               | rowanG077 wrote:
               | Huh careers are 30 years long? I don't know where you
               | live but it's more like 45 years long where I live. The
               | retirement age is 67.
        
               | littlestymaar wrote:
               | > Can you confidently say that an LLM won't be better
               | than an average 22 year old coder within these 30 years?
               | 
               | No 22 years old coder is better than the open source
               | library he's using taken straight from github, and yet
               | he's the one who's getting paid for it.
               | 
               | People who claim IA will disrupt software development are
               | just missing the big picture here: software jobs are
               | already unrecognizable from what it was just 20 years
               | ago. AI is just another tool, and as long as execs won't
               | bother use the tool by themselves, then they'll pay
               | developers to do it instead.
               | 
               | Over the past decades, writing code has become more and
               | more efficient (better programming languages, better
               | tooling, then enormous open source libraries) yet the
               | number of developers kept increasing, it's Jevons
               | paradox[1] in its purest form. So if past tells us
               | anything, is that AI is going to create many new software
               | developer jobs! (because the amount of people able to
               | ship significant value to a customer is going to
               | skyrocket, and customers' needs are a renewable
               | resource).
               | 
               | [1]: https://en.wikipedia.org/wiki/Jevons_paradox
        
           | furyofantares wrote:
           | Because you're being given superpowers and computers are
           | becoming more useful than ever.
        
           | RobinL wrote:
           | I am cautiously optimistic. So much of building software is
           | deciding what _should_ be built rather than the mechanics of
           | writing code.
           | 
           | I you like coding because of the things it lets you build,
           | then LLMs are exciting because you can build those things
           | faster.
           | 
           | If on the other hand you enjoy the mental challenge but
           | aren't interested in the outputs, then I think the future is
           | less bright for you.
           | 
           | Personally I enjoy coding for both reasons, but I'm happy to
           | sacrifice the enjoyment and sense of accomplishment of
           | solving hard problems myself if it means I can achieve more
           | 'real world' outcomes.
           | 
           | Another thing I'm excited about is that, as models improve,
           | it's like having an expert tutor on hand at all times. I've
           | always wanted an expert programmer on hand to help when I get
           | stuck, and to critically evaluate my work and help me
           | improve. Increasingly, now I have one.
        
           | pdpi wrote:
           | Even if LLMs take over the bulk of programming work, somebody
           | still needs to write the prompts, and make sure the output
           | actually matches what you wanted to achieve. That's just
           | programming with different tools.
        
           | joshstrange wrote:
           | Software engineering teaches you a set of skills that are
           | applicable in more places than just writing software. There
           | are big parts of the job that cannot be done by LLMs (today)
           | and if LLMs get better (or AGI happens) then enough other
           | professions will be affected that we will all be in the same
           | boat (no matter what you major in).
           | 
           | LLMs are just tools, they help but they do not replace
           | developers (yet).
        
             | onemoresoop wrote:
             | > LLMs are just tools, they help but they do not replace
             | developers (yet)
             | 
             | Yes but they will certainly have a lot of downward pressure
             | on salaries for sure.
        
           | epcoa wrote:
           | The "progress" demonstrated in this example is to literally
           | just extract bytes from the middle of a number:
           | 
           | Is this task:
           | 
           | "About 2 minutes later, these values were captured, again
           | spaced 5 seconds apart.
           | 
           | 0160093201 0160092d01 0160092801 0160092301 0160091e01"
           | 
           | [Find the part that is changing]
           | 
           | really even need an AI to assist (this should be a near
           | instant task for a human with basic CS numerical skills)? If
           | this is the type of task one thinks an AI would be useful for
           | they are likely in trouble for other reasons.
           | 
           | Also notable that you can cherry pick more impressive feats
           | even from older models, so I don't necessarily think this
           | proves progress.
           | 
           | I still wouldn't get too carried away just yet.
        
             | blibble wrote:
             | I put his value into my hex editor and it instantly showed
             | 900 in the data inspector pane
        
           | UncleOxidant wrote:
           | Semi-retired software/hardware engineer here. After my recent
           | experiences with various coding LLMs (similar to the
           | experience of the OP with the bluetooth fan protocol) I'm
           | really glad I'm in a financial position such that I'm able to
           | retire. The progress of these LLMs at coding has been
           | astonishing over the last 18 months. Will they entirely
           | replace humans? No. But as they increase programmer
           | productivity fewer devs will be required. In my case the
           | contract gig I was doing over this last summer I was able to
           | do about 3 to 4X faster than I could've done it without LLMs.
           | Yeah, they were generating a lot of boiler plate HDL code for
           | me, but that still saved me several days of work at least.
           | And then there was the test code that they generated which
           | again saved me days of work. And their ability to explain old
           | undocumented code that was part of the project was also
           | extremely helpful. I was skeptical 18 months ago that any of
           | this would be possible. Not anymore. I wasn't doing a project
           | in which there would've been a lot of training examples.
           | We're talking Verilog testbench generation based on multiple
           | input Verilog modules, C++ code generation for a C program
           | analyzer using libclang - none of this stuff would've worked
           | just a few months back.
        
             | DataDive wrote:
             | I will add that I am grateful that I also got to experience
             | a world where AI did not spew tons of code like a sausage-
             | making machine.
             | 
             | It was so satisfying to code up a solution where you knew
             | you would get through it little by little.
        
               | niemal_dev wrote:
               | This.
        
             | cs391231 wrote:
             | This. I'm not terrified by total automation (In that case
             | all jobs are going away and civilization is going to
             | radically alter), I'm scared of selective deskilling and
             | the field getting squeezed tighter and tighter leaving me
             | functionally in a dead end.
        
           | rachofsunshine wrote:
           | Hey, kid.
           | 
           | My name is Rachel. I'm the founder of company whose existence
           | is contingent on the continued existence, employment, and
           | indeed _competitive_ employment of software engineers, so I
           | have as much skin in this game as you do.
           | 
           | I worry about this a lot. I don't know what the chances are
           | that AI wipes out developer jobs [EDIT: to clarify, in the
           | sense that they become either much rarer or much lower-paid,
           | which is sufficient] within a timescale relevant to my work
           | (say, 3-5 years), but they aren't zero. Gun to my head, I peg
           | that chance at perhaps 20%. That makes me more bearish on AI
           | than the typical person in the tech world - Manifold thinks
           | AI surpasses human researchers by the end of 2028 at 48% [1],
           | for example - but 20% is most certainly not zero.
           | 
           | That thought stresses me out. It's not just an existential
           | threat to my business over which I have no control, it's a
           | threat against which I cannot realistically hedge and which
           | may disrupt or even destroy my life. It bothers me.
           | 
           | But I do my work anyway, for a couple of reasons.
           | 
           | One, progress on AI in posts like this is always going to be
           | inflated. This is a marketing post. It's a post OpenAI wrote,
           | and posted, to generate additional hype, business, and
           | investment. There is some justified skepticism further down
           | this thread, but even if you couldn't find a _reason_ to be
           | skeptical, you ought to be skeptical _by default_ of such
           | posts. I am an abnormally honest person by Silicon Valley
           | founder standards, and even I cherry pick my marketing blogs
           | (I just don 't outright make stuff up for them).
           | 
           | Two, if AI surpasses a good software engineer, it probably
           | surpasses just about everything else. This isn't a guarantee,
           | but good software engineering is already one of the more
           | challenging professions for humans, and there's no particular
           | reason to think progress would stop exactly at making SWEs
           | obsolete. So there's no good alternative here. There's no
           | other knowledge work you could pivot to that would be a
           | decent defense against what you're worried about. So you may
           | as well play the hand you've got, even in the knowledge that
           | it might lose.
           | 
           | Three, in the world where AI _does_ surpass a good software
           | engineer, there 's a decent chance it surpasses a good _ML_
           | engineer in the near future. And once it does that, we 're in
           | completely uncharted territory. Even if more extreme
           | singularity-like scenarios don't come to pass, it doesn't
           | need to be a singularity to become significantly superhuman
           | to the point that almost nothing about the world in which we
           | live continues to make any sense. So again, you lack any good
           | alternatives.
           | 
           | And four: *if this is the last era in which human beings
           | matter, I want to take advantage of it!* I may be among the
           | very last entrepreneurs or businesswomen in the history of
           | the human race! If I don't do this now, I'll never get the
           | chance! If you want to be a software engineer, do it now,
           | because you might never get the chance again.
           | 
           | It's totally reasonable to be scared, or stressed, or
           | uncertain. Fear and stress and uncertainty are parts of life
           | in far less scary times than these. But all you can do is
           | play the hand you're dealt, and try not to be totally
           | miserable while you're playing it.
           | 
           | -----
           | 
           | [1] https://manifold.markets/Royf214/will-ai-surpass-humans-
           | in-c...
        
           | huuhee3 wrote:
           | I think CS skills will remain valuable, but you should try to
           | build some domain specific knowledge in addition. Perhaps
           | programmer roles will eventually merge with product owner /
           | business person type of roles.
        
           | mindcrime wrote:
           | I don't want to lean into negativity here, and I'm far from
           | an "AI Doomer".
           | 
           | But... I will say I think the question you ask is a very fair
           | question, and that there is, indeed, a LOT of uncertainty
           | about what the future holds in this regard.
           | 
           | So far the best reason we have for optimism is history: _so
           | far_ the old adage has held up that  "technology does destroy
           | some jobs, but on balance it creates more new ones than it
           | destroys." And while that's small solace to the buggy-whip
           | maker or steam-engine engineer, things tend to work out in
           | the long-run. However... history is suggestive, but far from
           | conclusive. There is the well known "problem of induction"[1]
           | which points out that we can't make definite predictions
           | about the future based on past experience. And when those
           | expectations are violated, we get "black swan events"[2]. And
           | while they be uncommon, they do happen.
           | 
           | The other issue with this question is, we don't really know
           | what the "rate of change" in terms of AI improvement is. And
           | we definitely don't know the 2nd derivative (acceleration).
           | So a short-term guess that "there will be a job for you in 1
           | year's time" is probably a fairly safe guess. But as a
           | current student, you're presumably worried about 5 years, 10
           | years, 20 years down the line and whether or not you'll still
           | have a career. And the simple truth is, we can't be sure.
           | 
           | So what to do? My gut feeling is "continue to learn software
           | engineering, but make sure to look for ways to broaden your
           | skill base, and position yourself to possibly move in other
           | directions in the future". Eg, don't focus on just becoming a
           | skilled coder in a particular language. Learn fundamentals
           | that apply broadly, and - more importantly - learn about
           | _how_ business work, learn  "people skills"[3], develop
           | domain knowledge in one or more domains, and generally learn
           | as much as you can about "how the world works". Then from
           | there, just "keep your head on a swivel" and stay aware of
           | what's going on around you and be ready to make adjustments
           | as needed.
           | 
           | It might not also hurt to learn a thing or two about
           | something that requires a physical presence (welding, etc.).
           | And just in case a full-fledged cyberpunk dystopia
           | develops... maybe start buying an extra box or two of
           | ammunition every now and then, and study escape and evasion
           | techniques, yadda yadda...
           | 
           | [1]: http://en.wikipedia.org/wiki/Problem_of_induction
           | 
           | [2]: https://en.wikipedia.org/wiki/Black_swan_theory
           | 
           | [3]: https://www.youtube.com/watch?v=hNuu9CpdjIo
        
           | algebra-pretext wrote:
           | If you're going for FAANG most of your day isn't coding
           | anyway.
        
           | noshitsherlock wrote:
           | There's so much software yet to to be written, so much to
           | automate, so many niches to attack that you need not worry.
           | It takes humans to know where to apply the technology based
           | on their heart, not brains. Use AI in the direction only you
           | can ascertain; and do it for the good of HUMANITY. It's a
           | tool that makes the knowledge posterity has left us
           | accessible, like mathematics. Go forth an conquer life's ills
           | young man; It takes a human to know one. Don't worry, you're
           | created in God's image.
        
             | noshitsherlock wrote:
             | Machines don't really "know" anything they just manipulate
             | what is already known; Like a interactive book. It's just
             | that this AI book is vast.
        
             | noshitsherlock wrote:
             | And the knowledge acquisition impedance is reduced
        
           | smegger001 wrote:
           | plumbing still looks like a safe choice for now.
        
           | IncreasePosts wrote:
           | What kind of student, at what kind of school?
           | 
           | Are your peers getting internships at FANGs or hedge funds?
           | Stick with it. You can probably bank enough money to make it
           | worth it before shtf.
        
           | orzig wrote:
           | Three, though not slam dunks:
           | 
           | 1. What other course of study are you confident would be
           | better given an AI future? If there's a service sector job
           | that you feel really called to, I guess you could shadow
           | someone for a few days to see if you'd really like it?
           | 
           | 2. Having spent a few years managing business dashboards for
           | users, less than 25% ever routinely used the "user friendly"
           | functionality we built to do semi-custom analysis. We needed
           | 4 full time analytics engineers to spend at least half their
           | time answering ad hoc questions that could have been self-
           | served, despite an explicit goal of democratizing data. All
           | that is to say; don't over estimate how quickly this will be
           | taken up, even if it could technically do XYZ task
           | (eventually, best-of-10) if prompted properly.
           | 
           | 3. I don't know where you live, but I've spent most of my
           | career 'competing' with developers in India who are paid
           | 33-50% as much. They're literally teammates, it's not a
           | hypothetical thing. And they've never stopped hiring in the
           | US. I haven't been in the room for those decisions and don't
           | want to open that can of worms here, but suffice to say it's
           | not so simple as "cheaper per LoC wins"
        
           | spaceman_2020 wrote:
           | I honestly think that unless you're really passionate or
           | really good, you shouldn't be a coder. If you, like the vast
           | majority of coders today, picked it up in college or later,
           | and mostly because of the promise of a fat paycheck, I can't
           | really see a scenario where you would have a 30 year career
        
           | stale2002 wrote:
           | Sure. Software engineers are actually the best situated to
           | take advantage of this new technology.
           | 
           | Your concern would be like once C got invented, why should
           | you bother being a software engineer? Because C is so much
           | easier to use than assembly code!
           | 
           | The answer, of course, is that software engineering will
           | simply happen in even more powerful and abstract layers.
           | 
           | But, you still might need to know how those lower layers
           | work, even if you are writing less code in that layer
           | directly.
        
             | xcv123 wrote:
             | C did not write itself.
             | 
             | We now have a tool that writes code and solves problems
             | autonomously. It's not comparable.
        
           | almostuseful wrote:
           | If at some point a competent senior software engineer can be
           | automated away, I think we are so close to a possible 'AI
           | singularity' in as much as that concept makes sense, that
           | nothing really matters anyway.
           | 
           | I don't know what will be automated first of the competent
           | senior software engineer and say, a carpenter, but once the
           | programmer has been automated away, the carpenter (and
           | everything else) will follow shortly.
           | 
           | The reasoning is that there is such a functional overlap
           | between being a standard software engineer and an AI engineer
           | or researcher, that once you can automate one, you can
           | automate the other. Once you have automated the AI engineers
           | and researchers, you have recursive self-improving AI and all
           | bets are off.
           | 
           | Essentially, software engineering is perhaps the only field
           | where you shouldn't worry about automation, because once that
           | has been automated, everything changes anyways.
        
           | jstummbillig wrote:
           | One angle: There are a million SMBs and various other
           | institutions, using none or really shitty software, that
           | could be xx% to xxx% times more productive with custom
           | software that they would never have been able to afford
           | before. Now they can, en masse, because you will be able to
           | built it a lot faster.
           | 
           | I have been coding a lot with AI recently. Understanding and
           | putting into thought what is needed for the program to fix
           | your problem remains as complex and difficult as ever.
           | 
           | You need to pose a question for the AI to do something for
           | you. Asking a good question is out of reach for a lot of
           | people.
        
           | agentultra wrote:
           | There's an equal amount of hopium from the AI stans here as
           | well.
           | 
           | Hundreds of billions of dollars have been invested in a
           | technology and they need to find a way to start making a
           | profit or they're going to run out of VC money.
           | 
           | You still have to know what to build and how to specify what
           | you want. Plain language isn't great at being precise enough
           | for these things.
           | 
           | Some people say they'll keep using stuff like this as a tool.
           | I wouldn't bet the farm that it's going to replace humans at
           | any point.
           | 
           | Besides, programming is fun.
        
           | ayakang31415 wrote:
           | If AI becomes good enough to replace software engineers, it
           | has already become good enough to replace other brain jobs
           | (lawyers, physicians, accountant, etc). I feel that software
           | engineering is one of the very last jobs to be replaced by
           | AI.
        
           | bashfulpup wrote:
           | There is little to no research that shows modern AI can
           | perform even the most simple long-running task without
           | training data on that exact problem.
           | 
           | To my knowledge, there is no current AI system that can
           | replace a white collar worker in any multistep task. The only
           | thing they can do is support the worker.
           | 
           | Most jobs are safe for the forseable future. If your job is
           | highly repetitive and a company can produce a perfect dataset
           | of it, I'd worry.
           | 
           | Jobs like a factory worker and call center support are in
           | danger. But the work is perfectly monitorable.
           | 
           | Watch the GAIA benchmark. It's not nearly the complexity of a
           | real-world job, but it would signal the start of an actual
           | agentic system being possible.
        
             | baq wrote:
             | I'd argue the foreseeable future got a lot shorter in the
             | last couple years.
        
           | morningsam wrote:
           | As soon as software development can be fully performed by
           | AIs, it won't take long before all other jobs that can be
           | performed in front of a computer follow, and after that it
           | probably won't take long for practically the entire rest.
           | 
           | This release has shifted my personal prediction of when this
           | is going to happen further into the future, because OpenAI
           | made a big deal hyping it up and it's nothing - preferred by
           | humans over GPT-4o only a little more than half the time.
        
           | _proofs wrote:
           | just because something can generate an output for you, does
           | not make a need for _discernment and application_ obsolete.
           | 
           | like another commenter, i do not have a lot of faith, that
           | people who do not have at minimum: fundamental fluency in
           | programming (even with a dash of general software
           | architecture and practices).
           | 
           | there is no "push button generate and glueing components
           | together in a way that can survive at scale and be
           | maintainable" without knowing what the output means, and
           | implies with respect to integration(s).
           | 
           | however, those with the fluency, domain, and experience, will
           | thrive, and continue thriving.
        
           | taco_emoji wrote:
           | This is not going to replace you. This isn't AGI.
        
           | mmckelvy wrote:
           | As others have said, LLMs still require engineers to produce
           | quality output. LLMs do, however, make those engineers that
           | use them much more productive. If this trend continues, I
           | could see a scenario where an individual engineer could build
           | a customized version of, say, Salesforce in a month or two.
           | If that happens, you could make a solid case that companies
           | paying $1mm+ per year for 12 different SaaS tools should just
           | bring that in house. The upshot is you may still be writing
           | software, but instead of building SaaS at Salesforce, you'll
           | be working for their former customers or maybe as some sort
           | of contractor.
        
           | bunderbunder wrote:
           | Actually cutting code is _maybe_ 10% of the job, and LLMs are
           | absolute crap at the other 90%.
           | 
           | They can't build and maintain relationships with
           | stakeholders. They can't tell you why what you ask them to do
           | is unlikely to work out well in practice and suggest
           | alternative designs. They can't identify, document and
           | justify acceptance criteria. They can't domain model. They
           | can't architect. They can't do large-scale refactoring. They
           | can't do system-level optimization. They can't work with that
           | weird-ass code generation tool that some hotshot baked deeply
           | into the system 15 years ago. They can't figure out why that
           | fence is sitting out in the middle of the field for no
           | obvious reason. etc.
           | 
           | If that kind of stuff sounds like satisfying work to you, you
           | should be fine. If it sounds terrible, you should pivot away
           | now regardless of any concerns about LLMs, because, again,
           | this is like 90% of the real work.
        
           | xivzgrev wrote:
           | Play it out
           | 
           | Let's assume today a LLM is perfectly equivalent to a junior
           | software engineer. You connect it to your code base, load in
           | PRDs / designs, ask it to build it, and viola perfect code
           | files
           | 
           | 1) Companies are going to integrate this new technology in
           | stages / waves. It will take time for this to really get
           | broad adoption. Maybe you are at the forefront of working
           | with these models
           | 
           | 2) OK the company adopts it and fires their junior engineers.
           | They start deploying code. And it breaks Saturday evening.
           | Who is going to fix it? Customers are pissed. So there's lots
           | to work out around support.
           | 
           | 3) That problem is solved, we can perfectly trust a LLM to
           | ship perfect code that never causes downstream issues and
           | perfectly predicts all user edge cases.
           | 
           | Never underestimate the power of corporate greediness.
           | There's generally two phases of corporate growth - expansion
           | and extraction. Expansion is when they throw costs out the
           | window to grow. Extraction is when growth stops, and they
           | squeeze customers & themselves.
           | 
           | AI is going to cause at least a decade of expansion. It opens
           | up so many use cases that were simply not possible before,
           | and lots of replacement.
           | 
           | Companies are probably not looking at their engineers looking
           | to cut costs. They're more likely looking at them and saying
           | "FINALLY, we can do MORE!"
           | 
           | You won't be a coder - you'll be a LLM manager / wrangler.
           | You will be the neck the company can choke if code breaks.
           | 
           | Remember if a company can earn 10x money off your salary,
           | it's a good deal to keep paying you.
           | 
           | Maybe some day down the line, they'll look to squeeze
           | engineers and lay some off, but that is so far off.
           | 
           | This is not hopium, this is human nature. There's gold in
           | them hills.
           | 
           | But you sure as shit better be well versed in AI and using in
           | your workflows - the engineers who deny it will be the ones
           | who fall behind
        
           | theptip wrote:
           | If you want to get a career in software engineering because
           | you want to write code all day, probably a bad time to be
           | joining the field.
           | 
           | If you are interested in using technology to create systems
           | that add value for your users, there has never been a better
           | time.
           | 
           | GPT-N will let you scale your impact way beyond what you
           | could do on your own.
           | 
           | Your school probably isn't going to keep abreast with this
           | tech so it's going to be more important to find side-projects
           | to exercise your skills. Build a small project, get some
           | users, automate as much as you can, and have fun along the
           | way.
        
           | rapind wrote:
           | Here you go:
           | 
           | I just watched a tutorial on how to leverage v1, claude, and
           | cursor to create a marketing page. The result was a
           | convoluted collection of 20 or so TS files weighing a few MB
           | instead of a 5k HTML file you could hand bomb in less time.
           | 
           | I wouldn't feel too threatened yet. It's still just a tool
           | and like any tool, can be wielded horribly.
        
             | CamperBob2 wrote:
             | _I just watched a tutorial on how to leverage v1, claude,
             | and cursor to create a marketing page. The result was a
             | convoluted collection of 20 or so TS files weighing a few
             | MB instead of a 5k HTML file you could hand bomb in less
             | time._
             | 
             | And if you hired an actual team of developers to do the
             | same thing, it is very likely that you'd have gotten a
             | convoluted collection of 20 or so TS files weighing a few
             | MB instead of a 5k HTML file you could hand bomb in less
             | time.
        
           | weweweoo wrote:
           | What's the alternative? If AI is going to replace software
           | engineers, there is no fundamental reason they couldn't
           | replace almost all other knowledge workers as well. No matter
           | the field, most of it is just office work managing,
           | transforming and building new information, applying existing
           | knowledge on new problems (that probably are not very unique
           | in grand scheme of things).
           | 
           | Except for medical doctors, nurses, and some niche
           | engineering professions, I really struggle to think of jobs
           | requiring higher education that couldn't be largely automated
           | by an LLM that is smart enough to replace a senior software
           | engineer. These few jobs are protected mainly by the physical
           | aspect, and low tolerance for mistakes. Some skilled trades
           | may also be protected, at least if robotics don't improve
           | dramatically.
           | 
           | Personally, I would become a doctor if I could. But of all
           | things I could've studied excluding that, computer science
           | has probably been one of the better options. At least it
           | teaches problem solving and not just memorization of facts.
           | Knowing how to code may not be that useful in the future, but
           | the process of problem solving is going nowhere.
        
             | pid-1 wrote:
             | Why can't medical doctors be automated?
        
               | weweweoo wrote:
               | Mainly the various physical operations many of them
               | perform on daily basis (due to limitations of robotics),
               | plus liability issues in case things go wrong and
               | somebody dies. And finally, huge demand due to aging
               | population worldwide.
               | 
               | I do believe some parts of their jobs will be automated,
               | but not enough (especially with growing demand) to really
               | hurt career prospects. Even for those parts, it will take
               | a long a while due to the regulated nature of the sector.
        
           | gensym wrote:
           | 1. The demand for software is insatiable. The biggest gate
           | has been the high costs due to limited supply of the time of
           | the people who know how to do it. In the near term, AI will
           | make the cost of software (not of software devs, but the
           | software itself) decrease while demand for new software will
           | increase, especially as software needs to be created to take
           | advantage of new UI tools.
           | 
           | I've been in software engineering for over 20 years. I've
           | seen massive growth in the productivity of software
           | engineers, and that's resulted in greater demand for them. In
           | the near term, AI should continue this trend.
           | 
           | 2. It's possible that at some point, AI will advance to where
           | we can remove software engineers from the loop. We're not
           | even close to that point yet. In the mean time, software
           | engineering is an excellent way to learn about other business
           | problems so that you'll be well-situated to address them
           | (whatever they'll be at that time).
        
           | CuriouslyC wrote:
           | It still has issues with crossing service boundaries, working
           | in systems, stuff like that. That stuff will get better but
           | the amount of context you need to load to get good results
           | with a decently sized system will still be prohibitive. The
           | software engineer skillset is being devalued but architecture
           | and systems thinking is still going to be valuable for quite
           | some time.
        
           | p1necone wrote:
           | For basically all the existing data we have, efficiency
           | improvements _always_ result in _more_ work, not less.
           | 
           | Humans never say "oh neat I can do _thing_ with 10% of the
           | effort now, guess I 'll go watch tv for the rest of the
           | week", they say "oh neat I can do _thing_ with 10% of the
           | effort now, I 'm going to hire twice as many people and
           | produce like 20x as much as I was before because there's so
           | much less risk to scaling now."
           | 
           | I think there's enough unmet demand for software that
           | efficiency increases from automation are going to be eaten up
           | for a long time to come.
        
           | MattDaEskimo wrote:
           | Most of these posts are from romantics.
           | 
           | Software engineering will be a profession of the past,
           | similar to how industrial jobs hardly exist.
           | 
           | If you have a strong intuition with software & programming
           | you may want to shift towards applying AI into already
           | existing solutions.
        
             | weweweoo wrote:
             | The question is, why wouldn't nearly all other white collar
             | jobs be professions of the past as well? Does the average
             | MBA or whatever possess some unique knowledge that you
             | couldn't generate with an LLM fed with company data? What
             | is the alternative career path?
             | 
             | I think software engineers who also understand business may
             | yet have an advantage over pure business people, who don't
             | understand technology. They should be able to tell AI what
             | to do, and evaluate the outcome. Of course "coders" who
             | simply produce code from pre-defined requirements will
             | probably not have a good career.
        
               | MattDaEskimo wrote:
               | They will be of the past.
               | 
               | This is typical of automation. First, there are numerous
               | workers, then they are reduced to supervisors, then they
               | are gone.
               | 
               | The future of business will be managing AI, so I agree
               | with what you're saying. However most software engineers
               | have a very strong low level understanding of
               | programming. Not a business sense of application
        
           | CamperBob2 wrote:
           | To fix the robots^W^W^Wbuild these things.
           | 
           | I've been around for multiple decades. Nothing this
           | interesting has happened since at least 1981, when I first
           | got my hands on a TRS-80. I dropped out of college to work on
           | games, but these days I would drop out of college to work on
           | ML.
        
           | eitally wrote:
           | While the reasoning and output of ChatGPT is impressive (and,
           | imho, would pass almost all coding interviews), I'm primarily
           | impressed with the logical flow, explanation and
           | thoroughness. The actual coding and problem solving isn't
           | complex, and that gets to your question: someone (in this
           | case, the OP), still needed to be able to figure out how to
           | extract useful data and construct a stimulating prompt to
           | trigger the LLM into answering in this way. As others have
           | posted, none of the popular LLMs behave identically, either,
           | so becoming an exert tool-user with one doesn't necessarily
           | translate to the next.
           | 
           | I would suggest the fundamentals of computer science and
           | software engineering are still critically important ... but
           | the development of new code, and especially the translation
           | or debugging of existing code is where LLMs will shine.
           | 
           | I currently work for an SAP-to-cloud consulting firm. One of
           | the singlemost compelling use cases for LLMs in this area is
           | to analyze custom code (running in a client's SAP
           | environment), and refactor it to be compatible with current
           | versions of SAP as a cloud SaaS. This is a specialized domain
           | but the concept applies broadly: pick some crufty codebase
           | from somewhere, run it through an LLM, and do a lot of mostly
           | copying & pasting of simpler, modern code into your new
           | codebase. LLMs take a lot of the drudgery out of this, but it
           | still requires people who know what they're looking at, and
           | _could_ do it manually. Think of the LLM as giving you an
           | efficiency superpower, not replacing you.
        
           | holoduke wrote:
           | Software development just becomes a level tier higer for most
           | developers. Instead of writing everything yourself you will
           | be more like an orchestrator. Tell the system to write this,
           | tell the system to connect that and this etc. You still need
           | to understand code. But maybe in the future even that part
           | becomes unreadable for us. We only understand the high level
           | concepts.
        
           | andrewchambers wrote:
           | I don't think programming is any less safe than any other
           | office job tbh. Focus on problem solving and using these
           | tools to your advantage and choose a field you enjoy.
        
           | rdevsrex wrote:
           | Just because we have machines that can lift much more than
           | any human ever could, it doesn't mean that working out is
           | useless.
           | 
           | In the same way, training your mind is not useless. Perhaps
           | as things develop, we will get back to the idea that the
           | purpose of education is not just to get a job, but to help
           | you become a better and more virtuous person.
        
         | romeros wrote:
         | is it better than Claude?
        
           | bartman wrote:
           | Neither Sonnet nor Opus could solve it or get close in a
           | minimal test I did just now, using the same prompt as above.
           | 
           | Sonnet: https://pastebin.com/24QG3JkN
           | 
           | Opus: https://pastebin.com/PJM99pdy
        
           | hmottestad wrote:
           | I think this new model is a generational leap above Claude
           | for tasks that require complex reasoning.
        
         | antman wrote:
         | second is very blurry
        
           | bartman wrote:
           | When you click on the image it loads a higher res version.
        
         | jazzyjackson wrote:
         | Isn't there a big "Share" button at the top right of the
         | chatgpt interface? Or are you using another front end?
        
           | bartman wrote:
           | In ChatGPT for Business it limits sharing among users in my
           | org, without an option for public sharing.
        
           | fshbbdssbbgdd wrote:
           | I often click on those links and get an error that they are
           | unavailable. I'm not sure if it's openAI trying to prevent
           | people from sharing evidence of the model behaving badly, or
           | an innocuous explanation like the links are temporary.
        
             | arunv wrote:
             | They were probably generated using a business account, and
             | the business does not allow public links.
        
             | coder543 wrote:
             | The link also breaks if the original user deletes the chat
             | that was being linked to, whether on purpose or without
             | realizing it would also break the link.
        
           | OutOfHere wrote:
           | Even for regular users, the Share button is not always
           | available or functional. It works sometimes, and other times
           | it disappears. For example, since today, I have no Share
           | button at all for chats.
        
             | JieJie wrote:
             | My share chat link moved into the sidebar in the ... menu
             | to the right of each chat title (MacOS Safari).
        
               | OutOfHere wrote:
               | Ah, I see it there now. Thanks.
        
         | baal80spam wrote:
         | Thanks for sharing this, incredible stuff.
        
         | GaggiX wrote:
         | Did you edit the message? I cannot see anything now in the
         | screenshot, too low resolution
        
           | bartman wrote:
           | You need to click on the image for the high res version to
           | load. Sorry, it's awkward.
        
             | GaggiX wrote:
             | The website seems to redirect me to a low resolution image,
             | the first time I clicked on the link it worked as you are
             | saying.
        
         | fwip wrote:
         | What's the incredible part here? Being able to write code to
         | turn hex into decimal?
        
           | fwip wrote:
           | Also, if you actually read the "chain of thought" contains
           | several embarrassing contradictions and incoherent sentences.
           | If a junior developer wrote this analysis, I'd send them back
           | to reread the fundamentals.
        
             | CooCooCaCha wrote:
             | What about thoughts themselves? There are plenty of times I
             | start a thought and realize it doesn't make sense. It's
             | part of the thinking process.
        
         | soheil wrote:
         | Great progress, I asked GPT-4o and o1-preview to create a
         | python script to make $100 quickly, o1 came up with a very
         | interesting result:
         | 
         | https://x.com/soheil/status/1834320893331587353
        
         | fsndz wrote:
         | > Asking the models to determine if my code is equivalent to
         | what they reverse engineered resulted in a nuanced and thorough
         | examination, and eventual conclusion that it is equivalent.
         | 
         | Did you actually implement to see if it works out of the box ?
         | 
         | Also if you are a free users or accepted that your chats should
         | be used for training then maybe o1 is was just trained on your
         | previous chat and so now knows how to reason about that
         | particular type of problems
        
           | bongodongobob wrote:
           | That's not how LLM training works.
        
             | fsndz wrote:
             | so it is impossible to use the free user chats to train
             | models ??????
        
           | bartman wrote:
           | That is an interesting thought. This was all done in an
           | account that is opted out of training though.
           | 
           | I have tested the Python code o1 created to decode the
           | timestamps and it works as expected.
        
         | jeffpeterson wrote:
         | Very cool. It gets the conclusion right, but it did confuse
         | itself briefly after interpreting `256 * last_byte +
         | second_to_last_byte` as big-endian. It's neat that it corrected
         | the confusion, but a little unsatisfying that it doesn't
         | explicitly identify the mistake the way a human would.
        
         | avodonosov wrote:
         | The screenshot [1] is not readable for me. Chrome, Android.
         | It's so blurry that I cant recognize a single character. How do
         | other people read it? The resolution is 84x800.
        
           | rovr138 wrote:
           | When I click on the image, it expands to full res,
           | 1713x16392.3
        
             | deathanatos wrote:
             | > _it expands to full res, 1713x16392.3_
             | 
             | Three tenths of a pixel is an _interesting_ resolution...
             | 
             | (The actual res is 1045 x 10000 ; you've multiplied by
             | 1.63923 somehow...?)
        
               | rovr138 wrote:
               | I agree,
               | 
               | But it's what I got when I went to Inspect element >
               | hover over the image
               | 
               | Size it expanded to vs real image size I guess
        
               | Jerrrrrrry wrote:
               | Pixels have been "non-real" for a long time.
        
           | mikebridgman wrote:
           | Click on it for full resolution
        
           | smusamashah wrote:
           | When you open on phone, switch to "desktop site" via browser
           | three dots menu
        
           | daemonologist wrote:
           | Direct link to full resolution:
           | https://i.postimg.cc/D74LJb45/SCR-20240912-sdko.png
        
         | guiambros wrote:
         | FYI, there's a "Save ChatGPT as PDF" Chrome extension [1].
         | 
         | I wouldn't use on a ChatGPT for Business subscription (it may
         | be against your company's policies to export anything), but
         | very convenient for personal use.
         | 
         | https://chromewebstore.google.com/detail/save-chatgpt-as-pdf...
        
         | andraz wrote:
         | What is the brand of the fan? Same problem here with
         | proprietary hood fan...
        
           | bartman wrote:
           | InVENTer Pulsar
        
         | smusamashah wrote:
         | What if you copy the whole reasoning process example provided
         | by OpenAI, use it as a system prompt (to teach how to reason),
         | use that system prompt in Claude, got4o etc?
        
           | azeirah wrote:
           | It might work a little bit. It's like doing few shot
           | prompting instead of training it to reason.
        
         | fragmede wrote:
         | I'm impressed. I had two modified logic puzzles where ChatGPT-4
         | fails but o1 succeeds. The training data had too many instances
         | of the unmodified puzzle, so 4 wouldn't get it right. o1
         | manages to not get tripped up by them.
         | 
         | https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...
         | 
         | https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
        
         | 8thcross wrote:
         | This is a brilliant hypothesis deconstruction. I am sure others
         | will now be able to test as well and this should confirm their
         | engineering.
        
       | jmartin2683 wrote:
       | Per-token billing will be lit
        
       | drzzhan wrote:
       | "hidden chain of thought" is basically the finetuned prompt isn't
       | it? The time scale x-axis is hidden as well. Not sure how they
       | model the gpt for it to have an ability to decide when to stop
       | CoT and actually answer.
        
       | holmesworcester wrote:
       | Since ChatGPT came out my test has been, can this thing write me
       | a sestina.
       | 
       | It's sort of an arbitrary feat with language and following
       | instructions that would be annoying for me and seems impressive.
       | 
       | Previous releases could not reliably write a sestina. This one
       | can!
        
       | fraboniface wrote:
       | Some commenters seem a bit confused as to how this works. Here is
       | my understanding, hoping it helps clarify things.
       | 
       | Ask something to a model and it will reply in one go, likely
       | imperfectly, as if you had one second to think before answering a
       | question. You can use CoT prompting to force it to reason out
       | loud, which improves quality, but the process is still linear.
       | It's as if you still had one second to start answering but you
       | could be a lot slower in your response, which removes some
       | mistakes.
       | 
       | Now if instead of doing that you query the model once with CoT,
       | then ask it or another model to critically assess the reply, then
       | ask the model to improve on its first reply using that feedback,
       | then keep doing that until the critic is satisfied, the output
       | will be better still. Note that this is a feedback loop with
       | multiple requests, which is of different nature that CoT and much
       | more akin to how a human would approach a complex problem. You
       | can get MUCH better results that way, a good example being Code
       | Interpreter. If classic LLM usage is system 1 thinking, this is
       | system 2.
       | 
       | That's how o1 works at test time, probably.
       | 
       | For training, my guess is that they started from a model not that
       | far from GPT-4o and fine-tuned it with RL by using the above
       | feedback loop but this time converting the critic to a reward
       | signal for a RL algorithm. That way, the model gets better at
       | first guessing and needs less back and forth for the same output
       | quality.
       | 
       | As for the training data, I'm wondering if you can't somehow get
       | infinite training data by just throwing random challenges at it,
       | or very hard ones, and let the model think about/train on them
       | for a very long time (as long as the critic is unforgiving
       | enough).
        
       | hidelooktropic wrote:
       | > THERE ARE THREE R'S IN STRAWBERRY
       | 
       | It finally got it!!!
        
       | acomjean wrote:
       | I always think to a professor that was consulting on some civil
       | engineering software. He found a bug in the calculation it was
       | using to space rebar placed in concrete, based on looking at it
       | was spitting out and thinking that looks wrong.
       | 
       | This kind of thing makes me nervous.
        
       | zh3 wrote:
       | Question here is about the "reasoning" tag - behind the scenes,
       | is this qualitively different fron stringing words together on a
       | statistical basis? (aside from backroom tweaking and some
       | randomisation).
        
       | canjobear wrote:
       | First shot, I gave it a medium-difficulty math problem, something
       | I actually wanted the answer to (derive the KL divergence between
       | two Laplace distributions). It thought for a long time, and still
       | got it wrong, producing a plausible but wrong answer. After some
       | prodding, it revised itself and then got it wrong again. I still
       | feel that I can't rely on these systems.
        
         | spaceman_2020 wrote:
         | Look where you were 3 years ago, and where you are now.
         | 
         | And then imagine where you will be in 5 more years.
         | 
         | If it can _almost_ get a complex problem right now, I 'm dead
         | sure it will get it correct within 5 years
        
           | colonelspace wrote:
           | > I'm dead sure it will get it correct within 5 years
           | 
           | You might be right.
           | 
           | But plenty of people said we'd all be getting around in self-
           | driving cars _for sure_ 10 years ago.
        
             | neevans wrote:
             | we do have self driving car but since it directly affects
             | people's life it needs to be close to 100% accurate and no
             | margin of errors. Not necessarily the case for LLMs.
        
           | AnIrishDuck wrote:
           | I'm not? The history of AI development is littered with
           | examples of false starts, hidden traps, and promising
           | breakthroughs that eventually expose deeper and more
           | difficult problems [1].
           | 
           | I wouldn't be _shocked_ if it could eventually get it right,
           | but _dead sure_?
           | 
           | 1. https://en.wikipedia.org/wiki/AI_winter
        
           | taco_emoji wrote:
           | Have you never heard of "local maxima"? Why are you so
           | certain another 5 years will provide any qualitative
           | advancement at all?
        
           | mogoh wrote:
           | But can it now say "I don't know." ? Or can it evaluate its
           | own results and came to the conclusion that its just a wild
           | guess?
           | 
           | I am still impressed by the progress though.
        
           | evilfred wrote:
           | what makes you so "dead sure"? it's just hallucinating as
           | always
        
           | methodical wrote:
           | You're dead sure? I wouldn't say anything definite about
           | technology advancements. People seem to underestimate the
           | last 20% of the problem and only focus on the massive 80%
           | improvements up to this point.
        
           | zer0tonin wrote:
           | The progress since GPT-3 hasn't been spectacularly fast.
        
           | ActorNightly wrote:
           | Getting complex problem = having the solution in some form in
           | the training dataset.
           | 
           | All we are gonna get is better and better googles.
        
           | closeparen wrote:
           | It is not at all clear that "produce correct answer" is the
           | natural endpoint of "produce plausible on-topic utterances
           | that look like they could be answers." To do the former you
           | need to know something about the underlying structure of
           | reality (or have seen the answer before), to do the latter
           | you only need to be good at pattern-matching and language.
        
           | dontlikeyoueith wrote:
           | I still don't have a Mr. Fusion in my house, FYI.
           | 
           | We always overestimate the future.
        
         | m3kw9 wrote:
         | Maybe you are wrong if you don't know the answer?
        
       | fzaninotto wrote:
       | It can solve sudoku. It took 119s to solve this easy grid:
       | 
       | _ 7 8 4 1 _ _ _ 9
       | 
       | 5 _ 1 _ 2 _ 4 7 _
       | 
       | _ 2 9 _ 6 _ _ _ _
       | 
       | _ 3 _ _ _ 7 6 9 4
       | 
       | _ 4 5 3 _ _ 8 1 _
       | 
       | _ _ _ _ _ _ 3 _ _
       | 
       | 9 _ 4 6 7 2 1 3 _
       | 
       | 6 _ _ _ _ _ 7 _ 8
       | 
       | _ _ _ 8 3 1 _ _ _
        
         | fzaninotto wrote:
         | It seems to be unable to solve hard sudokus, like the following
         | one where it gave 2 wrong answers before abandoning.
         | 
         | +-------+-------+-------+ | 6 . . | 9 1 . | . . . | | 2 . 5 | .
         | . . | 1 . 7 | | . 3 . | . 2 7 | 5 . . |
         | +-------+-------+-------+ | 3 . 4 | . . 1 | . 2 . | | . 6 . | 3
         | . . | . . . | | . . 9 | . 5 . | . 7 . |
         | +-------+-------+-------+ | . . . | 7 . . | 2 1 . | | . . . | .
         | 9 . | 7 . 4 | | 4 . . | . . . | 6 8 5 |
         | +-------+-------+-------+
         | 
         | So we're safe for another few months.
        
       | the_king wrote:
       | Peter Thiel was widely criticized this spring when he said that
       | AI "seems much worse for the math people than the word people."
       | 
       | So far, that seems to be right. The only thing o1 is worse at is
       | writing.
        
       | MillionOClock wrote:
       | What is the maximum context size in the web UI?
        
       | kherud wrote:
       | Aren't LLMs much more limited on the amount of output tokens than
       | input tokens? For example, GPT-4o seems to support only up to 16
       | K output tokens. I'm not completely sure what the reason is, but
       | I wonder how that interacts with Chain-of-Thought reasoning.
        
         | trissi1996 wrote:
         | Not really.
         | 
         | There's no fundamental difference between input and output
         | tokens technically.
         | 
         | The internal model space is exactly the same after evaluating
         | some given set of token, no matter which of them were produced
         | by the prompter or the model.
         | 
         | The 16k output token limit is just an arbitrary limit in the
         | chatgpt interface.
        
       | novaleaf wrote:
       | boo, they are hiding the chain of thought from user output (the
       | great improvement here)
       | 
       | > Therefore, after weighing multiple factors including user
       | experience, competitive advantage, and the option to pursue the
       | chain of thought monitoring, we have decided not to show the raw
       | chains of thought to users. We acknowledge this decision has
       | disadvantages. We strive to partially make up for it by teaching
       | the model to reproduce any useful ideas from the chain of thought
       | in the answer. For the o1 model series we show a model-generated
       | summary of the chain of thought.
        
       | owenpalmer wrote:
       | "The Future Of Reasoning" by Vsauce [0] is a fascinating pre-AI-
       | era breakdown of how human reasoning works. Thinking about it in
       | terms of LLMS is really interesting.
       | 
       | [0]: https://www.youtube.com/watch?v=_ArVh3Cj9rw
        
       | LarsDu88 wrote:
       | I wonder if this architecture is just asking a chain of thought
       | prompt, or whether they built a diffusion model.
       | 
       | The old problem with image generation was that single pass
       | techniques like GANs and VAEs had to do everything in one go.
       | Diffusion models wound up being better by doing things
       | iteratively.
       | 
       | Perhaps this is a diffusion model for text (top ICML paper this
       | year was related to this).
        
       | devit wrote:
       | They claim it's available in ChatGPT Plus, but for me clicking
       | the link just gives GPT-4o Mini.
        
       | lukev wrote:
       | This is a pretty big technical achievement, and I am excited to
       | see this type of advancement in the field.
       | 
       | However, I am very worried about the utility of this tool given
       | that it (like all LLMs) is still prone to hallucination. Exactly
       | who is it for?
       | 
       | If you're enough of an expert to critically judge the output,
       | you're probably just as well off doing the reasoning yourself. If
       | you're not capable of evaluating the output, you risk relying on
       | completely wrong answers.
       | 
       | For example, I just asked it to evaluate an algorithm I'm working
       | on to optimize database join ordering. Early in the reasoning
       | process it confidently and incorrectly stated that "join costs
       | are usually symmetrical" and then later steps incorporated that,
       | trying to get me to "simplify" my algorithm by using an
       | undirected graph instead of a directed one as the internal data
       | structure.
       | 
       | If you're familiar with database optimization, you'll know that
       | this is... very wrong. But otherwise, the line of reasoning was
       | cogent and compelling.
       | 
       | I worry it would lead me astray, if it confidently relied on a
       | fact that I wasn't able to immediately recognize was incorrect.
        
         | ramesh31 wrote:
         | >If you're enough of an expert to critically judge the output,
         | you're probably just as well off doing the reasoning yourself.
         | 
         | Thought requires energy. A lot of it. Humans are for more
         | efficient in this regard than LLMs, but then a bicycle is also
         | much more efficient than a race car. I've found that even when
         | they are hilariously wrong about something, simply the
         | directionality of the line of reasoning can be enough to
         | usefully accelerate my own thought.
        
       | OkGoDoIt wrote:
       | Some practical notes from digging around in their documentation:
       | In order to get access to this, you need to be on their tier 5
       | level, which requires $1,000 total paid and 30+ days since first
       | successful payment.
       | 
       | Pricing is $15.00 / 1M input tokens and $60.00 / 1M output
       | tokens. Context window is 128k token, max output is 32,768
       | tokens.
       | 
       | There is also a mini version with double the maximum output
       | tokens (65,536 tokens), priced at $3.00 / 1M input tokens and
       | $12.00 / 1M output tokens.
       | 
       | The specialized coding version they mentioned in the blog post
       | does not appear to be available for use.
       | 
       | It's not clear if the hidden chain of thought reasoning is billed
       | as paid output tokens. Has anyone seen any clarification about
       | that? If you are paying for all of those tokens it could add up
       | quickly. If you expand the chain of thought examples on the blog
       | post they are extremely verbose.
       | 
       | https://platform.openai.com/docs/models/o1
       | https://openai.com/api/pricing/
       | https://platform.openai.com/docs/guides/rate-limits/usage-ti...
        
         | activatedgeek wrote:
         | Reasoning tokens are indeed billed as output tokens.
         | 
         | > While reasoning tokens are not visible via the API, they
         | still occupy space in the model's context window and are billed
         | as output tokens.
         | 
         | From here: https://platform.openai.com/docs/guides/reasoning
        
           | baq wrote:
           | This is concerning - how do you know you aren't being fleeced
           | out of your money here...? You'll get your results, but did
           | you _really_ use that much?
        
             | rsanek wrote:
             | obfuscated billing has long been a staple of all great
             | cloud products. AWS innovated in the space and now many
             | have followed in their footsteps
        
             | lolinder wrote:
             | Also, now we're paying for output tokens that aren't even
             | output, with no good explanation for why these tokens
             | should be hidden from the person who paid for them.
        
         | amrrs wrote:
         | The CoT is billed as output tokens. Mentioned in the docs where
         | it talks about reasoning
        
         | vdfs wrote:
         | We just receivied this email:
         | 
         | Hi there,
         | 
         | I'm x, PM for the OpenAI API. I'm pleased to share with you our
         | new series of models, OpenAI o1. We've developed these models
         | to spend more time thinking before they respond. They can
         | reason through complex tasks and solve harder problems than
         | previous models in science, coding, and math.
         | 
         | As a trusted developer on usage tier 5, you're invited to get
         | started with the o1 beta today. Read the docs You have access
         | to two models:                   Our larger model, o1-preview,
         | which has strong reasoning capabilities and broad world
         | knowledge.          Our smaller model, o1-mini, which is 80%
         | cheaper than o1-preview.
         | 
         | Try both models! You may find one better than the other for
         | your specific use case. Both currently have a rate limit of 20
         | RPM during the beta. But keep in mind o1-mini is faster,
         | cheaper, and competitive with o1-preview at coding tasks (you
         | can see how it performs here). We've also written up more about
         | these models in our blog post.
         | 
         | I'm curious to hear what you think. If you're on X, I'd love to
         | see what you build--just reply to our post.
         | 
         | Best, OpenAI API
        
         | sashank_1509 wrote:
         | I have access to this and there is no way I spend more than 50$
         | on OpenAI api. I have ChatGPT + since day q though (240$
         | probably in total)
        
           | rpmisms wrote:
           | You missed your raise key on "day q"
        
         | Buttons840 wrote:
         | I am a Plus user and pay $20 per month. I have access to the o1
         | models.
        
         | liamwire wrote:
         | Unless this is specifically relating to API access, I don't
         | think it's correct. I've been paying for ChatGPT via the App
         | Store IAP for around a year or less, and I've already got both
         | o1-preview and o1-mini available in-app.
        
           | OkGoDoIt wrote:
           | Yes, I was referring to API access specifically. Nothing in
           | the blog post or the documentation mentions access to these
           | new models on ChatGPT, and even as a paid user I'm not seeing
           | them on there (Edit: I am seeing it now in the app). But
           | looks like a bunch of other people in this discussion do have
           | it on ChatGPT, so that's exciting to hear.
        
         | arnaudsm wrote:
         | Some of the queries run for multiple minutes. 40 tokens/sec is
         | too slow for CoT.
         | 
         | I hope OpenAI is investing in low-latency like Groq's tech that
         | can reach 1k tokens/sec.
        
         | anigbrowl wrote:
         | _you need to be on their tier 5 level, which requires $1,000
         | total paid and [...]_
         | 
         | Good opening for OpenAI's competitors to run a 'we're not
         | snobs' promotion.
        
         | jonahx wrote:
         | I am an ordinary plus user (since it was released more or less)
         | and have access.
        
       | jdthedisciple wrote:
       | I challenged it to solve the puzzle in my profile info.
       | 
       | It failed ;)
        
       | jupi2142 wrote:
       | Using codeforces as a benchmark feels like a cheat, since OpenAI
       | use to pay us chump change to solve codeforces questions and
       | track our thought process on jupyter notebook.
        
       | rcarmo wrote:
       | Here's an unpopular take on this:
       | 
       | "We had the chance to make AI decision-making auditable but are
       | locking ourselves out of hundreds of critical applications by not
       | exposing the chain of thought."
       | 
       | One of the key blockers in many customer discussions I have is
       | that AI models are not really auditable and that automating
       | complex processes with them (let alone debug things when
       | "reasoning" goes awry) is difficult if not impossible unless you
       | do multi-shot and keep track of all the intermediate outputs.
       | 
       | I really hope they expose the chain of thought as some sort of
       | machine-parsable output, otherwise no real progress will have
       | been made (many benchmarks are not really significant when you
       | try to apply LLMs to real-life applications and use cases...)
        
         | fwip wrote:
         | I suspect that actually reading the "chain of thought" would
         | reveal obvious "logic" errors embarrassingly often.
        
           | rcarmo wrote:
           | It would still be auditable. In a few industries that is the
           | only blocker for adoption--even if the outputs are incorrect.
        
           | thimabi wrote:
           | I believe that is the case. Out of curiosity, I had this
           | model try to solve a very simple Sudoku puzzle in ChatGPT,
           | and it failed spectacularly.
           | 
           | It goes on and on making reasoning mistakes, and always ends
           | up claiming that the puzzle is unsolvable and apologizing. I
           | didn't expect it to solve the puzzle, but the whole reasoning
           | process seems fraught with errors.
        
       | 015a wrote:
       | Here's a video demonstration they posted on YouTube:
       | https://www.youtube.com/watch?v=50W4YeQdnSg
        
       | kfrane wrote:
       | I was a bit confused when looking at the English example for
       | Chain-Of-Thought. It seems that the prompt is a bit messed up
       | because the whole statement is bolded but it seems that only
       | "appetite regulation is a field of staggering complexity" part
       | should be bolded. Also that's how it shows up in the o1-preview
       | response when you open the Chain of thought section.
        
       | derefr wrote:
       | So, it's good at hard-logic reasoning (which is great, and no
       | small feat.)
       | 
       | Does this reasoning capability generalize outside of the
       | knowledge domains the model was trained to reason about, into
       | "softer" domains?
       | 
       | For example, is O1 better at comedy (because it can reason better
       | about what's funny)?
       | 
       | Is it better at poetry, because it can reason about rhyme and
       | meter?
       | 
       | Is it better at storytelling as an extension of an existing input
       | story, because it now will first analyze the story-so-far and
       | deduce aspects of the characters, setting, and themes that the
       | author seems to be going for (and will ask for more information
       | about those things if it's not sure)?
        
       | fsndz wrote:
       | My point of view: this is a real advancement. I've always
       | believed that with the right data allowing the LLM to be trained
       | to imitate reasoning, it's possible to improve its performance.
       | However, this is still pattern matching, and I suspect that this
       | approach may not be very effective for creating true
       | generalization. As a result, once o1 becomes generally available,
       | we will likely notice the persistent hallucinations and faulty
       | reasoning, especially when the problem is sufficiently new or
       | complex, beyond the "reasoning programs" or "reasoning patterns"
       | the model learned during the reinforcement learning phase.
       | https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
        
         | abhorrence wrote:
         | > As a result, once o1 becomes generally available, we will
         | likely notice the persistent hallucinations and faulty
         | reasoning, especially when the problem is sufficiently new or
         | complex, beyond the "reasoning programs" or "reasoning
         | patterns" the model learned during the reinforcement learning
         | phase.
         | 
         | I had been using 4o as a rubber ducky for some projects
         | recently. Since I appeared to have access to o1-preview, I
         | decided to go back and redo some of those conversations with
         | o1-preview.
         | 
         | I think your comment is spot on. It's definitely an
         | advancement, but still makes some pretty clear mistakes and
         | does some fairly faulty reasoning. It especially seems to have
         | a hard time with causal ordering, and reasoning about
         | dependencies in a distributed system. Frequently it gets the
         | relationships backwards, leading to hilarious code examples.
        
           | fsndz wrote:
           | True. I just extensively tested o1 and came to the same
           | conclusion.
        
       | geenkeuse wrote:
       | Average Joe's like myself will build our apps end to end with the
       | help of AI.
       | 
       | The only shops left standing will be Code Auditors.
       | 
       | The solopreneur will wing it, without them, but enterprises will
       | take the (very expensive) hit to stay safe and compliant.
       | 
       | Everyone else needs to start making contingency plans.
       | 
       | Magnus Carlsen is the best chess player in the world, but he is
       | not arrogant enough to think he can go head to head with
       | Stockfish and not get a beating.
        
       | morningsam wrote:
       | What sticks out to me is the 60% win rate vs GPT-4o when it comes
       | to actual usage by humans for programming tasks. So in reality
       | it's barely better than GPT-4o. That the figure is higher for
       | mathematical calculation isn't surprising because LLMs were much
       | worse at that than at programming to begin with.
        
         | quirino wrote:
         | I'm not sure that's the right way to interpret it.
         | 
         | If some tasks are too easy, both models might give satisfactory
         | answers, in which case the human preference might as well be a
         | coin toss.
         | 
         | I don't know the specifics of their methodology though.
        
       | pknerd wrote:
       | I m wondering, what kind of "AI wrappers" will emerge from this
       | model.
        
       | wesleyyue wrote:
       | Just added o1 to https://double.bot if anyone would like to try
       | it for coding.
       | 
       | ---
       | 
       | Some thoughts:
       | 
       | * The performance is really good. I have a private set of
       | questions I note down whenever gpt-4o/sonnet fails. o1 solved
       | everything so far.
       | 
       | * It really is quite slow
       | 
       | * It's interesting that the chain of thought is hidden. This is I
       | think the first time where OpenAI can improve their models
       | without it being immediately distilled by open models. It'll be
       | interesting to see how quickly the oss field can catch up
       | technique-wise as there's already been a lot of inference time
       | compute papers recently [1,2]
       | 
       | * Notably it's not clear whether o1-preview as it's available now
       | is doing tree search or just single shoting a cot that is
       | distilled from better/more detailed trajectories in the training
       | distribution.
       | 
       | [1](https://arxiv.org/abs/2407.21787)
       | 
       | [2](https://arxiv.org/abs/2408.03314)
        
         | TheMiddleMan wrote:
         | Trying out Double now.
         | 
         | o1 did a significantly better job converting a JavaScript file
         | to TypeScript than Llama 3.1 405B, GitHub Copilot, and Claude
         | 3.5. It even simplified my code a bit while retaining the same
         | functionality. Very impressive.
         | 
         | It was able to refactor a ~160 line file but I'm getting an
         | infinite "thinking bubble" on a ~420 line file. Maybe
         | something's timing out with the longer o1 response times?
        
       | h1fra wrote:
       | Having read the full transcript I don't get how it counted 22
       | letters for mynznvaatzacdfoulxxz. It's nice that it corrected
       | itself but a bit worrying
        
       | kmeisthax wrote:
       | >We believe that a hidden chain of thought presents a unique
       | opportunity for monitoring models. Assuming it is faithful and
       | legible, the hidden chain of thought allows us to "read the mind"
       | of the model and understand its thought process. For example, in
       | the future we may wish to monitor the chain of thought for signs
       | of manipulating the user. However, for this to work the model
       | must have freedom to express its thoughts in unaltered form, so
       | we cannot train any policy compliance or user preferences onto
       | the chain of thought. We also do not want to make an unaligned
       | chain of thought directly visible to users.
       | 
       | >Therefore, after weighing multiple factors including user
       | experience, competitive advantage, and the option to pursue the
       | chain of thought monitoring, we have decided not to show the raw
       | chains of thought to users. We acknowledge this decision has
       | disadvantages. We strive to partially make up for it by teaching
       | the model to reproduce any useful ideas from the chain of thought
       | in the answer. For the o1 model series we show a model-generated
       | summary of the chain of thought.
       | 
       | So, let's recap. We went from:
       | 
       | - Weights-available research prototype with full scientific
       | documentation (GPT-2)
       | 
       | - Commercial-scale model with API access only, full scientific
       | documentation (GPT-3)
       | 
       | - Even bigger API-only model, tuned for chain-of-thought
       | reasoning, minimal documentation on the implementation (GPT-4,
       | 4v, 4o)
       | 
       | - An API-only model tuned to generate unedited chain-of-thought,
       | which will not be shown to the user, even though it'd be really
       | useful to have (o1)
        
         | dvt wrote:
         | It's clear to me that OpenAI is quickly realizing they have no
         | moat. Even this obfuscation of the chain-of-thought isn't
         | _really_ a moat. On top of CoT being pretty easy to implement
         | and tweak, there 's a serious push to on-device inference
         | (which imo is the future), so the question is: will GPT-5 and
         | beyond be really _that much_ better than what we can run
         | locally?
        
           | falcor84 wrote:
           | Based on their graphs of how quality scales well with compute
           | cycles, I would expect that it would indeed continue to be
           | that much better (unless you can afford the same compute
           | locally).
        
             | darby_nine wrote:
             | Not much of a moat vs other private enterprise, though
        
           | threatofrain wrote:
           | I don't see why on-device inference is the future. For
           | consumers, only a small set of use cases cannot tolerate the
           | increased latency. Corporate customers will be satisfied if
           | the model can be hosted within their borders. Pooling compute
           | is less wasteful overall as a collective strategy.
           | 
           | This argument can really only meet its tipping point when
           | massive models no longer offer a gotta-have-it difference vs
           | smaller models.
        
             | unethical_ban wrote:
             | On-device inference will succeed the way Linux does: It is
             | "free" in that it only requires the user to acquire a model
             | to run vs. paying for processing. It protects privacy, and
             | it doesn't require internet. It may not take over for all
             | users, but it will be around.
             | 
             | This assumes that openly developed (or at least weight-
             | available) models are available for free, and continue
             | being improved.
        
           | phillipcarter wrote:
           | I don't understand the idea that they have no moat. Their
           | moat is not technological. It's sociological. Most AI through
           | APIs uses their models. Most consumer use of AI involves
           | their models, or ChatGPT directly. They're clearly not in the
           | "train your own model on your data in your environment" game,
           | as that's a market for someone else. But make no mistake,
           | they have a moat and it is strong.
        
             | dvt wrote:
             | > But make no mistake, they have a moat and it is strong.
             | 
             | Given that Mistral, Llama, Claude, and even Gemini are
             | competitive with (if not better than) OpenAI's flagships, I
             | don't really think this is true.
        
               | crowcroft wrote:
               | Inertia is a hell of a moat.
               | 
               | Everyone building is comfortable with OpenAI's API, and
               | have an account. Competing models can't just be as good,
               | they need to be MUCH better to be worth switching.
               | 
               | Even as competitors build a sort of compatibility layer
               | to be plug an play with OpenAI they will always be a step
               | behind at best every time OpenAI releases a new feature.
        
               | chadash wrote:
               | Only a small fraction of all future AI projects have even
               | gotten started. So they aren't only fighting over what's
               | out there now, they're fighting over what will emerge.
        
               | phillipcarter wrote:
               | This is true, and yet, many orgs who have experimented
               | with OpenAI and are likely to return to them when a
               | project "becomes real". When you google around online for
               | how to do XYZ thing using LLMs, OpenAI is usually in
               | whatever web results you read. Other models and APIs are
               | also now using OpenAI's API format since it's the
               | apparent winner. And for anyone who's already sent out
               | subprocessor notifications with them as a vendor, they're
               | locked in.
               | 
               | This isn't to say it's only going to be an OpenAI market.
               | Enterprise worlds move differently, such as those in G
               | Cloud who will buy a few million $$ of Vertex expecting
               | to "figure out that gemini stuff later". In that sense,
               | Google has a moat with those slices of their customers.
               | 
               | But I believe that when people think OpenAI has no moat
               | because "the models will be a commodity", I think that's
               | (a) some wishful thinking about the models and (b)
               | doesn't consider the sociological factors that matter a
               | lot more than how powerful a model is or where it runs.
        
               | phillipcarter wrote:
               | There are countless tools competitive with or better than
               | what I use for email, and yet I still stick with my email
               | client. Same is true for many, many other tools I use. I
               | could perhaps go out of my way to make sure I'm always
               | using the most technically capable and easy-to-use tools
               | for everything, but I don't, because I know how to use
               | what I have.
               | 
               | This is the exact dynamic that gives OpenAI a moat. And
               | it certainly doesn't hurt them that they still produce
               | SOTA models.
        
               | dontlikeyoueith wrote:
               | That is not what anyone means when they talk about moats.
        
               | phillipcarter wrote:
               | I'm someone, and that's one of the ways I define a moat.
        
               | calmoo wrote:
               | First mover advantage is not a great moat.
        
               | calmoo wrote:
               | Yeah but the lock-in wrt email is absolutely huge
               | compared to chatting with an LLM. I can (and have) easily
               | ended my subscription to ChatGPT and switched to Claude,
               | because it provides much more value to me at roughly the
               | same cost. Switching email providers will, in general,
               | not provide that much value to me and cause a large
               | headache for me to switch.
               | 
               | Switching LLMs right now can be compared to switching
               | electricity providers or mobile carriers - generally it's
               | pretty low friction and provides immediate benefit (in
               | the case of electricity and mobile, the benefit is cost).
               | 
               | You simply cannot compare it to an email provider.
        
               | dragonwriter wrote:
               | That's not a strong moat (arguably, not a moat at all,
               | since as soon as any competitor has any business, they
               | benefit from it with respect to their existing
               | customers), it doesn't effect anyone who is not already
               | invested in OpenAI's products, and because not every
               | customer is like that with products they are currently
               | using.
               | 
               | Now, having a large existing customer base and thus
               | having an advantage in training data that feeds into an
               | advantage in improving their products and acquiring new
               | (and retaining existing customers) could, arguably, be a
               | moat; that's a network effect, not merely inertia, and
               | network effects can be a foundation of strong (though
               | potentially unstable, if there is nothing else shoring
               | them up) moats.
        
             | neaden wrote:
             | Doesn't that make it less of a moat? If the average
             | consumer is only interacting with it through a third party,
             | and that third party has the ability to switch to something
             | better or cheaper and thus switch thousands/millions of
             | customers at once?
        
             | anigbrowl wrote:
             | Their moat is no stronger than a good UI/API. What they
             | have is first mover advantage and branding.
        
               | lolinder wrote:
               | LiteLLM proxies their API to all other providers and
               | there are dozens of FOSS recreations of their UI,
               | including ones that are more feature-rich, so neither the
               | UI nor the API are a moat.
               | 
               | Branding and first mover is it, and it's not going to
               | keep them ahead forever.
        
           | bgar wrote:
           | >there's a serious push to on-device inference
           | 
           | What push are you referring to? By whom?
        
           | thih9 wrote:
           | Why would a non profit / capped profit company, one that
           | prioritizes public good, want a moat? Tongue in cheek.
        
           | danenania wrote:
           | I wonder if they'll be able to push the chain-of-thought
           | directly into the model. I'd imagine there could be some
           | serious performance gains achievable if the model could
           | "think" without doing IO on each cycle.
           | 
           | In terms of moat, I think people underestimate how much of
           | OpenAI's moat is based on operations and infrastructure
           | rather than being purely based on model intelligence. As
           | someone building on the API, it is by far the most reliable
           | option out there currently. Claude Sonnet 3.5 is stronger on
           | reasoning than gpt-4o but has a higher error rate, more
           | errors conforming to a JSON schema, much lower rate limits,
           | etc. These things are less important if you're just using the
           | first-party chat interfaces but are _very_ important if you
           | 're building on top of the APIs.
        
         | airstrike wrote:
         | I think it's clear their strategy has changed. The whole
         | landscape has changed. The size of models, amount of dollars,
         | numbers of competitors and how much compute this whole exercise
         | takes in the long term have all changed, so it's fair for them
         | to adapt.
         | 
         | It just so happens that they're keeping their old name.
         | 
         | I think people focus too much on the "open" part of the name. I
         | read "OpenAI" sort of like I read "Blackberry" or "Apple". I
         | don't really think of fruits, I think of companies and their
         | products.
        
         | beambot wrote:
         | Very anti-open and getting less and less with each release.
         | Rooting for Meta in this regard, at least.
        
         | mitthrowaway2 wrote:
         | > For example, in the future we may wish to monitor the chain
         | of thought for signs of manipulating the user.[...] Therefore
         | we have decided not to show the raw chains of thought to users.
         | 
         | Better not let the user see the part where the AI says "Next,
         | let's manipulate the user by lying to them". It's for their own
         | good, after all! We wouldn't want to make an unaligned chain of
         | thought directly visible!
        
         | Buttons840 wrote:
         | I always laughed at the idea of a LLM Skynet "secretly"
         | plotting to nuke humanity, while a bunch of humans watch it
         | unfold before their eyes in plaintext.
         | 
         | Now that seems less likely. At least OpenAI can see what it's
         | thinking.
         | 
         | A next step might be allowing the LLM to include non-text-based
         | vectors in its internal thoughts, and then do all internal
         | reasoning with raw vectors. Then the LLMs will have truly
         | private thoughts in their own internal language. Perhaps we
         | will use a LLM to interpret the secret thoughts of another LLM?
         | 
         | This could be good or bad, but either way we're going to need
         | more GPUs.
        
           | hobo_in_library wrote:
           | "...either way we're going to need more GPUs." posted the
           | LLM, rubbing it's virtual hands, cackling with delight as it
           | prodded the humans to give it MOAR BRAINS
        
           | navigate8310 wrote:
           | At this point the G in GPU must be completely dropped
        
             | fragmede wrote:
             | Gen-ai Processing Unit
        
           | scotty79 wrote:
           | > Now that seems less likely. At least OpenAI can see what
           | it's thinking.
           | 
           | When it's fully commercialize no one will be able to read
           | through all chains of thoughts, and with possibility of fine-
           | tuning AI can learn to evade whatever tools openai will
           | invent to flag concerning chains of thoughts if they
           | interfere with providing the answer in some finetuning
           | environment.
        
           | legionof7 wrote:
           | >Perhaps we will use a LLM to interpret the secret thoughts
           | of another LLM?
           | 
           | this is a pretty active area of research with sparse
           | autoencoders
        
         | lossolo wrote:
         | It's because there is nothing novel here from an architectural
         | point of view. Again, the secret sauce is only in the training
         | data.
         | 
         | O1 seems like a variant of RLRF
         | https://arxiv.org/abs/2403.14238
         | 
         | Soon you will see similar models from competitors.
        
         | lolinder wrote:
         | The hidden chain of thought tokens are also billed as output
         | tokens, so you still pay for them even though they're not going
         | to let you see them:
         | 
         | > While reasoning tokens are not visible via the API, they
         | still occupy space in the model's context window and are billed
         | as output tokens.
         | 
         | https://platform.openai.com/docs/guides/reasoning
        
         | drooby wrote:
         | Did OpenAI ever even claim that they would be an open source
         | company?
         | 
         | It seems like their driving mission has always been to create
         | AI that is the "most beneficial to society".. which might come
         | in many different flavors.. including closed source.
        
           | josu wrote:
           | Kind of?
           | 
           | >We're hoping to grow OpenAI into such an institution. As a
           | non-profit, our aim is to build value for everyone rather
           | than shareholders. Researchers will be strongly encouraged to
           | publish their work, whether as papers, blog posts, or code,
           | and our patents (if any) will be shared with the world. We'll
           | freely collaborate with others across many institutions and
           | expect to work with companies to research and deploy new
           | technologies.
           | 
           | https://web.archive.org/web/20160220125157/https://www.opena.
           | ..
        
           | lolinder wrote:
           | > Because of AI's surprising history, it's hard to predict
           | when human-level AI might come within reach. When it does,
           | it'll be important to have a leading research institution
           | which can prioritize a good outcome for all over its own
           | self-interest.
           | 
           | > We're hoping to grow OpenAI into such an institution. As a
           | non-profit, our aim is to build value for everyone rather
           | than shareholders. Researchers will be strongly encouraged to
           | publish their work, whether as papers, blog posts, or code,
           | and our patents (if any) will be shared with the world. We'll
           | freely collaborate with others across many institutions and
           | expect to work with companies to research and deploy new
           | technologies.
           | 
           | I don't see much evidence that the OpenAI that exists now--
           | after Altman's ousting, his return, and the ousting of those
           | who ousted him--has any interest in mind besides its own.
           | 
           | https://openai.com/index/introducing-openai/
        
           | leetharris wrote:
           | https://web.archive.org/web/20190224031626/https://blog.open.
           | ..
           | 
           | > Researchers will be strongly encouraged to publish their
           | work, whether as papers, blog posts, or code, and our patents
           | (if any) will be shared with the world. We'll freely
           | collaborate with others across many institutions and expect
           | to work with companies to research and deploy new
           | technologies.
           | 
           | From their very own website. Of course they deleted it as
           | soon as Altman took over and turned it into a for profit,
           | closed company.
        
         | ec109685 wrote:
         | Given the chain of thought is sitting in the context, I'm sure
         | someone enterprising will find a way to extract it via a
         | jailbreak (despite it being better at preventing jailbreaks).
        
       | noshitsherlock wrote:
       | This is great. I've been wondering how we will revert back to an
       | agrarian society! You know, beating our swords into plowshares;
       | more leisure time, visiting with good people, getting to know
       | their thoughts hopes and dreams, playing music together, taking
       | time contemplating the vastness and beauty of the universe. We're
       | about to come full circle; back to Eden. It all makes sense now.
        
       | jseip wrote:
       | Landmark. Wild. Beautiful. The singularity is nigh.
        
       | wahnfrieden wrote:
       | Any word on whether this has enhanced Japanese support? They
       | announced Japanese-specific models a while back that were never
       | released.
        
       | haolez wrote:
       | This should also be good news for open weights models, right?
       | Since OpenAI is basically saying "you can get very far with good
       | prompts and some feedback loops".
        
       | jiggawatts wrote:
       | "THERE ARE THREE R'S IN STRAWBERRY" - o1
       | 
       | I got that reference!
        
       | m348e912 wrote:
       | I have a straight forward task that no model has been able to
       | successfully complete.
       | 
       | The request is pretty basic. If anyone can get it to work, I'd
       | like to know how and what model you're using. I tried it with
       | gpt4o1 and after ~10 iterations of showing it the failed output,
       | it still failed to come up with a one-line command to properly
       | display results.
       | 
       | Here it what I asked: Using a mac osx terminal and standard
       | available tools, provide a command to update the output of
       | netstat -an to show the fqdn of IP addresses listed in the
       | result.
       | 
       | This is what it came up with:
       | 
       | netstat -an | awk '{for(i=1;i<=NF;i++){if($i~/^([0-9]+\\.[0-9]+\\
       | .[0-9]+\\.[0-9]+)(\\.[0-9]+)?$/){split($i,a,".");ip=a[1]"."a[2]".
       | "a[3]"."a[4];port=(length(a)>4?"."a[5]:"");cmd="dig +short -x
       | "ip;cmd|getline h;close(cmd);if(h){sub(/\\.$/,"",h);$i=h
       | port}}}}1'
        
         | OutOfHere wrote:
         | Have you tried `ss -ar`? You may have to install `ss`. It is
         | standard on Linux.
        
       | Eextra953 wrote:
       | What is interesting to me is that there is no difference in the
       | AP English lit/lang exams. Why did chain-of-thought produce
       | negligible improvements in this area?
        
         | munchler wrote:
         | I would guess because there is not much problem-solving
         | required in that domain. There's less of a "right answer" to
         | reason towards.
        
       | deemahstro wrote:
       | Stop fooling around with stories about AI taking jobs from
       | programmers. Which programmers exactly??? Creators of idiotic web
       | pages? Nobody in their right mind would push generated code into
       | a financial system, medical equipment or autonomous transport.
       | Template web pages and configuration files are not the entire IT
       | industry. In addition, AI is good at tasks for which there are
       | millions of examples. 20 times I asked to generate a PowerShell
       | script, 20 times it was generated incorrectly. Because, unlike
       | Bash, there are far fewer examples on the Internet. How will AI
       | generate code for complex systems with business logic that it has
       | no idea about? AI is not able to generate, develop and change
       | complex information systems.
        
       | suziemanul wrote:
       | In this video Lukasz Kaiser, one of the main co-authors of o1,
       | talks about how to get to reasoning. I hope this may be useful
       | context for some.
       | 
       | https://youtu.be/_7VirEqCZ4g?si=vrV9FrLgIhvNcVUr
        
       | forgotthepasswd wrote:
       | I had trouble in the past to make any model give me accurate unix
       | epochs for specific dates.
       | 
       | I just went to GPT-4o (via DDG) and asked three questions:
       | 
       | 1. Please give me the unix epoch for September 1, 2020 at 1:00
       | GMT.
       | 
       | > 1598913600
       | 
       | 2. Please give me the unix epoch for September 1, 2020 at 1:00
       | GMT. Before reaching the conclusion of the answer, please output
       | the entire chain of thought, your reasoning, and the maths you're
       | doing, until your arrive at (and output) the result. Then, after
       | you arrive at the result, make an extra effort to continue, and
       | do the analysis backwards (as if you were writing a unit test for
       | the result you achieved), to verify that your result is indeed
       | correct.
       | 
       | > 1598922000
       | 
       | 3. Please give me the unix epoch for September 1, 2020 at 1:00
       | GMT. Then, after you arrive at the result, make an extra effort
       | to continue, and do the analysis backwards (as if you were
       | writing a unit test for the result you achieved), to verify that
       | your result is indeed correct.
       | 
       | > 1598913600
        
         | tagawa wrote:
         | Quick link for checking the result:
         | https://duckduckgo.com/?q=timestamp+1598922000&ia=answer
        
         | jewel wrote:
         | When I give it that same prompt, it writes a python program and
         | then executes it to find the answer:
         | https://chatgpt.com/share/66e35a15-602c-8011-a2cb-0a83be35b8...
        
         | Alifatisk wrote:
         | No need for llms to do that
         | 
         | ruby -r time -e 'puts Time.parse("2020-09-01 01:00:00
         | +00:00").to_i'
        
       | dada5000 wrote:
       | I find shorter responses > longer responses. Anyone share the
       | same consensus?
       | 
       | for example in gpt-4o I often append '(reply short)' at the end
       | of my requests. with the o1 models I append 'reply in 20 words'
       | and it gives way better answers.
        
       | Doorknob8479 wrote:
       | Why so much hate? They're doing their best. This is the state of
       | progress in the field so far. The best minds are racing to
       | innovate. The benchmarks are impressive nonetheless. Give them a
       | break. At the end of the day, they built the chatbot who's saving
       | your ass each day ever since.
        
         | bamboozled wrote:
         | Haven't used ChatGPT* in over 6 months, not saving my ass at
         | all.
        
           | jdthedisciple wrote:
           | I bet you've still used other models that were inspired by
           | GPT.
        
             | bamboozled wrote:
             | I've used co-pilot, I turned it off, kept suggesting
             | nonsense.
        
         | evilfred wrote:
         | not saving my ass, I never needed one professionally. OpenAI is
         | shovelling money into a furnace, I expect them to be
         | assimilated into Microsoft soon.
        
         | commodoreboxer wrote:
         | I think you're overestimating LLM usage.
        
       | resters wrote:
       | I tested o1-preview on some coding stuff I've been using gpt-4o
       | for. I am _not_ impressed. The new, more intentional chain of
       | thought logic is apparently not something it can meaningfully
       | apply to a non-trivial codebase.
       | 
       | Sadly I think this OpenAI announcement is hot air. I am now
       | (unfortunately) much less enthusiastic about upcoming OpenAI
       | announcements. This is the first one that has been extremely
       | underwhelming (though the big announcement about structured
       | responses (months after it had already been supported nearly
       | identically via JSONSchema) was in hindsight also hot air.
       | 
       | I think OpenAI is making the same mistake Google made with the
       | search interface. Rather than considering it a command line to be
       | mastered, Google optimized to generate better results for someone
       | who had no mastery of how to type a search phrase.
       | 
       | Similarly, OpenAI is optimizing for someone who doesn't know how
       | to interact with a context-limited LLM. Sure it helps the low
       | end, but based on my initial testing this is not going to be
       | helpful to anyone who had already come to understand how to
       | create good prompts.
       | 
       | What is needed is the ability for the LLM to create a useful,
       | ongoing meta-context for the conversation so that it doesn't make
       | stupid mistakes and omissions. I was really hoping OpenAI would
       | have something like this ready for use.
        
         | jdthedisciple wrote:
         | Your case would be more convincing by an example.
         | 
         | Though o1 did fail at the puzzle in my profile.
         | 
         | Maybe it's just tougher than even, its author, I had assumed...
        
         | egorfine wrote:
         | I have tested o1-preview on a couple of coding tasks and I _am_
         | impressed.
         | 
         | I am looking at a TypeScript project with quite an amount of
         | type gymnastics and a particular line of code did not validate
         | with tsc no matter what I have tried. I copy pasted the whole
         | context into o1-preview and it told me what is likely the error
         | I am seeing (and it was a spot on correct letter-by-letter
         | error message including my variable names), explained the
         | problem and provided two solutions, both of which immediately
         | worked.
         | 
         | Another test was I have pasted a smart contract in solidity and
         | naively asked to identify vulnerabilities. It thought for more
         | than a minute and then provided a detailed report of what could
         | go wrong. Much, much deeper than any previous model could do.
         | (No vulnerabilities found because my code is perfect, but
         | that's another story).
        
       | sturza wrote:
       | It seems like it's just a lot of prompting the same old models in
       | the background, no "reasoning" there. My age old test is "draw a
       | hand in ascii" - i've had no success with any model yet.
        
         | ActionHank wrote:
         | It seems like their current strat is to farm token count as
         | much as possible.
         | 
         | 1. Don't give the full answer on first request. 2. Each
         | response needs to be the wordiest thing possible. 3. Now just
         | talk to yourself and burn tokens, probably in the wordiest way
         | possible again. 4. ??? 5. Profit
         | 
         | Guaranteed they have number of tokens billed as a KPI
         | somewhere.
        
       | evilfred wrote:
       | it still fails at logic puzzles
       | https://x.com/colin_fraser/status/1834334418007457897
        
         | evilfred wrote:
         | and listing state names with the letter 'a'
         | https://x.com/edzitron/status/1834329704125661446
        
         | fragmede wrote:
         | Weird, it works to say the father when I try it:
         | 
         | https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
         | 
         | And also works on this variation:
         | 
         | https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
        
       | Slow_Hand wrote:
       | I have a question. The video demos for this all mention that the
       | o1 model is taking it's time to think through the problem before
       | answering. How does this functionally differ from - say - GPT-4
       | running it's algorithm, waiting five seconds and then revealing
       | the output? That part is not clear to me.
        
         | ActionHank wrote:
         | It is recursively "talking" to itself to plan and then refine
         | the answer.
        
       | davesque wrote:
       | "after weighing multiple factors including user experience,
       | competitive advantage, and the option to pursue the chain of
       | thought monitoring, we have decided not to show the raw chains of
       | thought to users"
       | 
       | ...umm. Am I the only one who feels like this takes away much of
       | the value proposition, and that it also runs heavily against
       | their stated safety goals? My dream is to interact with tools
       | like this to learn, not just to be told an answer. This just
       | feels very dark. They're not doing much to build trust here.
        
       | shreezus wrote:
       | Advanced reasoning will pave the way for recursive self-improving
       | models & agents. These capabilities will enable data flywheels,
       | error-correcting agentic behaviors, & self-reflection (agents
       | _understanding_ the implications of their actions, both
       | individually  & cooperatively).
       | 
       | Things will get extremely interesting and we're incredibly
       | fortunate to be witnessing what's happening.
        
         | AI_beffr wrote:
         | this is completely illogical. this is like gambling your life
         | savings and as the die are rolling you say "i am incredibly
         | fortunate to be witnessing this." like, you need to know the
         | outcome before you know whether it was fortunate or
         | unfortunate... this could be the most unfortunate thing that
         | has ever happened in history.
        
       | samanator wrote:
       | I just tested o1-preview on the "How many r's are in strawberry?"
       | question. It answers correctly!
        
       | bad_username wrote:
       | Prompt:
       | 
       | > Alice, who is an immortal robotic observer, orbits a black hole
       | on board a spaceship. Bob exits the spaceship and falls into the
       | black hole. Alice sees Bob on the edge of the event horizon,
       | getting closer and closer to it, but from her frame of reference
       | Bob will remain forever observable (in principle) outside the
       | horizon. > > A trillion year has passed, and Alice observes that
       | the black hole is now relatively rapidly shrinking due to the
       | Hawking radiation. How will Alice be observing the "frozen" Bob
       | as the hole shrinks? > > The black hole finally evaporated
       | completely. Where is Bob now?
       | 
       | O1-preview spits out the same nonsense that 4o does, telling that
       | as the horizon of the black hole shrinks, it gets closer to Bob's
       | apparent position. I realize that.the prompt is essentily asking
       | to solve the famous unsolved problem in physics (black hole
       | information paradox), but there's no need to be so confused with
       | basic geometry of the situation.
        
         | throwaway314155 wrote:
         | Out of curiosity, what answer to that would you find
         | acceptable? I don't know relativity well enough to even
         | speculate.
        
       | alkyon wrote:
       | Maybe this is improvement in some areas, still I got spurious
       | reasoning and inability to connect three simple facts:
       | 
       | Yes, "el presente acta de nacimiento" is correct in Spanish.
       | 
       | Explanation:                   "Acta" is a feminine noun that
       | begins with a stressed "a" sound. In Spanish, when a feminine
       | singular noun starts with a stressed "a" or "ha", the definite
       | article "la" is replaced with "el" to facilitate pronunciation.
       | However, the noun remains feminine.              Adjectives and
       | modifiers that accompany the noun "acta" should agree in feminine
       | gender and singular number. In this case, "presente" is an
       | adjective that has the same form for both masculine and feminine
       | singular nouns.
       | 
       | So, combining these rules:                   "El" (definite
       | article used before feminine nouns starting with stressed "a")
       | "Presente" (adjective agreeing in feminine singular)
       | "Acta de nacimiento" (feminine noun with its complement)
       | 
       | Therefore, "el presente acta de nacimiento" is grammatically
       | correct.
        
       | quantisan wrote:
       | Amazing! OpenAI figured out how to scale inference.
       | https://arxiv.org/abs/2407.21787 show how using more compute
       | during inference can outperform much larger models in tasks like
       | math problems
       | 
       | I wonder how do they decide when to stop these Chain of Thought
       | for each query? As anyone that played with agents can attest,
       | LLMs can talk with themselves forever.
        
       | nemo44x wrote:
       | Besides chat bits what viable products are being made with LLMs
       | besides APIs into LLMs?
        
       | sohamgovande wrote:
       | the newest scaling law: inference-time compute.
        
       | digitcatphd wrote:
       | I tested various Math Olympiad questions with Claude sonnet 3.5
       | and they all arrived at the correct solution. o1's solution was a
       | bit better formulated, in some circumstances, but sonnet 3.5 was
       | nearly instant.
        
       | la64710 wrote:
       | Can we please stop using the word "think" like o1 thinks before
       | it answers. I doubt we man the same when someone says a human
       | thinks vs o1 thinks. When I say I think "red" I am sure the word
       | think means something completely different than when you say
       | openai thinks red. I am not saying one is superior than the other
       | but maybe as humans we can use a different set of terminology for
       | the AI activities.
        
       | natch wrote:
       | I tried it with a cipher text that ChatGPT4o flailed with.
       | 
       | Recently I tried the same cipher with Claude Sonnet 3.5 and it
       | solved it quickly and perfectly.
       | 
       | Just now tried with ChatGPT o1 preview and it totally failed.
       | Based on just this one test, Claude is still way ahead.
       | 
       | ChatGPT also showed a comical (possibly just fake filler
       | material) journey of things it supposedly tried including several
       | rewordings of "rethinking my approach." It remarkably never
       | showed that it was trying common word patterns (other than one
       | and two letters) nor did it look for "the" and other "th" words
       | nor did it ever say that it was trying to match letter patterns.
       | 
       | I told it upfront as a hint that the text was in English and was
       | not a quote. The plaintext was one paragraph of layman-level
       | material on a technical topic including a foreign name, text that
       | has never appeared on the Internet or dark web. Pretty easy
       | cipher with a lot of ways to get in, but nope, and super slow,
       | where Claude was not only snappy but nailed it and explained
       | itself.
        
       | spoonfeeder006 wrote:
       | So how is the internal chain of thought represented anyhow? What
       | does it look like when someone sees it?
        
       | delusional wrote:
       | Great, yet another step towards the inevitable conclusion. Now
       | I'm not just being asked to outsource my thinking to my computer,
       | but instead to a black box operated by a for-profit company for
       | the benefit of Microsoft. Not only will they not tell me the
       | whole reasoning chain, they wont even tell me how they came up
       | with it.
       | 
       | Tell me, users of this tool. What's even are you? If you've
       | outsourced your thinking to a corporation, what happens to your
       | unique perspective? your blend of circumstance and upbringing?
       | Are you really OK being reduced to meaningless computation and
       | worthless weights. Don't you want to be something more?
        
       | wrath224 wrote:
       | Trying this on a few hard problems on PicoGYM and holy heck I'm
       | impressed. I had to give it a hint but that's the same info a
       | human would have. Problem was Sequences (crypto) hard.
       | 
       | https://chatgpt.com/share/66e363d8-5a7c-8000-9a24-8f5eef4451...
       | 
       | Heh... GPT-4o also solved this after I tried and gave it about
       | the same examples. Need to further test but it's promising !
        
       | kypro wrote:
       | Reminder that it's still not too late to change the direction of
       | progress. We still have time to demand that our politicians put
       | the breaks on AI data centres and end this insanity.
       | 
       | When AI exceeds humans at all tasks humans become economically
       | useless.
       | 
       | People who are economically useless are also politically
       | powerless, because resources are power.
       | 
       | Democracy works because the people (labourers) collectivised hold
       | a monopoly on the production and ownership of resources.
       | 
       | If the state does something you don't like you can strike or
       | refuse to offer your labour to a corrupt system. A state must
       | therefore seek your compliance. Democracies do this by given
       | people want they want. Authoritarian regimes might seek
       | compliance in other ways.
       | 
       | But what is certain is that in a post-AGI world our leaders can
       | be corrupt as they like because people can't do anything.
       | 
       | And this is obvious when you think about it... What power does a
       | child or a disable person hold over you? People who have no
       | ability to create or amass resources depend on their
       | beneficiaries for everything including basics like food and
       | shelter. If you as a parent do not give your child resources,
       | they die. But your child does not hold this power over you. In
       | fact they hold no power over you because they cannot withhold any
       | resources from you.
       | 
       | In a post-AGI world the state would not depend on labourers for
       | resources, jobless labourers would instead depend on the state.
       | If the state does not provide for you like you provide for your
       | children, you and your family will die.
       | 
       | In a good outcome where humans can control the AGI, you and your
       | family will become subjects to the whims of state. You and your
       | children will suffer as the political corruption inevitably
       | arises.
       | 
       | In a bad outcome the AGI will do to cities what humans did to
       | forests. And AGI will treat humans like humans treat animals.
       | Perhaps we don't seek the destruction of the natural environment
       | and the habitats of animals, but woodland and buffalo are sure
       | inconvenient when building a super highway.
       | 
       | We can all agree there will be no jobs for our children. Even if
       | you're an "AI optimist" we probably still agree that our kids
       | will have no purpose. This alone should be bad enough, but if I'm
       | right then there will be no future for them at all.
       | 
       | I will not apologise for my concern about AGI and our clear
       | progress towards that end. It is not my fault if others cannot
       | see the path I seem to see so clearly. I cannot simply be quiet
       | about this because there's too much at stake. If you agree with
       | me at all I urge you to not be either. Our children can have a
       | great future if we allow them to have it. We don't have long, but
       | we do still have time left.
        
       | kgeist wrote:
       | Asked it to write PyTorch code which trains an LLM and it
       | produced 23 steps in 62 seconds.
       | 
       | With gpt4-o it immediately failed with random errors like
       | mismatched tensor shapes and stuff like that.
       | 
       | The code produced by gpt-o1 seemed to work for some time but
       | after some training time it produced mismatched batch sizes.
       | Also, gpt-o1 enabled cuda by itself while for gpt-4o, I had to
       | specifically spell it out (it always used cpu). However, showing
       | gpt-o1 the error output resulted in broken code again.
       | 
       | I noticed that back-and-forth iteration when it makes mistakes
       | has worse experience because now there's always 30-60 sec time
       | delays. I had to have 5 back-and-forths before it produced
       | something which does not crash (just like gpt-4o). I also suspect
       | too many tokens inside the CoT context can make it accidentally
       | forget some stuff.
       | 
       | So there's some improvement, but we're still not there...
        
       | koreth1 wrote:
       | The performance on programming tasks is impressive, but I think
       | the limited context window is still a big problem.
       | 
       | Very few of my day-to-day coding tasks are, "Implement a
       | completely new program that does XYZ," but more like, "Modify a
       | sizable existing code base to do XYZ in a way that's consistent
       | with its existing data model and architecture." And the only way
       | to do those kinds of tasks is to have enough context about the
       | existing code base to know where everything should go and what
       | existing patterns to follow.
       | 
       | But regardless, this does look like a significant step forward.
        
         | suchar wrote:
         | I would imagine that good IDE integration would summarise each
         | module/file/function and feed high-level project overview (best
         | case: with business project description provided by the user)
         | and during CoT process model would be able to ask about more
         | details (specific file/class/function).
         | 
         | Humans work on abstractions and I see no reason to believe that
         | models cannot do the same
        
       | beaugunderson wrote:
       | the cipher example is impressive on the surface, but I threw a
       | couple of my toy questions at o1-preview and it still
       | hallucinates a bunch of nonsense (but now uses more electricity
       | to do so).
        
       | Havoc wrote:
       | o1
       | 
       | Maybe they should spend some of their billions on marketing
       | people. Gpt4o was a stretch. Wtf is o1
        
       | franze wrote:
       | ChatGPT is now a better coder than I ever was.
        
       | adamtaylor_13 wrote:
       | Laughing at the comparison to "4o" as if that model even holds a
       | candle to GPT-4. 4o is _cheaper_--it's nowhere near as powerful
       | as GPT-4, as much as OpenAI would like it to be.
        
       | aantix wrote:
       | Feels like the challenge here is to somehow convey to the end
       | user, how the quality of output is so much better.
        
       | sys32768 wrote:
       | Time to fire up System Shock 2:
       | 
       | > Look at you, hacker: a pathetic creature of meat and bone,
       | panting and sweating as you run through my corridors. How can you
       | challenge a perfect, immortal machine?
        
       | scotty79 wrote:
       | Transformers have exactly two strengths. None of them is
       | "attention". Attention could be replaced with any arbitrary
       | division of the network and it would learn just as well.
       | 
       | First true strength is obvious, it's that they are
       | parallelisable. This is a side effect of people fixating on
       | attention. If they came up with any other structure that results
       | in the same level of parallelisability it would be just as good.
       | 
       | Second strong side is more elusive to many people. It's the
       | context window. Because the network is not ran just once but once
       | for every word it doesn't have to solve a problem in one step. It
       | can iterate while writing down intermediate variables and access
       | them. The dumb thing so far was that it was required to produce
       | the answer starting with the first word it was allowed to write
       | down. So to actually write down the information it needs on the
       | next iteration it had to disguise it as a part of the answer. So
       | naturally the next step is to allow to just write down whatever
       | it pleases and iterate freely until it's ready to start giving us
       | the answer.
       | 
       | It's still seriously suboptimal that what is allowed to write
       | down has to be translated to tokens and back but I see how this
       | might make things easier for humans for training and
       | explainability. But you can rest assured that at some point this
       | "chain of thought" will become just chain of full output states
       | of the network, not necessarily corresponding to any tokens.
       | 
       | So congrats to researchers they found out that their billion
       | dollar Turing machine benefits from having a tape it can use for
       | more than just printing the output.
       | 
       | PS
       | 
       | There's another advantage of transformers but I can't tell how
       | important it is. It's the "shortcuts" from earlier layers to way
       | deeper ones bypassing the ones along the way. Obviously network
       | would be more capable if every neuron was connected with every
       | neuron in every preceding layer but we don't have hardware for
       | that so some sprinkled "shortcuts" might be a reasonable
       | compromise that might make network less crippled than MLP.
       | 
       | Given all that I'm not surprised at all with the direction openai
       | took and the gains it achieved.
        
       | ttul wrote:
       | I've given this a test run on some email threads, asking the
       | model to extract the positions and requirements of each person in
       | a lengthy and convoluted discussion. It absolutely nailed the
       | result, far exceeding what Claude 3.5 Sonnet was capable of -- my
       | previous goto model for such analysis work. I also used it to
       | apply APA style guidelines to various parts of a document and it
       | executed the job flawlessly and with a tighter finesse than
       | Claude. Claude's response was lengthier - correct, but
       | unnecessarily long. gpt-o1-preview combined several logically-
       | related bullets into a single bullet, showing how chain of
       | thought reasoning gives the model more time to comprehend things
       | and product a result that is not just correct, but "really
       | correct".
        
       ___________________________________________________________________
       (page generated 2024-09-12 23:00 UTC)