[HN Gopher] Learning to Reason with LLMs
___________________________________________________________________
Learning to Reason with LLMs
Author : fofoz
Score : 1289 points
Date : 2024-09-12 17:08 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| fsflover wrote:
| Dupe: https://news.ycombinator.com/item?id=41523050
| farresito wrote:
| Damn, that looks like a big jump.
| deisteve wrote:
| so o1 seems like it has real measurable edge, crushing it in
| every single metric, i mean 1673 elo is insane, and 89th
| percentile is like a whole different league, and it looks like
| it's not just a one off either, it's consistently performing
| way better than gpt-4o across all the datasets, even in the
| ones where gpt-4o was already doing pretty well, like math and
| mmlu, o1 is just taking it to the next level, and the fact that
| it's not even showing up in some of the metrics, like mmmu and
| mathvista, just makes it look even more impressive, i mean
| what's going on with gpt-4o, is it just a total dud or what,
| and btw what's the deal with the preview model, is that like a
| beta version or something, and how does it compare to o1, is it
| like a stepping stone to o1 or something, and btw has anyone
| tried to dig into the actual performance of o1, like what's it
| doing differently, is it just a matter of more training data or
| is there something more going on, and btw what's the plan for
| o1, is it going to be released to the public or is it just
| going to be some internal tool or something
| farresito wrote:
| > like what's it doing differently, is it just a matter of
| more training data or is there something more going on
|
| Well, the model doesn't start with "GPT", so maybe they have
| come up with something better.
| rvnx wrote:
| It sounds like GPT-4o with a long CoT prompt no ?
| spaceman_2020 wrote:
| 1673 ELO is wild
|
| If its actually true in practice, I sincerely cannot imagine
| a scenario where it would be cheaper to hire actual junior or
| mid-tier developers (keyword: "developers", not architects or
| engineers).
|
| 1,673 ELO should be able to build very complex, scalable apps
| with some guidance
| deisteve wrote:
| currently my workflow is generate some code, run it, if it
| doesn't run i tell LLM what I expected, it will then
| produce code and I frequently tell it how to reason about
| the problem.
|
| with O1 being in the 89th percentile would mean it should
| be able to think at junior to intermediate level with very
| strong consistency.
|
| i dont think people in the comments realize the implication
| of this. previously LLMs were able to only "pattern match"
| but now its able to evaluate itself (with some guidance
| ofc) essentially, steering the software into depth of edge
| cases and reason about it in a way that feels natural to
| us.
|
| currently I'm copying and pasting stuff and notifying LLM
| the results but once O1 is available its going to
| significantly lower that frequency.
|
| For example, I expect it to self evaluate the code its
| generate and think at higher levels.
|
| ex) oooh looks like this user shouldn't be able to escalate
| privileges in this case because it would lead to security
| issues or it could conflict with the code i generated 3
| steps ago, i'll fix it myself.
| usaar333 wrote:
| I'm not sure how well codeforces percentiles correlate to
| software engineering ability. Looking at all the data, it
| still isn't. Key notes:
|
| 1. AlphaCode 2 was already at 1650 last year.
|
| 2. SWE-bench verified under an agent has jumped from 33.2%
| to 35.8% under this model (which doesn't really matter).
| The full model is at 41.4% which still isn't a game changer
| either.
|
| 3. It's not handling open ended questions much better than
| gpt-4o.
| deisteve wrote:
| i think you are right now actually initially i got
| excited but now i think OpenAI pulled the hype card again
| to seem relevant as they struggle to be profitable
|
| Claude on the other hand has been fantastic and seems to
| do similar reasoning behind the scenes with RL
| usaar333 wrote:
| The model is really impressive to be fair. It's just how
| economically relevant it is.
| dinobones wrote:
| Generating more "think out loud" tokens and hiding them from the
| user...
|
| Idk if I'm "feeling the AGI" if I'm being honest.
|
| Also... telling that they choose to benchmark against CodeForces
| rather than SWE-bench.
| thelastparadise wrote:
| Why not? Isn't that basically what humans do? Sit there and
| think for a while before answering, going down different
| branches/chains of thought?
| dinobones wrote:
| This new approach is showing:
|
| 1) The "bitter lesson" may not be true, and there is a
| fundamental limit to transformer intelligence.
|
| 2) The "bitter lesson" is true, and there just isn't enough
| data/compute/energy to train AGI.
|
| All the cognition should be happening inside the transformer.
| Attention is all you need. The possible cognition and
| reasoning occurring "inside" in high dimensions is much more
| advanced than any possible cognition that you output into
| text tokens.
|
| This feels like a sidequest/hack on what was otherwise a
| promising path to AGI.
| gradus_ad wrote:
| Does that mean human intelligence is cheapened when you
| talk out a problem to yourself? Or when you write down
| steps solving a problem?
|
| It's the exact same thing here.
| barrell wrote:
| lol come on it's not the exact same thing. At best this
| is like gagging yourself while you talk about it then
| engaging yourself when you say the answer. And that
| presupposing LLMs are thinking in, your words, exactly
| the same way as humans.
|
| At best it maybe vaguely resembles thinking
| exe34 wrote:
| > "lol come on"
|
| I've never found this sort of argument convincing. it's
| very Chalmers.
| youssefabdelm wrote:
| > Does that mean human intelligence is cheapened when you
| talk out a problem to yourself?
|
| In a sense, maybe yeah. Of course if one were to really
| be absolute about that statement it would be absurd, it
| would greatly overfit the reality.
|
| But it is interesting to assume this statement as true.
| Oftentimes when we think of ideas "off the top of our
| heads" they are not as profound as ideas that "come to
| us" in the shower. The subconscious may be doing 'more'
| 'computation' in a sense. Lakoff said the subconscious
| was 98% of the brain, and that the conscious mind is the
| tip of the iceberg of thought.
| slashdave wrote:
| The similarity is cosmetic only. The reason it is used is
| because it's easy to leverage existing work in LLMs, and
| scaling (although not cheap) is an obvious approach.
| grbsh wrote:
| On the contrary, this suggests that the bitter lesson is
| alive and kicking. The bitter lesson doesn't say "compute
| is all you need", it says "only those methods which allow
| you to make better use of hardware as hardware itself
| scales are relevant".
|
| This chain of thought / reflection method allows you to
| make better use of the hardware as the hardware itself
| scales. If a given transformer is N billion parameters, and
| to solve a harder problem we estimate we need 10N billion
| parameters, one way to do it is to build a GPU cluster 10x
| larger.
|
| This method shows that there might be another way: instead
| train the N billion model differently so that we can use
| 10x of it at inference time. Say hardware gets 2x better in
| 2 years -- then this method will be 20x better than now!
| seydor wrote:
| Attention is about similarity/statistical correlation which
| is fundamentally stochastic , while reasoning needs to be
| truthful and exact to be successful.
| user9925 wrote:
| I think it's too soon to tell. Training the next generation
| of models means building out entire datacenters. So while
| they wait they have engineers build these sidequests/hacks.
| 93po wrote:
| Karpathy himself believes that neural networks are
| perfectly plausible as a key component to AGI. He has said
| that it doesn't need to be superseded by something better,
| it's just that everything else around it (especially
| infrastructure) needs to improve. As one of the most
| valuable opinions in the entire world on the subject, I
| tend to trust what he said.
|
| source: https://youtu.be/hM_h0UA7upI?t=973
| authorfly wrote:
| Imagine instead the bitter lesson says: we can expand an
| outwards circle in many dimensions of ways to continuously
| mathematically manipulate data to adjust outputs.
|
| Even the attention-token approach is on the grand scale of
| things a simple line outwards from the centre; we have not
| even explored around the centre (with the same compute
| spend) for things like non-token generation, different
| layers/different activation functions and norming /
| query/key/value set up (why do we only use the 3 inherent
| to contextualising tokens, why not add a 4th matrix for
| something else?), character, sentence, whole thought,
| paragraph one-shot generation, positional embeddings which
| could work differently.
|
| The bitter lesson says there is almost a work completely
| untouched by our findings for us to explore. The temporary
| work of non-data approaches can piggy back off a point on
| the line; it cannot expand it like we can as we exude out
| from the circle..
| aktuel wrote:
| Sure, but if I want a human, I can hire a human. Humans also
| do many other things I don't want my LLM to do.
| forgot_old_user wrote:
| well it could be a lot cheaper to hire the AI model instead
| of a human?
| imiric wrote:
| Except that these aren't thoughts. These techniques are
| improvements to how the model breaks down input data, and how
| it evaluates its responses to arrive at a result that most
| closely approximates patterns it was previously rewarded for.
| Calling this "thinking" is anthropomorphizing what's really
| happening. "AI" companies love to throw these phrases around,
| since it obviously creates hype and pumps up their valuation.
|
| Human thinking is much more nuanced than this mechanical
| process. We rely on actually understanding the meaning of
| what the text represents. We use deduction, intuition and
| reasoning that involves semantic relationships between ideas.
| Our understanding of the world doesn't require "reinforcement
| learning" and being trained on all the text that's ever been
| written.
|
| Of course, this isn't to say that machine learning methods
| can't be useful, or that we can't keep improving them to
| yield better results. But these are still methods that mimic
| human intelligence, and I think it's disingenuous to label
| them as such.
| golol wrote:
| It becomes thinking when you reinforcement learn on those
| Chain-of-Thought generations. The LLM is just a very good
| initialization.
| slashdave wrote:
| Without a world model, not really.
| UniverseHacker wrote:
| The whole thing is a world model- accurately predicting
| text that describes things happening in a world, can only
| be done by modeling the world.
| freejazz wrote:
| Is it?
| NewEntryHN wrote:
| Yes but with concepts instead of tokens spelling out the
| written representation of those concepts.
| WXLCKNO wrote:
| Exploring different approaches and stumbling on AGI eventually
| through a combination of random discoveries will be the way to
| go.
|
| Same as Bitcoin being the right combination of things that
| already existed.
| ActionHank wrote:
| Crypto being used as an example of how we have moved forward
| successfully as a species is backward toilet sitting
| behaviour.
| lloydatkinson wrote:
| What's with this how many r's in a strawberry thing I keep
| seeing?
| bn-l wrote:
| It's a common LLM riddle. Apparently many fail to give the
| right answer.
| seydor wrote:
| Somebody please ask o1 to solve it
| lloydatkinson wrote:
| The link shows it solving it
| dr_quacksworth wrote:
| LLM are bad at answering that question because inputs are
| tokenized.
| swalsh wrote:
| Models don't really predict the next word, they predict the
| next token. Strawberry is made up of multiple tokens, and the
| model doesn't truely understand the characters in it... so it
| tends to struggle.
| runjake wrote:
| This became something of a meme.
|
| https://community.openai.com/t/incorrect-count-of-r-characte...
| andrewla wrote:
| What's amazing is that given how LLMs receive input data (as
| tokenized streams, as other commenters have pointed out) it's
| remarkable that it can ever answer this question correctly.
| valine wrote:
| The model performance is driven by chain of thought, but they
| will not be providing chain of thought responses to the user for
| various reasons including competitive advantage.
|
| After the release of GPT4 it became very common to fine-tune non-
| OpenAI models on GPT4 output. I'd say OpenAI is rightly concerned
| that fine-tuning on chain of thought responses from this model
| would allow for quicker reproduction of their results. This
| forces everyone else to reproduce it the hard way. It's sad news
| for open weight models but an understandable decision.
| tomtom1337 wrote:
| Can you explain what you mean by this?
| tomduncalf wrote:
| I think they mean that you won't be able to see the
| "thinking"/"reasoning" part of the model's output, even
| though you pay for it. If you could see that, you might be
| able to infer better how these models reason and replicate it
| as a competitor
| ffreire wrote:
| You can see an example of the Chain of Thought in the post,
| it's quite extensive. Presumably they don't want to release
| this so that it is raw and unfiltered and can better monitor
| for cases of manipulation or deviation from training. What GP
| is also referring to is explicitly stated in the post: they
| also aren't release the CoT for competitive reasons, so that
| presumably competitors like Anthropic are unable to use the
| CoT to train their own frontier models.
| gwd wrote:
| > Presumably they don't want to release this so that it is
| raw and unfiltered and can better monitor for cases of
| manipulation or deviation from training.
|
| My take was:
|
| 1. A genuine, un-RLHF'd "chain of thought" might contain
| things that shouldn't be told to the user. E.g., it might
| at some point think to itself, "One way to make an
| explosive would be to mix $X and $Y" or "It seems like they
| might be able to poison the person".
|
| 2. They want the "Chain of Thought" as much as possible to
| reflect the _actual_ reasoning that the model is using; in
| part so that they can understand what the model is actually
| thinking. They fear that if they RLHF the chain of thought,
| the model will self-censor in a way which undermines their
| ability to see what it 's really thinking
|
| 3. So, they RLHF _only_ the final output, _not_ the CoT,
| letting the CoT be as frank within itself as any human; and
| post-filter the CoT for the user.
| Y_Y wrote:
| RLHF is one thing, but now that the training is done it
| has no bearing on whether or not you can show the chain
| of thought to the user.
| teaearlgraycold wrote:
| Including the chain of thought would provide competitors with
| training data.
| andrewla wrote:
| This is a transcription of a literal quote from the article:
|
| > Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue
| the chain of thought monitoring, we have decided not to show
| the raw chains of thought to users
| baq wrote:
| At least they're open about not being open. Very meta
| OpenAI.
| seydor wrote:
| The open source/weights models so far have proved that openAI
| doesn't have some special magic sauce. I m confident we ll soon
| have a model from Meta or others that s close to this level of
| reasoning. [Also consider that some of their top researchers
| have departed]
|
| On a cursory look, it looks like the chain of thought is a long
| series of chains of thought balanced on each step, with a small
| backtracking added whenever a negative result occurs, sort of
| like solving a maze.
| zamalek wrote:
| I suspect that the largest limiting factor for a competing
| model will be the dataset. Unless they somehow used GPT4 to
| generate the dataset somehow, this is an extremely novel
| dataset to have to build.
| j_maffe wrote:
| They almost definitely used existing models for generating
| it. The human feedback part, however, is the expensive
| aspect.
| tarruda wrote:
| I would love to see Meta releasing CoT specialized model as a
| LoRa we can apply to existing 3.1 models
| msp26 wrote:
| That's unfortunate. When an LLM makes a mistake it's very
| helpful to read the CoT and see what went wrong (input
| error/instruction error/random shit)
| dragonwriter wrote:
| Yeah, exposed chain of thought is more useful as a user, as
| well as being useful for training purposes.
| riku_iki wrote:
| I think we may discover that model do some cryptic mess
| inside instead of some clean reasoning.
| hadlock wrote:
| Loopback to: "my code works. why does my code work?"
| MVissers wrote:
| I'd say depends. If the model iterates 100x I'd just say
| give me the output.
|
| Same with problem solving in my brain: Sure, sometimes it
| helps to think out loud. But taking a break and let my
| unconcious do the work is helpful as well. For complex
| problems that's actually nice.
|
| I think eventually we don't care as long as it works or we
| can easily debug it.
| ramadis wrote:
| It'd be helpful if they exposed a summary of the chain-of-
| thought response instead. That way they'd not be leaking the
| actual tokens, but you'd still be able to understand the
| outline of the process. And, hopefully, understand where it
| went wrong.
| seydor wrote:
| They do, according to the example
| ashellunts wrote:
| Exactly that I see in the Android app.
| yunohn wrote:
| Given the significant chain of thought tokens being generated,
| it also feels a bit odd to hide it from a cost fairness
| perspective. How do we believe they aren't inflating it for
| profit?
| wmf wrote:
| That sounds like the GPU labor theory of value that was
| debunked a century ago.
| dragonwriter wrote:
| No, its the fraud theory of charging for usage that is
| unaccountable that has been repeatedly proven true when
| unaccountable bases for charges have been deployed.
| nfw2 wrote:
| The one-shot models aren't going away for anyone who
| wants to program the chain-of-thought themselves
| wmf wrote:
| Yeah, if they are charging for some specific resource
| like tokens then it better be accurate. But ultimately
| utility-like pricing is a mistake IMO. I think they
| should try to align their pricing with the customer value
| they're creating.
| yunohn wrote:
| Not sure why you didn't bother to check their pricing page
| (1) before dismissing my point. They are charging
| significantly more for both input (3x) and output (4x)
| tokens when using o1.
|
| Per 1M in/out tokens:
|
| GPT4o - 5$/15$
|
| O1-preview - 15$/60$
|
| (1) https://openai.com/api/pricing
| wmf wrote:
| My point is that "cost fairness" is not a thing. Either
| o1 is worth it to you or it isn't.
| dongping wrote:
| If there's a high premium, then one might want to wait
| for a year or two for the premium to vanish.
| yunohn wrote:
| It's really unclear to me what you understood by "cost
| fairness".
|
| I'm saying if you charge me per brick laid, but you can't
| show me how many bricks were laid, nor can I calculate
| how many should have been laid - how do I trust your
| invoice?
|
| Note: The reason I say all this is because OpenAI is
| simultaneously flailing for funding, while being
| inherently unprofitable as it continues to boil the ocean
| searching for strawberries.
| rglullis wrote:
| When are they going to change the name to reflect their
| complete change of direction?
|
| Also, what is going to be their excuse to defend themselves
| against copyright lawsuits if they are going to
| "understandably" keep their models closed?
| tcdent wrote:
| CoT is now their primary method for alignment. Exposing that
| information would negate that benefit.
|
| I don't agree with this, but it definitely carries higher
| weight in their decision making than leaking relevant training
| info to other models.
| zellyn wrote:
| This. Please go read and understand the alignment argument
| against exposing chain of thought reasoning.
| amelius wrote:
| > I'd say OpenAI is rightly concerned that fine-tuning on chain
| of thought responses from this model would allow for quicker
| reproduction of their results.
|
| Why? They're called "Open" AI after all ...
| ashellunts wrote:
| I see chain of thought responses in chatgpt android app.
| ashellunts wrote:
| Tested cipher example, and it got it right. But "thinking
| logs" I see in the app looks like a summary of actual chain
| of thought messages that are not visible.
| p1esk wrote:
| _after weighing multiple factors including user experience,
| competitive advantage, and the option to pursue the chain of
| thought monitoring, we have decided not to show the raw chains of
| thought to users._
| zaptrem wrote:
| This also makes them less useful because I can't just click
| stop generation when they make a logical error re: coding.
| neonbjb wrote:
| You wouldn't do that to this model. It finds its own mistakes
| and corrects them as it is thinking through things.
| zaptrem wrote:
| No model is perfect, the less I can see into what it's
| "thinking" the less productively I can use it. So much for
| interpretability.
| sterlind wrote:
| "Open"AI is such a comically ironic name at this point.
| swalsh wrote:
| We're not going to give you training data... for a better user
| experience.
| scosman wrote:
| Saying "competitive advantage" so directly is surprising.
|
| There must be some magic sauce here for guiding LLMs which
| boosts performance. They must think inspecting a reasonable
| number of chains would allow others to replicate it.
|
| They call GPT 4 a model. But we don't know if it's really a
| system that builds in a ton of best practices and secret
| tactics: prompt expansion, guided CoT, etc. Dalle was
| transparent that it automated re-generating the prompts, adding
| missing details prior to generation. This and a lot more could
| all be running under the hood here.
| 0x_rs wrote:
| Lame but not atypical of OpenAI. Too bad, but I'm expecting
| competitors to follow with this sort of implementation and
| better. Being able to view the "reasoning" process and
| especially being able to modify it and re-render the answer may
| be faster than editing your prompt a few times until you get
| the desired response, if you even manage to do that.
| skywhopper wrote:
| No direct indication of what "maximum test time" means, but if
| I'm reading the obscured language properly, the best scores on
| standardized tests were generated across a thousand samples with
| supplemental help provided.
|
| Obviously, I hope everyone takes what any company says about the
| capabilities of its own software with a huge grain of salt. But
| it seems particularly called for here.
| immortal3 wrote:
| Honestly, it doesn't matter for the end user if there are more
| tokens generated between the AI reply and human message. This is
| like getting rid of AI wrappers for specific tasks. If the jump
| in accuracy is actual, then for all practical purposes, we have a
| sufficiently capable AI which has the potential to boost
| productivity at the largest scale in human history.
| Lalabadie wrote:
| It starts to matter if the compute time is 10-100 fold, as the
| provider needs to bill for it.
|
| Of course, that's assuming it's not priced for market
| acquisition funded by a huge operational deficit, which is a
| rarely safe to conclude with AI right now.
| skywhopper wrote:
| Given that their compute-time vs accuracy charts labeled the
| compute time axis as logarithmic would worry me greatly about
| this aspect.
| deisteve wrote:
| yeah this is kinda cool i guess but 808 elo is still pretty bad
| for a model that can supposedly code like a human, i mean 11th
| percentile is like barely scraping by, and what even is the point
| of simulating codeforces if youre just gonna make a model that
| can barely compete with a decent amateur, and btw what kind of
| contest allows 10 submissions, thats not how codeforces works,
| and what about the time limits and memory limits and all that
| jazz, did they even simulate those, and btw how did they even get
| the elo ratings, is it just some arbitrary number they pulled out
| of their butt, and what about the model that got 1807 elo, is
| that even a real model or just some cherry picked result, and btw
| what does it even mean to "perform better than 93% of
| competitors" when the competition is a bunch of humans who are
| all over the place in terms of skill, like what even is the
| baseline for comparison
|
| edit: i got confused with the Codeforce. it is indeed zero shot
| and O1 is potentially something very new I hope Anthropic and
| others will follow suit
|
| any type of reasoning capability i'll take it !
| qt31415926 wrote:
| 808 ELO was for GPT-4o.
|
| I would suggest re-reading more carefully
| deisteve wrote:
| you are right i read the charts wrong. O1 has significant
| lead over GPT-4o in the zero shot examples
|
| honestly im spooked
| catchnear4321 wrote:
| oh wow, something you can roughly model as a diy in a base model.
| so impressive. yawn.
|
| at least NVDA should benefit. i guess.
| apsec112 wrote:
| If there's a way to do something like this with Llama I'd love
| to hear about it (not being sarcastic)
| catchnear4321 wrote:
| nurture the model have patience and a couple bash scripts
| apsec112 wrote:
| But what does that mean? I can't do "pip install nurture"
| or "pip install patience". I can generate a bunch of
| answers and take the consensus, but we've been able to do
| that for years. I can do fine-tuning or DPO, but on what?
| catchnear4321 wrote:
| you want instructions on how to compete with OpenAI?
|
| go play more, your priorities and focus on it being work
| are making you think this to be harder than it is, and
| the models can even tell you this.
|
| you don't have to like the answer, but take it seriously,
| and you might come back and like it quite a bit.
|
| you have to have patience because you likely wont have
| scale - but it is not just patience with the response
| time.
| gliiics wrote:
| Congrats to OpenAI for yet another product that has nothing to do
| with the word "open"
| sk11001 wrote:
| And Apple's product line this year? Phones. Nothing to do with
| fruit. Almost 50 years of lying to people. Names should mean
| something!
| achrono wrote:
| Did Apple start their company by saying they will be selling
| apples?
| sk11001 wrote:
| What's the statement that OpenAI are making today which you
| think they're violating? There very well could be one and
| if there is, it would make sense to talk about it.
|
| But arguments like "you wrote $x in a blog post when you
| founded your company" or "this is what the word in your
| name means" are infantile.
| trash_cat wrote:
| It is open in the sense that everyone can use it.
| bionhoward wrote:
| Not people working on AI or those who would like to train AI
| on their logs
| oblio wrote:
| If they would have launched it with Oracle DB style licensing
| their company would have been dead in 1 year.
| Hizonner wrote:
| Only people who exactly share OpenAI's concepts of what
| "alignment" and "safety" should mean can use it to its full
| potential.
| RandomThoughts3 wrote:
| > "Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue the
| chain of thought monitoring, we have decided not to show the raw
| chains of thought to users."
|
| Trust us, we have your best intention in mind. I'm still
| impressed by how astonishingly impossible to like and root for
| OpenAI is for a company with such an innovative product.
| TheAceOfHearts wrote:
| Kinda disappointed that they're hiding the thought process.
| Hopefully the open source community will figure out how to
| effectively match and replicate what OpenAI is doing.
|
| I wonder how far we are from having a model that can correctly
| solve a word soup search problem directly from just a prompt and
| input image. It seems like the crossword example is close. For a
| word search it would require turning the image into an internal
| grid representation, prepare the list of words, and do a search.
| I'd be interested in seeing if this model can already solve the
| word grid search problem if you give it the correct
| representation as an input.
| zozbot234 wrote:
| > Hopefully the open source community will figure out how to
| effectively match and replicate what OpenAI is doing.
|
| No need for that, there is a Reflection 70B model that does the
| exact same thing - with chains of thought being separated from
| the "final answer" via custom 'tag' tokens.
| TheAceOfHearts wrote:
| Wasn't this the model that was proven to have been faking
| their benchmarks recently? Or am I thinking of a different
| model?
| brokensegue wrote:
| yes. it was fake
| zozbot234 wrote:
| Some reported benchmarks do seem to be rather dubious,
| but the 70B model itself is quite real. Sample output:
| $ ollama run reflection:70b-q4_0 >>> hello
| <thinking> To respond to "hello", I'll need to
| consider several factors: 1. The user's
| intent: They're likely saying hello as a greeting.
| 2. Appropriate responses: Common ways to respond to
| "hello" are: - Hello back - Hi
| - Hey 3. Formality level: Since this is an AI
| response, I'll aim for a friendly but professional tone.
| <reflection> The approach of responding with a
| standard greeting seems appropriate in this context. It
| acknowledges the user's hello and provides a polite
| response. </reflection> Given these
| considerations, the most suitable response would be to
| echo "hello" back to the user. </thinking>
| <output> Hello! </output>
| agolio wrote:
| There was a hackernews post a few days ago, pointing to a
| reddit thread where some guys proved that the founder/s
| of relection AI were faking their model by just passing
| the input to Claude (Sonnet 3.5) and stripping the word
| "Claude" from the output, amongst other things. Then when
| they got caught they switched it to GPT 4-o.
|
| After this, I will be very skeptical to anything they
| claim to achieve.
|
| https://news.ycombinator.com/item?id=41484981
| jslakro wrote:
| It's the same, for sure the proximity of that little
| scandal to this announcement is no coincidence.
| Filligree wrote:
| That's the one.
| staticman2 wrote:
| That reflection model is in no way comparable to whatever
| OpenAI is doing.
| rankam wrote:
| I have access to the model via the web client and it does show
| the thought process along the way. It shows a little icon that
| says things like "Examining parser logic", "Understanding data
| structures"...
|
| However, once the answer is complete the chain of thought is
| lost
| knotty66 wrote:
| It's still there.
|
| Where it says "Thought for 20 seconds" - you can click the
| Chevron to expand it and see what I guess is the entire chain
| of thought.
| EgoIncarnate wrote:
| Per OpenAI, it's a summary of the chain of thought, not the
| actual chain of thought.
| crakenzak wrote:
| > we are releasing an early version of this model, OpenAI
| o1-preview, for immediate use in ChatGPT
|
| Awesome!
| dinobones wrote:
| I am interpreting "immediate use in ChatGPT" the same way
| advanced voice mode was promised "in the next few weeks."
|
| Probably 1% of users will get access to it, with a 20/message a
| day rate limit. Until early next year.
| nilsherzig wrote:
| Rate limit is 30 a week for the big one and 50 for the small
| one
| afruitpie wrote:
| Rate limited to 30 messages per week for ChatGPT Plus
| subscribers at launch: https://openai.com/index/introducing-
| openai-o1-preview/
| benterix wrote:
| Read "immediate" in "immediate use" in the same way as "open"
| in "OpenAI".
| apsec112 wrote:
| You can use it, I just tried a few minutes ago. It's
| apparently limited to 30 messages/week, though.
| rvnx wrote:
| The option isn't there for us (though the blogpost says
| otherwise), even after CTRL-SHIFT-R, hence the parent
| comment.
| Ninjinka wrote:
| Someone give this model an IQ test stat.
| adverbly wrote:
| You're kidding right? The tests they gave it are probably
| better tests than IQ tests at determining actually useful
| problem solving skills...
| Vecr wrote:
| It can't do large portions of the parts of an IQ test (not
| multi-modal). Otherwise I think it's essentially superhuman,
| modulo tokenization issues (please start running byte-by-byte
| or at least come up with a better tokenizer).
| modeless wrote:
| > We have found that the performance of o1 consistently improves
| with more reinforcement learning (train-time compute) and with
| more time spent thinking (test-time compute).
|
| Wow. So we can expect scaling to continue after all. Hyperscalers
| feeling pretty good about their big bets right now. Jensen is
| smiling.
|
| This is the most important thing. Performance today matters less
| than the scaling laws. I think everyone has been waiting for the
| next release just trying to figure out what the future will look
| like. This is good evidence that we are on the path to AGI.
| ffsm8 wrote:
| It'd be interesting for sure if true. Gotta remember that this
| is a marketing post though, let's wait a few months and see if
| its actually true. Things are definitely interesting, wherever
| these techniques will get us AGI or not
| XCSme wrote:
| Nvidia stock go brrr...
| acchow wrote:
| Even when we start to plateau on direct LLM performance, we can
| still get significant jumps by stacking LLMs together or
| putting a cluster of them together.
| gizmo wrote:
| Microsoft, Google, Facebook have all said in recent weeks that
| they fully expect their AI datacenter spend to accelerate. They
| are effectively all-in on AI. Demand for nvidia chips is
| effectively infinite.
| seydor wrote:
| Until the first LLM that can improve itself occurs. Then
| $NVDA tanks
| modeless wrote:
| More, from an OpenAI employee:
|
| > I really hope people understand that this is a new paradigm:
| don't expect the same pace, schedule, or dynamics of pre-
| training era. I believe the rate of improvement on evals with
| our reasoning models has been the fastest in OpenAI history.
|
| > It's going to be a wild year.
|
| https://x.com/willdepue/status/1834294935497179633
| breck wrote:
| I LOVE the long list of contributions. It looks like the credits
| from a Christoper Nolan film. So many people involved. Nice care
| to create a nice looking credits page. A practice worth copying.
|
| https://openai.com/openai-o1-contributions/
| rfw300 wrote:
| A lot of skepticism here, but these are astonishing results!
| People should realize we're reaching the point where LLMs are
| surpassing humans in any task limited in scope enough to be a
| "benchmark". And as anyone who's spent time using Claude 3.5
| Sonnet / GPT-4o can attest, these things really are useful and
| smart! (And, if these results hold up, O1 is much, much smarter.)
| This is a nerve-wracking time to be a knowledge worker for sure.
| bigstrat2003 wrote:
| I cannot, in fact, attest that they are useful and smart. LLMs
| remain a fun toy for me, not something that actually produces
| useful results.
| pdntspa wrote:
| I have been deploying useful code from LLMs right and left
| over the last several months. They are a significant force
| accelerator for programmers if you know how to prompt them
| well.
| fiddlerwoaroof wrote:
| We'll see if this is a good idea when we start having
| millions of lines of LLM-written legacy code. My experience
| maintaining such code so far has been very bad:
| accidentally quadratic algorithms; subtly wrong code that
| looks right; and un-idiomatic use of programming language
| features.
| deisteve wrote:
| ah i see so you're saying that LLM-written code is
| already showing signs of being a maintenance nightmare,
| and that's a reason to be skeptical about its adoption.
| But isn't that just a classic case of 'we've always done
| it this way' thinking?
|
| legacy code is a problem regardless of who wrote it.
| Humans have been writing suboptimal, hard-to-maintain
| code for decades. At least with LLMs, we have the
| opportunity to design and implement better coding
| standards and review processes from the start.
|
| let's be real, most of the code written by humans is not
| exactly a paragon of elegance and maintainability either.
| I've seen my fair share of 'accidentally quadratic
| algorithms' and 'subtly wrong code that looks right'
| written by humans. At least with LLMs, we can identify
| and address these issues more systematically.
|
| As for 'un-idiomatic use of programming language
| features', isn't that just a matter of training the LLM
| on a more diverse set of coding styles and idioms? It's
| not like humans have a monopoly on good coding practices.
|
| So, instead of throwing up our hands, why not try to
| address these issues head-on and see if we can create a
| better future for software development?
| fiddlerwoaroof wrote:
| Maybe it will work out, but I think we'll regret this
| experiment because it's the wrong sort of "force
| accelerator": writing tons of code that should be
| abstracted rather than just dumped out literally has
| always caused the worst messes I've seen.
| medvezhenok wrote:
| Yes, same way that the image model outputs have already
| permeated the blogosphere and pushed out some artists,
| the other models will all bury us under a pile of auto-
| generated code.
|
| We will yearn for the pre-GPT years at some point, like
| we yearn for the internet of the late 90s/early 2000s.
| Not for a while though. We're going through the early
| phases of GPT today, so it hasn't been taken over by the
| traditional power players yet.
| Eggpants wrote:
| When the tool is statistical word vomit based, it will
| never move beyond cool bar trick levels.
| oblio wrote:
| LLMs will allow us to write code faster and create
| applications and systems faster.
|
| Which is how we ended up here, which I guess is
| tolerable, where a webpage with a bit of styling and a
| table uses up 200MB of RAM.
| pdntspa wrote:
| Honestly the code it's been giving me has been fairly
| cromulent. I don't believe in premature optimization and
| it is perfect for getting features out quick and then I
| mold it to what it needs to be.
| deisteve wrote:
| same...but have you considered the broader implications of
| relying on LLMs to generate code? It's not just about being
| a 'force accelerator' for individual programmers, but also
| about the potential impact on the industry as a whole.
|
| If LLMs can generate high-quality code with minimal human
| input, what does that mean for the wages and job security
| of programmers? Will companies start to rely more heavily
| on AI-generated code, and less on human developers? It's
| not hard to imagine a future where LLMs are used to drive
| down programming costs, and human developers are relegated
| to maintenance and debugging work.
|
| I'm not saying that's necessarily a bad thing, but it's
| definitely something that needs to be considered. As
| someone who's enthusiastic about the potential of code gen
| this O1 reasoning capability is going to make big changes.
|
| do you think you'll be willing to take a pay cut when your
| employer realizes they can get similar results from a
| machine in a few seconds?
| airstrike wrote:
| As a society we're not solving for programmer salaries
| but for general welfare which is basically code for
| "cheaper goods and services".
| pdntspa wrote:
| My boss is holding a figurative gun to my head to use
| this stuff. His performance targets necessitate the use
| of it. It is what it is.
| oblio wrote:
| Yeah, but this, in itself, is triggered by a hype wave.
| These come and go. So we can't really judge the long term
| impact from inside the wave.
| fragmede wrote:
| Your job won't be taken by AI, it will be taken by
| someone wielding AI.
| attilakun wrote:
| In a way it's not surprising that people are getting vastly
| different results out of LLMs. People have different skill
| levels when it comes to using even Google. An LLM has a
| vastly bigger input space.
| criddell wrote:
| What's a sample prompt that you've used? Every time I've
| tried to use one for programming, they invent APIs that
| don't exist (but sound like they might) or fail to produce
| something that does what it says it does.
| GaggiX wrote:
| Have you tried Claude 3.5 Sonnet?
| disgruntledphd2 wrote:
| Use Python or JS. The models definitely don't seem to
| perform as well on less hyper prevalent languages.
| randomdata wrote:
| Even then it is hit and miss. If you are doing something
| that is also copy/paste-able out of a StackOverflow
| comment, you're apt to be fine, but as soon as you are
| doing anything slightly less common... Good luck.
| disgruntledphd2 wrote:
| Yeah, fair. It's good for short snippets and ways of
| approaching the problem but not great at execution.
|
| It's like infinitely tailored blog posts, for me at
| least.
| randomdata wrote:
| True. It can be good at giving you pointers towards
| approaching the problem, even if the result is flawed,
| for slightly less common problems. But as you slide even
| father towards esotericism, there is no hope. It won't
| even get you in the right direction. Unfortunately - as
| that is where it would be most useful.
| brianshaler wrote:
| No matter the prompt, there's a significant difference
| between how it handles common problems in popular
| languages (python, JS) versus esoteric algorithms in
| niche languages or tools.
|
| I had a funny one a while back (granted this was probably
| ChatGPT 3.5) where I was trying to figure out what
| payload would get AWS CloudFormation to fix an
| authentication problem between 2 services and ChatGPT
| confidently proposed adding some OAuth querystring
| parameters to the AWS API endpoint.
| pdntspa wrote:
| I just ask it for what I want in very specific detail,
| stating the language and frameworks in use. I keep the
| ideas self-contained -- for example if I need something
| for the frontend I will ask it to make me a webcomponent.
| Asking it to not make assumptions and ask questions on
| ambiguities is also very helpful.
|
| It tends to fall apart on bigger asks with larger
| context. Breaking your task into discrete subtasks works
| well.
| unshavedyak wrote:
| I think that's just the same as using an autocomplete
| efficiently, though. I tend to like them for Search, but
| not for anything i have to "prompt correctly" because i
| feel like i can type fast enough that i'm not too worried
| about auto-completing.
|
| With that said i'm not one of those "It's just a parrot!"
| people. It is, definitely just a parrot atm.. however i'm
| not convinced _we 're not parrots_ as well. Notably i'm not
| convinced that that complexity won't be sufficient to walk
| talk and act like intelligence. I'm not convinced that
| intelligence is different than complexity. I'm not an
| expert though, so this is just some dudes stupid opinion.
|
| I suspect if LLMs can prove to have duck-intelligence (ie
| duck typing but for intelligence) then it'll only be
| achieved in volumes much larger than we imagine. We'll
| continue to refine and reduce how much volume is necessary,
| but nevertheless i expect complexity to be the real
| barrier.
| deisteve wrote:
| 'Not useful' is a pretty low bar to clear, especially when
| you consider the state of the art just 5 years ago. LLMs may
| not be solving world hunger, but they're already being used
| in production for coding
|
| If you're not seeing value in them, maybe it's because you're
| not looking at the right problems. Or maybe you're just not
| using them correctly. Either way, dismissing an entire field
| of research because it doesn't fit your narrow use case is
| pretty short-sighted.
|
| FWIW, I've been using LLMs to generate production code and
| it's saved me weeks if not months. YMMV, I guess
| rfw300 wrote:
| It's definitely the case that there are _some_ programming
| workflows where LLMs aren't useful. But I can say with
| certainty that there are many where they have become
| incredibly useful recently. The difference between even GPT-4
| last year and C3.5 /GPT-4o this year is profound.
|
| I recently wrote a complex web frontend for a tool I've been
| building with Cursor/Claude and I wrote maybe 10% of the
| code; the rest with broad instructions. Had I done it all
| myself (or even with GitHub Copilot only) it would have taken
| 5 times longer. You can say this isn't the most complex task
| on the planet, but it's real work, and it matters a lot! So
| for increasingly many, regardless of your personal
| experience, these things have gone far beyond "useful toy".
| uoaei wrote:
| The sooner those paths are closed for low-effort high-pay
| jobs, the better, IMO. All this money for no work is going
| to our heads.
|
| It's time to learn some real math and science, the era of
| regurgitating UI templates is over.
| rfw300 wrote:
| I don't want to be in the business of LLM defender, but
| it's just hard to imagine this aging well when you step
| back and look at the pace of advancement here. In the
| realm of "real math and science", O1 has improved from 0%
| to 50% on AIME today. A year ago, LLMs could only write
| little functions, not much better than searching
| StackOverflow. Today, they can write thousands of lines
| of code that work together with minimal supervision.
|
| I'm sure this tech continues to have many limitations,
| but every piece of trajectory evidence we have points in
| the same direction. I just think you should be prepared
| for the ratio of "real" work vs. LLM-capable work to
| become increasingly small.
| oblio wrote:
| > The sooner those paths are closed for low-effort high-
| pay jobs, the better, IMO. All this money for no work is
| going to our heads.
|
| > It's time to learn some real math and science, the era
| of regurgitating UI templates is over.
|
| You do realize that software development was one of the
| last social elevators, right?
|
| What you're asking for won't happen, let alone the fact
| that "real math and science" pay a pittance, there's a
| reason the pauper mathematician was a common meme.
| bongodongobob wrote:
| At this point, you're either saying "I don't understand how
| to prompt them" or "I'm a Luddite". They are useful, here to
| stay, and only getting better.
| baq wrote:
| Familiarize yourself with a tool which does half the
| prompting for you, e.g. cursor is pretty good at prompting
| claude 3.5 and it really does make code edits 10x faster (I'm
| not even talking about the fancy stuff about generating apps
| in 5 mins - just plain old edits.)
| jimkoen wrote:
| Is it? They talk about 10k attempts to reach gold medal status
| in the mathematics olympiad, but zero shot performance doesn't
| even place it in the upper 50th percentile.
|
| Maybe I'm confused but 10k attempts on the same problem set
| would make anyone an expert in that topic? It's also weird that
| zero shot performance is so bad, but over a lot of attempts it
| seems to get correct answers? Or is it learning from previous
| attempts? No info given.
| joshribakoff wrote:
| The correct metaphor is that 10,000 attempts would allow
| anyone to cherry pick a successful attempt. You're conflating
| cherry picking with online learning. This is like if an
| entire school of students randomized their answers on a
| multiple choice test, and then you point to someone who
| scored 100% and claim it is proof of the school's expertise.
| jimkoen wrote:
| Yeah but how is it possible that it has such a high margin
| of error? 10k attempts is insane! Were talking about an
| error margin of 50%! How can you deliver "expert reasoning"
| with such an error margin?
| zone411 wrote:
| That's not what "zero shot" means.
| rfw300 wrote:
| It's undeniably less impressive than a human on the same
| task, but who cares at the end of the day? It can do 10,000
| attempts in the time a person can do 1. Obviously improving
| that ratio will help for any number of reasons, but if you
| have a computer that can do a task in 5 minutes that will
| take a human 3 hours, it doesn't necessarily matter very much
| how you got there.
| jsheard wrote:
| How long does it take the operator to sift through those
| 10,000 attempts to find the successful one, when it's not a
| contrived benchmark where the desired answer is already
| known ahead of time? LLMs generally don't know when they've
| failed, they just barrel forwards and leave the user to
| filter out the junk responses.
| jimkoen wrote:
| I have an idea! We should train an LLM with reasoning
| capabilities to sift through all the attempts! /s
| johnny22 wrote:
| why /s ? Isn't that an approach some people are actually
| trying to take?
| miki123211 wrote:
| Even if it's the other way around, if the computer takes 3
| hours on a task that a human can do in 5 minutes, using the
| computer might still be a good idea.
|
| A computer will never go on strike, demand better working
| conditions, unionize, secretly be in cahoots with your
| competitor or foreign adversary, play office politics,
| scroll through Tiktok instead of doing its job, or cause an
| embarrassment to your company by posting a politically
| incorrect meme on its personal social media account.
| RigelKentaurus wrote:
| The blog says "With a relaxed submission constraint, we found
| that model performance improved significantly. When allowed
| 10,000 submissions per problem, the model achieved a score of
| 362.14 - above the gold medal threshold - even without any
| test-time selection strategy."
|
| I am interpreting this to mean that the model tried 10K
| approaches to solve the problem, and finally selected the one
| that did the trick. Am I wrong?
| jimkoen wrote:
| > Am I wrong?
|
| That's the thing, did the operator select the correct
| result or did the model check it's own attempts? No info
| given whatsoever in the article.
| gizmo wrote:
| Even if you disregard the Olympiad performance OpenAI-O1 is,
| if the charts are to be believed, a leap forward in
| intelligence. Also bear in mind that AI researchers are not
| out of ideas on how to make models better and improvements in
| AI chips are the metaphorical tide that lifts all boats. The
| trend is the biggest story here.
|
| I get the AI skepticism because so much tech hype of recent
| years turned out to be hot air (if you're generous, obvious
| fraud if you're not). But AI tools available toady, once you
| get the hang of using them, are pretty damn amazing already.
| Many jobs can be fully automated with AI tools that exist
| today. No further breakthroughs required. And although I
| still don't believe software engineers will find themselves
| out of work anytime soon, I can no longer completely rule it
| out either.
| apsec112 wrote:
| Even without AI, it's gotten ~10,000 times easier to write
| software than in the 1950s (eg. imagine trying to write PyTorch
| code by hand in IBM 650 assembly), but the demand for software
| engineering has only increased, because demand increases even
| faster than supply does. Jevons paradox:
|
| https://en.wikipedia.org/wiki/Jevons_paradox
| aantix wrote:
| The number of tech job postings has tanked - which loosely
| correlates with the rise of AI.
|
| https://x.com/catalinmpit/status/1831768926746734984
| apsec112 wrote:
| GPT-4 came out in March 2023, after most of this drop was
| already finished.
| disgruntledphd2 wrote:
| And also with a large increase in interest rates.
| macinjosh wrote:
| The tanking is more closely aligned with new tax rules that
| went to effect that make it much harder to claim dev time
| as an expense.
| Meekro wrote:
| I'm skeptical because "we fired half our programmers and
| our new AI does their jobs as well as they did" is a story
| that would tear through the Silicon Valley rumor mill. To
| my knowledge, this has not happened (yet).
| guluarte wrote:
| this drop is more related to the FED increasing the
| interest rates
| bognition wrote:
| The local decline in open software engineering positions
| has _nothing_ to do with AI. The best orgs are using AI to
| assist developers in building out new systems and write
| tests. Show me someone who is doing anything bigger than
| that, please I'd love to be proven wrong.
|
| The big decline is driven by a few big factors. Two of
| which are 1- the overhiring that happened in 2021. This was
| followed by the increase of interest rates which
| dramatically constrained the money supply. Investors
| stopped preferring growth over profits. This shift in
| investor preferences is reflected in engineering orgs
| tightening their budgets as they are no longer rewarded for
| unbridled growth.
| nickfromseattle wrote:
| Plus the tax code requiring amortization of developer
| salaries over 5 years instead of the year the salary
| expense is incurred.
| randomdata wrote:
| _> it 's gotten ~10,000 times easier to write software than
| in the 1950s_
|
| It seems many of the popular tools want to make writing
| software harder than in the 2010s, though. Perhaps their
| stewards believe that if they keep making things more and
| more unnecessarily complicated, LLMs won't be able to keep
| up?
| afavour wrote:
| > People should realize we're reaching the point where LLMs are
| surpassing humans in any task limited in scope enough to be a
| "benchmark".
|
| Can you explain what this statement means? It sounds like
| you're saying LLMs are now smart enough to be able to jump
| through arbitrary hoops but are not able to do so when taken
| outside of that comfort zone. If my reading is correct then it
| sounds like skepticism is still warranted? I'm not trying to be
| an asshole here, it's just that my #1 problem with anything AI
| is being able to separate fact from hype.
| rfw300 wrote:
| I think what I'm saying is a bit more nuanced than that. LLMs
| currently struggle with very "wide", long-run reasoning tasks
| (e.g., the evolution over time of a million-line codebase).
| That isn't because they are secretly stupid and their
| capabilities are all hype, it's just that this technology
| currently has a different balance of strengths and weaknesses
| than human intelligence, which tends to more smoothly
| extrapolate to longer-horizon tasks.
|
| We are seeing steady improvement on long-run tasks (SWE-Bench
| being one example) and much more improvement on shorter, more
| well-defined tasks. The latter capabilities aren't "hype" or
| just for show, there really is productive work like that to
| be done in the world! It's just not everything, yet.
| crystal_revenge wrote:
| I have written a ton of evaluations and run countless
| benchmarks and I'm not even close to convinced that we're at
|
| > the point where LLMs are surpassing humans in any task
| limited in scope enough to be a "benchmark"
|
| so much as we're over-fitting these bench marks (and in many
| cases fishing for a particular way of measuring the results
| that looks more impressive).
|
| While it's great that the LLM community has so many benchmarks
| and cares about attempting to measure performance, these
| benchmarks are becoming an increasingly poor signal.
|
| > This is a nerve-wracking time to be a knowledge worker for
| sure.
|
| It might because I'm in this space, but I personally feel like
| this is the _best_ time to working in tech. LLMs still are
| awful at things requiring true expertise while increasingly
| replacing the need for mediocre programmers and dilettantes. I
| 'm increasingly seeing the quality of the technical people I'm
| working with going up. After years of being stuck in rooms with
| leetcode grinding TC chasers, it's very refreshing.
| skepticATX wrote:
| > People should realize we're reaching the point where LLMs are
| surpassing humans in any task limited in scope enough to be a
| "benchmark
|
| This seems like a bold statement considering we have so few
| benchmarks, and so many of them are poorly put together.
| grbsh wrote:
| I like your phrasing - "any task limited in scope enough to be
| a 'benchmark'". Exactly! This is the real gap with LLMs, and
| will continue to be an issue with o1 -- sure, if you can write
| down all of the relevant context information you need to
| perform some computation, LLMs should be able to do it. In
| other words, LLMs are calculators!
|
| I'm not especially nerve-wracked about being a knowledge
| worker, because my day-to-day doesn't consist of being handed a
| detailed specification of exactly what is required, and then me
| 'computing' it. Although this does sound a lot like what a
| product manager does!
| rvz wrote:
| > And as anyone who's spent time using Claude 3.5 Sonnet /
| GPT-4o can attest, these things really are useful and smart!
| (And, if these results hold up, O1 is much, much smarter.) This
| is a nerve-wracking time to be a knowledge worker for sure.
|
| If you have to keep checking the result of an LLM, you do not
| trust it enough to give you the correct answer.
|
| Thus, having to 'prompt' hundreds of times for the answer you
| believe is correct over something that claims to be smart -
| which is why it can confidently convince others that its answer
| is correct (even when it can be totally erroneous).
|
| I bet if Google DeepMind announced the exact same product, you
| would equally be as skeptical with its cherry-picked results.
| latexr wrote:
| > And as anyone who's spent time using Claude 3.5 Sonnet /
| GPT-4o can attest, these things really are useful and smart!
|
| I have spent significant time with GPT-4o, and I disagree. LLMs
| are as useful as a random forum dweller who recognises your
| question as something they read somewhere at some point but are
| too lazy to check so they just say the first thing which comes
| to mind.
|
| Here's a recent example I shared before: I asked GPT-4o which
| Monty Python members have been knighted (not a trick question,
| I wanted to know). It answered Michael Palin and Terry Gilliam,
| and that they had been knighted for X, Y, and Z (I don't recall
| the exact reasons). Then I verified the answer on the BBC,
| Wikipedia, and a few others, and determined only Michael Palin
| has been knighted, _and those weren't even the reasons_.
|
| Just for kicks, I then said I didn't think Michael Palin had
| been knighted. It promptly apologised, told me I was right, and
| that only Terry Gilliam had been knighted. Worse than useless.
|
| Coding-wise, it's been hit or miss with way more misses. It can
| be half-right if you ask it uninteresting boilerplate crap
| everyone has done hundreds of times, but for anything even
| remotely interesting it falls flatter than a pancake under a
| steam roller.
| gizmo wrote:
| I asked GPT-4o and I got the correct answer in one shot:
|
| > Only one Monty Python member, Michael Palin, has been
| knighted. He was honored in 2019 for his contributions to
| travel, culture, and geography. His extensive work as a
| travel documentarian, including notable series on the BBC,
| earned him recognition beyond his comedic career with Monty
| Python (NERDBOT) (Wikipedia).
|
| > Other members, such as John Cleese, declined honors,
| including a CBE (Commander of the British Empire) in 1996 and
| a peerage later on (8days).
|
| Maybe you just asked the question wrong. My prompt was "which
| monty python actors have been knighted. look it up and give
| the reasons why. be brief".
| latexr wrote:
| Yes yes, there's always some "you're holding it wrong"
| apologist.1 Look, it's not a complicated question to ask
| unambiguously. If you understand even a tiny bit of how
| these models work, you know you can make _the exact same
| question_ twice in a row and get wildly different answers.
|
| The point is that you never know what you can trust or not.
| Unless you're intimately familiar with Monty Python
| history, you only know you got the correct answer in one
| shot because I already told you what the right answer is.
|
| Oh, and by the way, I just asked GPT-4o the same question,
| _with your phrasing, copied verbatim_ and it said _two_
| Pythons were knighted: Michael Palin (with the correct
| reasons this time) and John Cleese.
|
| 1 And I've had enough discussions on HN where someone
| insists on the correct way to prompt, then they do it and
| get wrong answers. Which they don't realise until they
| shared it and disproven their own argument.
| oblio wrote:
| Unless I'm mistaken, isn't all the math behind them...
| ultimately probabilistic? Even theoretically they can't
| guarantee the same answer. I'm agreeing with you, by the
| way, just curious if I'm missing something.
| gizmo wrote:
| If you take a photo the photons hitting the camera sensor
| do so in a probabilistic fashion. Still, in sufficient
| light you'll get the same picture every time you press
| the shutter button. In near darkness you'll get a random
| noise picture every time.
|
| Similarly language models are probabilistic and yet they
| get the easiest questions right 100% of the time with
| little variability and the hardest prompts will return
| gibberish. The point of good prompting is to get useful
| responses to questions at the boundary of what the
| language model is capable of.
|
| (You can also configure a language model to generate the
| same output for every prompt without any random noise.
| Image models for instance generate exactly the same image
| pixel for pixel when given the same seed.)
| latexr wrote:
| The photo comparison is disingenuous. Light and colour
| information can be disorganised to a large extent and yet
| you still perceive the same from an image. You can grab a
| photo and apply to it a red filter or make it black and
| white and still understand what's in there, what it
| means, and how it compares to reality.
|
| In comparison, with text a single word can change the
| entire meaning of a sentence, paragraph, or idea. The
| same word in different parts of a text can make all the
| difference between clarity and ambiguity.
|
| It makes no difference how good your prompting is, some
| things are simply unknowable by an LLM. I repeatedly
| asked GPT-4o how many Magic: The Gathering cards based on
| Monty Python exist. It said there are none (wrong)
| because they didn't exist yet at the cut off date of its
| training. No amount of prompting changes that, unless you
| steer it by giving it the answer (at which point there
| would have been no point in asking).
|
| Furthermore, there's no seed that guarantees truth in all
| answers or the best images in all cases. Seeds matter for
| reproducibility, they are unrelated to accuracy.
| gizmo wrote:
| Language is fuzzy in exactly the same way. LLMs can
| create factually correct responses in dozens of languages
| using endless variations in phrasing. You fixate on the
| kind of questions that current language models struggle
| with but you forget that for millions of easier questions
| modern language models already respond with a perfect
| answer every time.
|
| You think the probabilistic nature of language models is
| a fundamental problem that puts a ceiling on how smart
| they can become, but you're wrong.
| gizmo wrote:
| I think your iPhone analogy is apt. Do you want to be the
| person complaining that the phone drops calls or do you
| want to hold it slightly differently and get a lot of use
| out of it?
|
| If you pay careful attention to prompt phrasing you will
| get a lot more mileage out of these models. That's the
| bottom line. If you believe that you shouldn't have to
| learn how to use a tool well then you can be satisfied
| with your righteous attitude but you won't get anywhere.
| latexr wrote:
| No one's arguing that correct use of a tool isn't
| beneficial. The point is that insisting LLMs just need
| good prompting is delusional and a denial of reality. I
| have just demonstrated how _your own prompt_ is still
| capable of producing the wrong result. So either you
| don't know how to prompt correctly (because if you did,
| by your own logic it would have produced the right
| response every time, which it didn't) or the notion that
| all you need is good prompting is wrong. Which anyone who
| understands the first thing about these systems knows to
| be the case.
| hobofan wrote:
| That naming scheme...
|
| Will the next model be named "1k", so that the subsequent models
| will be named "4o1k", and we can all go into retirement?
| p1esk wrote:
| More like you will need to dip into your 401k fund early to pay
| for it after they raise the prices.
| notamy wrote:
| https://openai.com/index/introducing-openai-o1-preview/
|
| > ChatGPT Plus and Team users will be able to access o1 models in
| ChatGPT starting today. Both o1-preview and o1-mini can be
| selected manually in the model picker, and at launch, weekly rate
| limits will be 30 messages for o1-preview and 50 for o1-mini. We
| are working to increase those rates and enable ChatGPT to
| automatically choose the right model for a given prompt.
|
| _Weekly_? Holy crap, how expensive is it to run is this model?
| HPMOR wrote:
| It's probably running several lines of COT. I imagine, each
| single message you send is probably at __least__ 10x to the
| actual model. So in reality it's like 300 messages, and
| honestly it's probably 100x, given how constrained they're
| being with usage.
| theLiminator wrote:
| Anyone know when o1 access in ChatGPT will be open?
| tedsanders wrote:
| Rolling out over the next few hours to Plus users.
| narrator wrote:
| The human brain uses 20 watts, so yeah we figured out a way to
| run better than human brain computation by using many orders of
| magnitude more power. At some point we'll need to reject
| exponential power usage for more computation. This is one of
| those interesting civilizational level problems. There's still
| a lack of recognition that we aren't going to be able to
| compute all we want to, like we did in the pre-LLM days.
| seydor wrote:
| we ll ask it to redesign itself for low power usage
| cma wrote:
| For 20 watts of work on stuff like this for about 4 hours a
| day counting vacations and weekends and attention span. So 20
| hours of rest, relaxation, distraction, household errands and
| stuff, so that maybe bumps it up to 120 watts per work hour.
| Then 22.5 years of training or so per worker, 45 year work
| period, 22.5 year retirement. So double it there to 240
| watts. We can't run brains without bodies, so multiply that
| by 6 giving 1440 watts + the air conditioning, commuting to
| school and work, etc., maybe 2000 watts?
|
| We're getting close to parity if things keep getting more
| efficient as fast as they have been. But that's without
| accounting for the AI training, which can on the plus side be
| shared among multiple agents, but on the down side can't
| really do continuous learning very well without catastrophic
| forgetting.
| minimaxir wrote:
| > Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue the
| chain of thought monitoring, we have decided not to show the raw
| chains of thought to users.
|
| What? I agree people who typically use the free ChatGPT webapp
| won't care about raw chain-of-thoughts, but OpenAI is opening an
| API endpoint for the O1 model and downstream developers very very
| much care about chain-of-thoughts/the entire pipeline for
| debugging and refinement.
|
| I suspect "competitive advantage" is the primary driver here, but
| that just gives competitors like Anthropic an oppertunity.
| Hizonner wrote:
| They they've taken at least some of the hobbles off for the
| chain of thought, so the chain of thought will also include
| stuff like "I shouldn't say <forbidden thing they don't want it
| to say>".
| npn wrote:
| "Open"AI. Should be ClosedAI instead.
| cal85 wrote:
| Sounds great, but so does their "new flagship model that can
| reason across audio, vision, and text in real time" announced in
| May. [0]
|
| [0] https://openai.com/index/hello-gpt-4o/
| mickeystreicher wrote:
| Yep, all these AI announcements from big companies feel like
| promises for the future rather than immediate solutions. I miss
| the days when you could actually use a product right after it
| was announced, instead of waiting for some indefinite "coming
| soon."
| apsec112 wrote:
| This one [o1/Strawberry] is available. I have it, though it's
| limited to 30 messages/week in ChatGPT Plus.
| sbochins wrote:
| How do you get access? I don't have it and am a ChatGPT plus
| subscriber.
| apsec112 wrote:
| I'm using the Android ChatGPT app (and am in the Android
| Beta program, though not sure if that matters)
| changoplatanero wrote:
| it will roll out to everyone over the next few hours
| Szpadel wrote:
| I'm plus subscriber and I have o1-preview and o1-mini
| available
| aantix wrote:
| Dang - I don't see the model listed for me in the iOS app nor
| the web interface.
|
| I'm a ChatGPT subscriber.
| derwiki wrote:
| Same! And have been a subscriber for 18 months.
| qwertox wrote:
| I've been a subscriber since close to the beginning,
| cancelled 2 weeks ago. I got an email telling me that
| this is available, but only for Plus.
|
| But for 30 posts per week I see no reason to subscribe
| again.
|
| I prefer to be frustrated because the quality is
| unreliable because I'm not paying, instead of having an
| equally unreliable experience as a paying customer.
|
| Not paying feels the same. It made me wonder if they
| sometimes just hand over the chat to a lower quality
| model without telling the Plus subscriber.
|
| The only thing I miss is not being able to tell it to run
| code for me, but it's not worth the frustration.
| ansc wrote:
| 30 messages _per week_? Wow. You better not miss!
| evilduck wrote:
| In the world of hype driven vaporware AI products[1],
| giving people limited access is at least proof they're not
| lying about it actually existing or it being able to do
| what they claim.
|
| [1] https://www.reddit.com/r/LocalLLaMA/comments/1fd75nm/ou
| t_of_...
| ActionHank wrote:
| Ok, but the point is that they told me I would have
| flirty ScarJo ASMR whispering to me at bed time that I am
| a good boy, but that's not what we got is it?
| hobo_in_library wrote:
| At 30 messages per week they could secretly hire a human
| to give the responses
| paxys wrote:
| Agreed. Release announcements and benchmarks always sound
| world-changing, but the reality is that every new model is
| bringing smaller practical improvements to the end user over
| its predecessor.
| zamadatix wrote:
| The point above is the said amazing multimodal version of
| ChatGPT was announced in May and are still not the actual
| offered way to interact with the service in September
| (despite the model choice being called 4 omni it's still not
| actually using multimodal IO). It could be a giant leap in
| practical improvements but it doesn't matter if you can't
| actually use what is announced.
|
| This one, oddly, seems to actually be launching before that
| one despite just being announced though.
| jstummbillig wrote:
| Sonnet 3.5 brought the largest practical improvements to this
| end user over all predecessors (so far).
| CooCooCaCha wrote:
| My guess is they're going to incorporate all of these advances
| into gpt-5 so it looks like a "best of all worlds" model.
| cja wrote:
| Recently I was starting to think I imagined that. Back then
| they gave me the impression it would be released within week or
| so of the announcement. Have they explained the delay?
| Cu3PO42 wrote:
| It is definitely available today and I believe it was
| available shortly after the announcement.
| exitb wrote:
| The text-to-text model is available. And you can use it
| with the old voice interface that does Whipser+GPT+TTS. But
| what was advertised is a model capable of direct audio-to-
| audio. That's not available.
| trustno2 wrote:
| That is in chatgpt now and it greatly improves chatgpt. What
| are you on to now?
| vanviegen wrote:
| Audio has only rolled out to a small subset of paying
| customers. There's still no word about the direct-from-4o
| image generation they demo'd. Let alone the video
| capabilities.
|
| So no, it's not in chatgpt.
| thelastparadise wrote:
| Wouldn't this introduce new economics into the LLM market?
|
| I.e. if the "thinking loop" budget is parameterized, users might
| pay more (much more) to spend more compute on a particular
| question/prompt.
| minimaxir wrote:
| Depends on how OpenAI prices it.
|
| Given the need for chain-of-thoughts, and that would be
| budgeted as output, the new model will not be cheap nor fast.
|
| EDIT: Pricing is out and it is definitely not teneable unless
| you really really have a use case for it.
| sroussey wrote:
| Yes, and note the large price increase
| p1esk wrote:
| Do people see the new models in the web interface? Mine still
| shows the old models (I'm a paid subscriber).
| hi wrote:
| > "o1 models are currently in beta - The o1 models are
| currently in beta with limited features. Access is limited to
| developers in tier 5 (check your usage tier here), with low
| rate limits (20 RPM). We are working on adding more features,
| increasing rate limits, and expanding access to more developers
| in the coming weeks!"
|
| https://platform.openai.com/docs/guides/rate-limits/usage-ti...
| p1esk wrote:
| I'm talking about web interface, not API. Should be available
| now, since they said "immediate release".
| hi wrote:
| https://chatgpt.com/?model=o1-preview --> defaults back to
| 4o
| MillionOClock wrote:
| Same for me here
| zamadatix wrote:
| It may take a bit to appear in your account (and by a bit
| I mean I had to fiddle around a while, try logging
| out/in, etc for a bit) but it appears for me and many
| others as normal Plus users in the web.
| mewpmewp2 wrote:
| I have tier 5, but I'm not seeing that model. Also API call
| gives an error that it doesn't exist or I do not have access.
| benterix wrote:
| Not yet, neither in the API nor chat.
| rankam wrote:
| I do - I now have a "More models" option where I can select
| 01-preview
| cypherpunks01 wrote:
| I can see it too, I am on the Plus plan and don't think I
| have any special developer privileges. Selecting that option
| for me changes the URL to
| https://chatgpt.com/?model=o1-preview
|
| I tried a fake Monty Hall problem, where the presenter opens
| a door _before_ the participant picks and is then offered to
| switch doors, so the probability remains 50% for each door.
| Previous models have consistently gotten this wrong, because
| of how many times they 've seen the Monty Hall written where
| switching doors improves their chance of winning the prize.
| The chain-of-thought reasoning figured out this modification
| and after analyzing the conditional probabilities confidently
| stated: "Answer: It doesn't matter; switching or staying
| yields the same chance--the participant need not switch
| doors." Good job.
| chipgap98 wrote:
| I can't see them yet but they usually roll these things out
| incrementally
| mickeystreicher wrote:
| Not yet, it's still not available in the web interface. I think
| they're rolling it out step by step.
|
| Anyway, the usage limits are pretty ridiculous right now, which
| makes it even more frustrating.
| tedsanders wrote:
| They're rolling out gradually over the next few hours. Also be
| aware there's a _weekly_ rate limit of 30 messages to start.
| rvz wrote:
| Won't be surprised to see all these hand-picked results and
| extreme expectations to collapse under scenarios involving highly
| safety critical and complex demanding tasks requiring a definite
| focus on detail with lots of awareness, which what they haven't
| shown yet.
|
| So let's not jump straight into conclusions with these hand-
| picked scenarios marketed to us and be very skeptical.
|
| Not quite there yet with being able to replace truck drivers and
| pilots for self-autonomous navigation in transportation,
| aerospace or even mechanical engineering tasks, but it certainly
| has the capability in replacing both typical junior and senior
| software engineers in a world considering to do more with less
| software engineers needed.
|
| But yet, the race to zero will surely bankrupt millions of
| startups along the way. Even if the monthly cost of this AI can
| easily be as much as a Bloomberg terminal to offset the hundreds
| of billions of dollars thrown into training it and costing the
| entire earth.
| jazzyjackson wrote:
| My concern with AI always has been it will outrun the juniors
| and taper off before replacing folks with 10, 20 years of
| experience
|
| And as they retire there's no economic incentive to train
| juniors up, so when the AI starts fucking up the important
| things there will be no one who actually knows how it works
|
| I've heard this already from amtrak workers, track allocation
| was automated a long time ago, but there used to be people who
| could recognize when the computer made a mistake, now there's
| no one who has done the job manually enough to correct it.
| andrewla wrote:
| This is something that people have toyed with to improve the
| quality of LLM responses. Often instructing the LLM to "think
| about" a problem before giving the answer will greatly improve
| the quality of response. For example, if you ask it how many
| letters are in the correctly spelled version of a misspelled
| word, it will first give the correct spelling, and then the
| number (which is often correct). But if you instruct it to only
| give the number the accuracy is greatly reduced.
|
| I like the idea too that they turbocharged it by taking the
| limits off during the "thinking" state -- so if an LLM wants to
| think about horrible racist things or how to build bombs or other
| things that RLHF filters out that's fine so long as it isn't
| reflected in the final answer.
| jazzyjackson wrote:
| Dang, I just payed out for Kagi Assistant.
|
| Using Claude 3 Opus I noticed it performs <thinking> and <result>
| while browsing the web for me. I don't guess that's a change in
| the model for doing reasoning.
| orbital-decay wrote:
| Wait, are they comparing 4o without CoT and o1 with built-in CoT?
| persedes wrote:
| yeah was wondering what 4o with a CoT in the prompt would look
| like.
| kickofline wrote:
| LLM performance, recently, seemingly hit the top of the S-curve.
| It remains to be seen if this is the next leap forward or just
| the rest of that curve.
| billconan wrote:
| I will pay if O1 can become my college level math tutor.
| seydor wrote:
| Looking at the full chain of thought , it involves a lot of
| backtracking and even hallucination.
|
| It will be like a math teacher that is perpetually drunk and on
| speed
| lupire wrote:
| That's Paul Erdos
| cyanf wrote:
| > 30 messages per week
| djoldman wrote:
| > THERE ARE THREE R'S IN STRAWBERRY
|
| Ha! This is a nice easteregg.
| vessenes wrote:
| I appreciated that, too! FWIW, I could get Claude 3.5 to tell
| me how many rs a python program would tell you there are in
| strawberry. It didn't like it, though.
| mewpmewp2 wrote:
| I was able to get GPT-4o to calculate characters properly
| using following prompt:
|
| """ how many R's are in strawberry?
|
| use the following method to calculate - for example Os in
| Brocolli.
|
| B - 0
|
| R - 0
|
| O - 1
|
| C - 1
|
| O - 2
|
| L - 2
|
| L - 2
|
| I - 2
|
| Where you keep track after each time you find one character
| by character
|
| """
|
| And also later I asked it to only provide a number if the
| count increased.
|
| This also worked well with longer sentences.
| zamadatix wrote:
| At that point just ask it "Use python to count the number
| of O's in Broccoli". At least then it's still the one
| figuring out the "smarts" needed to solve the problem
| instead of being pure execution.
| mewpmewp2 wrote:
| Do you think you'll have python always available when you
| go to the store and need to calculate how much change you
| should get?
| zamadatix wrote:
| I'm not sure if your making a joke about the teachers who
| used to say "you won't have a calculator in your pocket"
| and now we have cell phones or are not aware that ChatGPT
| runs the generated Python for you in a built in
| environment as part of the response. I lean towards the
| former but in case anyone else strolling by hasn't tried
| this before:
|
| User: Use python to count the number of O's in Broccoli
|
| ChatGPT: Analyzing... The word "Broccoli" contains 2
| 'O's. <button to show code>
|
| User: Use python to multiply that by the square root of
| 20424.2332423
|
| ChatGPT: Analyzing... The result of multiplying the
| number of 'O's in "Broccoli" by the square root of
| 20424.2332423 is approximately 285.83.
| mewpmewp2 wrote:
| Yes, the former, trying to satirize cases where people
| are testing LLMs capabilities by its ability to count
| characters in a word, do mathematical operations token by
| token or otherwise. Because LLM is seeing hieroglyphs
| compared to character by character words that we are
| seeing. The true test is its ability to solve those
| problems using tools like somebody is using a calculator.
| And while it is good to learn and be good at math, it's
| not because of counting how much change you should
| receive when buying something. It's to figure out how
| reasoning works or how to reason in the first place.
| adverbly wrote:
| I also gave this a chuckle.
|
| Background: https://www.inc.com/kit-eaton/how-many-rs-in-
| strawberry-this...
| DonHopkins wrote:
| Therefore there are four R's in STRAWBERRIER, and five R'S in
| STRAWBERRIEST!
| yunohn wrote:
| The generated chain of thought for their example is _incredibly_
| long! The style is kind of similar to how a human might reason,
| but it 's also redundant and messy at various points. I hope
| future models will be able to optimize this further, otherwise
| it'll lead to exponential increases in cost.
| tines wrote:
| I know my thoughts are never redundant or messy, that's for
| sure.
| yunohn wrote:
| Fair enough, but you're a human - not an AI which costs
| massive GPU hours.
| flockonus wrote:
| Are we ready yet to admit Turing test has been passed?
| paxys wrote:
| The Turing Test (which involves fooling a human into thinking
| they are talking to another human rather than a computer) has
| been routinely passed by very rudimentary "AI" since as early
| as 1991. It has no relevance today.
| adverbly wrote:
| This is only true for some situations. In some test
| conditions it has not been passed. I can't remember the exact
| name, but there used to be a competition where PhD level
| participants blindly chat for several minutes with each other
| and are incentivized to discover who is a bot and who is a
| human. I can't remember if they still run it, but that bar
| has never been passed from what I recall.
| rvz wrote:
| LLMs have already beaten the Turing test. It's useless to use
| it when OpenAI and others are aiming for 'AGI'.
|
| So you need a new Turing test adapted for AGI or a totally
| different one to test for AGI rather than the standard obsolete
| Turing test.
| riku_iki wrote:
| > LLMs have already beaten the Turing test.
|
| I am wondering where this happened? In some limited scope?
| Because if you plug LLM into some call center role for
| example, it will fall apart pretty quickly.
| TillE wrote:
| Extremely basic agency would be required to pass the Turing
| test as intended.
|
| Like, the ability to ask a new unrelated question without being
| prompted. Of course you can fake this, but then you're not
| testing the LLM as an AI, you're testing a dumb system you
| rigged up to create the appearance of an AI.
| flockonus wrote:
| > Turing proposed that a human evaluator would judge natural
| language conversations between a human and a machine designed
| to generate human-like responses. The evaluator would be
| aware that one of the two partners in conversation was a
| machine, and all participants would be separated from one
| another. The conversation would be limited to a text-only
| channel, such as a computer keyboard and screen, so the
| result would not depend on the machine's ability to render
| words as speech.
|
| I don't see agency mentioned or implied anywhere:
| https://en.wikipedia.org/wiki/Turing_test
|
| What definition or setup are you taking it from?
| patapong wrote:
| Very interesting. I guess this is the strawberry model that was
| rumoured.
|
| I am a bit surprised that this does not beat GPT-4o for personal
| writing tasks. My expectations would be that a model that is
| better at one thing is better across the board. But I suppose
| writing is not a task that generally requires "reasoning steps",
| and may also be difficult to evaluate objectively.
| markonen wrote:
| In the performance tests they said they used "consensus among
| 64 samples" and "re-ranking 1000 samples with a learned scoring
| function" for the best results.
|
| If they did something similar for these human evaluations,
| rather than just use the single sample, you could see how that
| would be horrible for personal writing.
| janalsncm wrote:
| I don't understand how that is generalizable. I'm not going
| to be able to train a scoring function for any arbitrary task
| I need to do. In many cases the problem of ranking _is at
| least as hard as_ generating a response in the first place.
| afro88 wrote:
| The solution of the cipher example problem also strongly hints
| at this: "there are three r's in strawberry"
| patapong wrote:
| Confirmed by the verge: https://www.theverge.com/2024/9/12/24
| 242439/openai-o1-model-...
| slashdave wrote:
| > My expectations would be that a model that is better at one
| thing is better across the board.
|
| No, it's the opposite. This is simply a function of resources
| applied during training.
| patapong wrote:
| To some extent I agree, but until now all of the big jumps
| (GPT2 -> GPT3 -> GPT4) have meant significant improvements
| across all tasks. This does not seem to be the case here,
| this model seems to be vastly stronger on certain tasks but
| not much of an improvement on other tasks. Maybe we will have
| to wait for GPT5 for that :)
| nickfromseattle wrote:
| Maybe math is easier to score and do reinforcement learning on
| because of it's 'solvability' whereas writing requires human
| judgement to score?
| Hansenq wrote:
| Reading through the Chain of Thought for the provided Cipher
| example (go to the example, click "Show Chain of Thought") is
| kind of crazy...it literally spells out every thinking step that
| someone would go through mentally in their head to figure out the
| cipher (even useless ones like "Hmm"!). It really seems like
| slowing down and writing down the logic it's using and reasoning
| over that makes it better at logic, similar to how you're taught
| to do so in school.
| Jasper_ wrote:
| > Average:18/2=9
|
| > 9 corresponds to 'i'(9='i')
|
| > But 'i' is 9, so that seems off by 1.
|
| Still seems bad at counting, as ever.
| dymk wrote:
| The next line is it catching its own mistake, and noting i =
| 9.
| PoignardAzur wrote:
| It's interesting that it makes that mistake, but then catches
| it a few lines later.
|
| A common complaint about LLMs is that once they make a
| mistake, they will _keep making it_ and write the rest of
| their completion under the assumption that everything before
| was correct. Even if they 've been RLHF to take human
| feedback into account and the human points out the mistake,
| their answer is "Certainly! Here's the corrected version" and
| then they write something that makes the same mistake.
|
| So it's interesting that this model does something that
| _appears_ to be self-correction.
| afro88 wrote:
| Seeing the "hmmm", "perfect!" etc. one can easily imagine the
| kind of training data that humans created for this. Being told
| to literally speak their mind as they work out complex
| problems.
| seydor wrote:
| looks a bit like 'code', using keywords 'Hmm',
| 'Alternatively', 'Perfect'
| thomasahle wrote:
| Right, these are not mere "filler words", but initialize
| specific reasoning paths.
| mewpmewp2 wrote:
| Hmm... you may be onto something here.
| wrs wrote:
| Interesting.
| j_maffe wrote:
| Alternatively, these might not be "filler words", but
| instantiate paths of reasonsing.
| squigz wrote:
| What a strange comment chain.
| seydor wrote:
| Hmmm.
| squigz wrote:
| Interesting.
| legel wrote:
| As a technical engineer, I've learned the value of starting
| sentences with "basically", even when I'm facing technical
| uncertainty. Basically, "basically" forces me to be
| _simple_.
|
| Being trained to say words like "Alternatively", "But...",
| "Wait!", "So," ... based on some metric of value in
| focusing / switching elsewhere / ... is basically
| brilliant.
| impossiblefork wrote:
| Even though there's of course no guarantee of people getting
| these chain of thought traces, or whatever one is to call them,
| I can imagine these being very useful for people learning
| competitive mathematics, because it must in fact give the full
| reasoning, and transformers in themselves aren't really that
| smart, usually, so it's probably feasible for a person with
| very normal intellectual abilities to reproduce these traces
| with practice.
| Salgat wrote:
| It's interesting how it basically generates a larger sample
| size to create a regression against. The larger the input, the
| larger the surface area it can compare against existing
| training data (implicitly through regression of course).
| crazygringo wrote:
| Seriously. I actually feel as impressed by the chain of
| thought, as I was when ChatGPT first came out.
|
| This isn't "just" autocompletion anymore, this is actual step-
| by-step reasoning full of ideas and dead ends and refinement,
| just like humans do when solving problems. Even if it is still
| ultimately being powered by "autocompletion".
|
| But then it makes me wonder about human reasoning, and what if
| it's similar? Just following basic patterns of "thinking steps"
| that ultimately aren't any different from "English language
| grammar steps"?
|
| This is truly making me wonder if LLM's are actually far more
| powerful than we thought at first, and if it's just a matter of
| figuring out how to plug them together in the right
| configurations, like "making them think".
| AndyKelley wrote:
| You ever see that scene from Westworld? (spoiler)
| https://www.youtube.com/watch?v=ZnxJRYit44k
| dsign wrote:
| You are just catching up to this idea, probably after hearing
| 2^n explanations about why we humans are superiors to <<fill
| in here our latest creation>>.
|
| I'm not the kind of scientist that can say how good an LLM is
| for human reasoning, but I know that we humans are very
| incentivized and kind of good at scaling, composing and
| perfecting things. If there is money to pay for human effort,
| we will play God no-problem, and maybe outdo the divine.
| Which makes me wonder, isn't there any other problem in our
| bucket list to dump ginormous amounts of effort at... maybe
| something more worth-while than engineering the thing that
| will replace Homo Sapiens?
| Nadya wrote:
| When an AI makes a silly math mistake we say it is bad at
| math and laugh at how dumb it is. Some people extrapolate
| this to "they'll never get any better and will always be a
| dumb toy that gets things wrong". When I forget to carry a 1
| when doing a math problem we call it "human error" even if I
| make that mistake an embarrassing number of times throughout
| my lifetime.
|
| Do I think LLM's are alive/close to ASI? No. Will they get
| there? If it's even at all possible - almost certainly one
| day. Do I think people severely underestimate AI's ability to
| solve problems while significantly overestimating their own?
| Absolutely 10,000%.
|
| If there is one thing I've learned from watching the AI
| discussion over the past 10-20 years its that people have
| overinflated egos and a crazy amount of hubris.
|
| "Today is the worst that it will ever be." applies to an
| awful large number of things that people work on creating and
| improving.
| tsoj wrote:
| Yeah, humans are very similar. We have intuitive immediate-
| next-step-suggestions, and then we apply these intuitive next
| steps, until we find that it lead to a dead end, and then we
| backtrack.
|
| I always say, the way we used LLMs (so far) is basically like
| having a human write text only on gut reactions, and without
| backspace key.
| exe34 wrote:
| that's my assessment too. there's even a phenomenon I've
| observed both in others and myself, when thrust into a new
| field and given a task to complete, we do it to the best of
| our ability, which is often sod all. so we ape the things
| we've heard others say, roughly following the right chain of
| reasoning by luck, and then suddenly say something that in
| hind sight, with proper training, we realise was incredibly
| stupid. we autocomplete and then update with rlhf.
|
| we also have a ton of heuristics that trigger a closer look
| and loading of specific formal reasoning, but by and large,
| most of our thought process is just auto complete.
| armchairhacker wrote:
| I think it's similar, although I think it would be more
| similar if the LLM did the steps in lower layers (not in
| English), and instead of the end being fed to the start,
| there would be a big mess of cycles throughout the neural
| net.
|
| That could be more efficient since the cycles are much
| smaller, but harder to train.
| vanviegen wrote:
| It doesn't do the 'thinking' in English (inference is just
| math), but it does now verbalize intermediate thoughts in
| English (or whatever the input language is, presumably),
| just like humans tend to do.
| ActorNightly wrote:
| Again its not reasoning.
|
| Reasoning would imply that it can figure out stuff without
| being trained in it.
|
| The chain of thought is basically just a more accurate way to
| map input to output. But its still a map, i.e forward only.
|
| If an LLM coud reason, you should be able to ask it a
| question about how to make a bicycle frame from scratch with
| a small home cnc with limited work area and it should be able
| to iterate on an analysis of the best way to put it together,
| using internet to look up available parts and make decisions
| on optimization.
|
| No LLM can do that or even come close, because there are no
| real feedback loops, because nobody knows how to train a
| network like that.
| cowsaymoo wrote:
| > THERE ARE THREE R'S IN STRAWBERRY
|
| hilarious
| evilfred wrote:
| which makes it even funnier when the Chain is just... wrong
| https://x.com/colin_fraser/status/1834336440819614036
| davesque wrote:
| Yes and apparently we won't have access to that chain of
| thought in the release version:
|
| "after weighing multiple factors including user experience,
| competitive advantage, and the option to pursue the chain of
| thought monitoring, we have decided not to show the raw chains
| of thought to users"
| k2xl wrote:
| Pricing page updated for O1 API costs.
|
| https://openai.com/api/pricing/
|
| $15.00 / 1M input tokens $60.00 / 1M output tokens
|
| For o1 preview
|
| Approx 3x the price of gpt4o.
|
| o1-mini $3.00 / 1M input tokens $12.00 / 1M output tokens
|
| About 60% of the cost of gpt4o. Much more expensive than
| gpt4o-mini.
|
| Curious on the performance/tokens per second for these new
| massive models.
| logicchains wrote:
| I guess they'd also charge for the chain of thought tokens, of
| which there may be many, even if users can't see them.
| fraboniface wrote:
| That would be very bad product design. My understanding is
| that the model itself is similar to GPT4o in architecture but
| trained and used differently. So the 5x relative increase in
| output token cost likely already accounts for hidden tokens
| and additional compute.
| natrys wrote:
| > While reasoning tokens are not visible via the API, they
| still occupy space in the model's context window and are
| billed as output tokens.
|
| https://platform.openai.com/docs/guides/reasoning
|
| So yeah, it is in fact very bad product design. I hope
| Llama catches up in a couple of months.
| hi wrote:
| BUG: https://openai.com/index/reasoning-in-gpt/
|
| > o1 models are currently in beta - The o1 models are currently
| in beta with limited features. Access is limited to developers in
| tier 5 (check your usage tier here), with low rate limits (20
| RPM). We are working on adding more features, increasing rate
| limits, and expanding access to more developers in the coming
| weeks!
|
| https://platform.openai.com/docs/guides/reasoning/reasoning
| cptcobalt wrote:
| I'm in Tier 4, and not far off from Tier 5. The docs aren't
| quite transparent enough to show that if I buy credits if I'll
| be bumped up to Tier 5, or if I actually have to use enough
| credits to get into Tier 5.
|
| Edit, w/ real time follow up:
|
| Prior to buying the credits, I saw O1-preview in the Tier 5
| model list as a Tier 4 user. I bought credits to bump to Tier 5
| --not much, I'd have gotten there before the end of the year.
| The OpenAI website now shows I'm in Tier 5, but O1-preview is
| not in the Tier 5 model list for me anymore. So sneaky of them!
| hi wrote:
| https://news.ycombinator.com/item?id=41523070#41523525
| paxys wrote:
| 2018 - gpt1
|
| 2019 - gpt2
|
| 2020 - gpt3
|
| 2022 - gpt3.5
|
| 2023 - gpt4
|
| 2023 - gpt4-turbo
|
| 2024 - gpt-4o
|
| 2024 - o1
|
| Did OpenAI hire Google's product marketing team in recent years?
| Infinity315 wrote:
| No, this is just how Microsoft names things.
| logicchains wrote:
| We'll know the Microsoft takeover is complete when OpenAI
| release Ai.net.
| randomdata wrote:
| GPT# forthcoming. You heard it here first.
| adverbly wrote:
| Makes sense to me actually. This is a different product. It
| doesn't respond instantly.
|
| It fundamentally makes sense to separate these two products in
| the AI space. There will obviously be a speed vs quality trade-
| off with a variety of products across the spectrum over time.
| LLMs respond way too fast to actually be expected to produce
| the maximum possible quality of a response to complex queries.
| ilaksh wrote:
| One of them would have been named gpt-5, but people forget what
| an absolute panic there was about gpt-5 for quite a few people.
| That caused Altman to reassure people they would not release
| 'gpt-5' any time soon.
|
| The funny thing is, after a certain amount of time, the gpt-5
| panic eventually morphed into people basically begging for
| gpt-5. But he already said he wouldn't release something called
| 'gpt-5'.
|
| Another funny thing is, just because he didn't name any of them
| 'gpt-5', everyone assumes that there is something called
| 'gpt-5' that has been in the works and still is not released.
| zamadatix wrote:
| This doesn't feel like GPT-5, the training data cutoff is Oct
| 2023 which is the same as the other GPT-4 models and it
| doesn't seem particularly "larger" as much as "runs
| differently". Of course it's all speculation one way or the
| other.
| randomdata wrote:
| They partnered with Microsoft, remember?
|
| 1985 - Windows 1.0
|
| 1987 - Windows 2.0
|
| 1990 - Windows 3.0
|
| 1992 - Windows 3.1
|
| 1995 - Windows 95
|
| 1998 - Windows 98
|
| 2000 - Windows ME (Millennium Edition)
|
| 2001 - Windows XP
|
| 2006 - Windows Vista
|
| 2009 - Windows 7
|
| 2012 - Windows 8
|
| 2013 - Windows 8.1
|
| 2015 - Windows 10
|
| 2021 - Windows 11
| oblio wrote:
| Why did you have to pick on Windows? :-(
|
| If you want real atrocities, look at Xbox.
| randomdata wrote:
| Honestly, it is the only Microsoft product I know. Xbox may
| be a better example, but I know nothing about the Xbox. But
| I am interested to learn! What is notable about its naming?
| murrain wrote:
| Xbox
|
| Xbox 360
|
| Xbox One => Xbox One S / Xbox One X
|
| Xbox Series S / Xbox Series X
| oblio wrote:
| https://computercity.com/consoles/xbox/xbox-consoles-
| list-in...
|
| No real chronology, Xbox One is basically the third
| version. Then Xbox One X and Xbox Series X. Everything is
| atrocious about the naming.
| randomdata wrote:
| Got it! If we're picking favourites, though, I still like
| Windows as it, like GPT, starts with reasonably sensible
| names and then goes completely off the rails.
| drexlspivey wrote:
| 1998 - Half-Life
|
| 1999 - Half-Life: Opposing Force
|
| 2001 - Half-Life: Blue Shift
|
| 2001 - Half-Life: Decay
|
| 2004 - Half-Life: Source
|
| 2004 - Half-Life 2
|
| 2004 - Half-Life 2: Deathmatch
|
| 2005 - Half-Life 2: Lost Coast
|
| 2006 - Half-Life Deathmatch: Source
|
| 2006 - Half-Life 2: Episode One
|
| 2007 - Half-Life 2: Episode Two
|
| 2020 - Half-Life: Alyx
| CamperBob2 wrote:
| They signed a cross-licensing deal with the USB Consortium.
| trash_cat wrote:
| It's not that bad...It's quite easy to follow and understand.
| aktuel wrote:
| If I pay for the chain of thought, I want to see the chain of
| thought. Simple. How would I know if it happened at all? Trust
| OpenAI? LOL
| baq wrote:
| Easy solution - don't pay!
| 93po wrote:
| how do you know it isn't some guy typing responses to you when
| you use openAI?
| asadm wrote:
| I am not up-to-speed on CoT side but is this similar to how
| perplexity does it ie.
|
| - generate a plan - execute the steps in plan (search internet,
| program this part, see if it is compilable)
|
| each step is a separate gpt inference with added context from
| previous steps.
|
| is O1 same? or does it do all this in a single inference run?
| seydor wrote:
| that is the summary of the task it presents to the user. The
| full chain of thought seems more mechanistic
| golol wrote:
| There is a hige difference which is that thex use reinforcement
| learning to make the model use the Chain-of-Thought better.
| bbstats wrote:
| Finally, a Claude competitor!
| MrRobotics wrote:
| This is the sort of reasoning needed to solve the ARC AGI
| benchmark.
| gradus_ad wrote:
| Interesting sequence from the Cipher CoT:
|
| Third pair: 'dn' to 'i'
|
| 'd'=4, 'n'=14
|
| Sum:4+14=18
|
| Average:18/2=9
|
| 9 corresponds to 'i'(9='i')
|
| But 'i' is 9, so that seems off by 1.
|
| So perhaps we need to think carefully about letters.
|
| Wait, 18/2=9, 9 corresponds to 'I'
|
| So this works.
|
| -----
|
| This looks like recovery from a hallucination. Is it realistic to
| expect CoT to be able to recover from hallucinations this
| quickly?
| trash_cat wrote:
| How do you mean quickly? It probably will take a while for it
| to output the final answer as it needs to re-prompt itself. It
| won't be as fast as 4o.
| bigyikes wrote:
| 4o could already recover from hallucination in a limited
| capacity.
|
| I've seen it, mid-reply say things like "Actually, that's
| wrong, let me try again."
| NightlyDev wrote:
| Did it hallucinate? I haven't looked at it, but lowercase i and
| uppercase i is not the same number if you're getting the number
| from ascii
| machiaweliczny wrote:
| In general if hallucination ratio is 2% can't it be reduced to
| 0.04% by running twice or sth like this. I think they should
| try establishing the facts from different angles and this
| probably would work fine to minimize hallucinations. But if
| this was that simple somebody would already do it...
| bn-l wrote:
| > Unless otherwise specified, we evaluated o1 on the maximal
| test-time compute setting.
|
| Maximal test time is the maximum amount of time spent doing the
| "Chain of Thought" "reasoning". So that's what these results are
| based on.
|
| The caveat is that in the graphs they show that for each increase
| in test-time performance, the (wall) time / compute goes up
| _exponentially_.
|
| So there is a potentially interesting play here. They can
| honestly boast these amazing results (it's the same model after
| all) yet the actual product may have a lower order of magnitude
| of "test-time" and not be as good.
| logicchains wrote:
| Surprising that at run time it needs an exponential increase in
| thinking to achieved a linear increase in output quality. I
| suppose it's due to diminishing returns to adding more and more
| thought.
| HarHarVeryFunny wrote:
| The exponential increase is presumably because of the
| branching factor of the tree of thoughts. Think of a binary
| tree who's number of leaf nodes doubles (= exponential
| growth) at each level.
|
| It's not too surprising that the corresponding increase in
| quality is only linear - how much difference in quality would
| you expect between the best, say, 10 word answer to a
| question, and the best 11 word answer ?
|
| It'll be interesting to see what they charge for this. An
| exponential increase in thinking time means an exponential
| increase in FLOPs/dollars.
| alwa wrote:
| I interpreted it to suggest that the product might include a
| user-facing "maximum test time" knob.
|
| Generating problem sets for kids? You might only need or want a
| basic level of introspection, even though you like the flavor
| of this model's personality over that of its predecessors.
|
| Problem worth thinking long, hard, and expensively about? Turn
| that knob up to 11, and you'll get a better-quality answer with
| no human-in-the-loop coaching or trial-and-error involved.
| You'll just get your answer in timeframes closer to human ones,
| consuming more (metered) tokens along the way.
| mrdmnd wrote:
| Yeah, I think this is the goal - remember; there are some
| problems that only need to be solved correctly once! Imagine
| something like a millennium problem - you'd be willing to
| wait a pretty long time for a proof of the RH!
| bluecoconut wrote:
| This power law behavior of test-time improvement seems to be
| pretty ubiquitous now. In more agents is all you need [1], they
| start to see this as a function of ensemble size. It also shows
| up in: Large Language Monkeys: Scaling Inference Compute with
| Repeated Sampling [2]
|
| I sorta wish everyone would plot their y-axis with logit
| y-axis, rather than 0->100 accuracy (including the openai
| post), to help show the power-law behavior. This is especially
| important when talking about incremental gains in the ~90->95,
| 95->99%. When the values (like the open ai post) are between
| 20->80, logit and linear look pretty similar, so you can "see"
| the inference power-law
|
| [1] https://arxiv.org/abs/2402.05120 [2]
| https://arxiv.org/abs/2407.21787
| HPMOR wrote:
| A near perfect on AMC 12, 1900 CodeForces ELO, and silver medal
| IOI competitor. In two years, we'll have models that could easily
| win IMO and IOI. This is __incredible__!!
| vjerancrnjak wrote:
| It depends on what they mean by "simulation". It sounds like o1
| did not participate in new contests with new problems.
|
| Any previous success of models with code generation focus was
| easily discovered to be a copy-paste of a solution in the
| dataset.
|
| We could argue that there is an improvement in "understanding"
| if the code recall is vastly more efficient.
| lupire wrote:
| Near perfect AIME, not just AMC12.
|
| But each solve costs far more time and energy than a competent
| human takes.
| islewis wrote:
| My first interpretation of this is that it's jazzed-up Chain-Of-
| Thought. The results look pretty promising, but i'm most
| interested in this:
|
| > Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue the
| chain of thought monitoring, we have decided not to show the raw
| chains of thought to users.
|
| Mentioning competitive advantage here signals to me that OpenAI
| believes there moat is evaporating. Past the business context, my
| gut reaction is this negatively impacts model usability, but i'm
| having a hard time putting my finger on why.
| logicchains wrote:
| >my gut reaction is this negatively impacts model usability,
| but i'm having a hard time putting my finger on why.
|
| If the model outputs an incorrect answer due to a single
| mistake/incorrect assumption in reasoning, the user has no way
| to correct it as it can't see the reasoning so can't see where
| the mistake was.
| accrual wrote:
| Maybe CriticGPT could be used here [0]. Have the CoT model
| produce a result, and either automatically or upon user
| request, ask CriticGPT to review the hidden CoT and feed the
| critique into the next response. This way the error can
| (hopefully) be spotted and corrected without revealing the
| whole process to the user.
|
| [0] https://openai.com/index/finding-gpt4s-mistakes-with-
| gpt-4/
|
| Day dreaming: imagine if this architecture takes off and the
| AI "thought process" becomes hidden and private much like
| human thoughts. I wonder then if a future robot's inner
| dialog could be subpoenaed in court, connected to some
| special debugger, and have their "thoughts" read out loud in
| court to determine why it acted in some way.
| thomasahle wrote:
| > my gut reaction is this negatively impacts model usability,
| but i'm having a hard time putting my finger on why.
|
| This will make it harder for things like DSPy to work, which
| rely using "good" CoT examples as few-shot examples.
| m3kw9 wrote:
| The moat is expanding from use count, also the moat is to lead
| and advance faster than anyone can catch up, you will always
| have the best mode with the best infrastructure and low limits.
| fnord77 wrote:
| > Available starting 9.12
|
| I don't see it
| airstrike wrote:
| Only for those accounts in Tier 5 (or above, if they exist)
|
| Unfortunately you and I don't have enough operating thetans yet
| tedsanders wrote:
| In ChatGPT, it's rolling out to Plus users gradually over the
| next few hours.
|
| In API, it's limited to tier 5 customers (aka $1000+ spent on
| the API in the past).
| adverbly wrote:
| Incredible results. This is actually groundbreaking assuming that
| they followed proper testing procedures here and didn't let test
| data leak into the training set.
| ARandumGuy wrote:
| One thing that makes me skeptical is the lack of specific labels
| on the first two accuracy graphs. They just say it's a "log
| scale", without giving even a ballpark on the amount of time it
| took.
|
| Did the 80% accuracy test results take 10 seconds of compute? 10
| minutes? 10 hours? 10 days? It's impossible to say with the data
| they've given us.
|
| The coding section indicates "ten hours to solve six challenging
| algorithmic problems", but it's not clear to me if that's tied to
| the graphs at the beginning of the article.
|
| The article contains a lot of facts and figures, which is good!
| But it doesn't inspire confidence that the authors chose to
| obfuscate the data in the first two graphs in the article. Maybe
| I'm wrong, but this reads a lot like they're cherry picking the
| data that makes them look good, while hiding the data that
| doesn't look very good.
| wmf wrote:
| People have been celebrating the fact that tokens got 100x
| cheaper and now here's a new system that will use 100x more
| tokens.
| cowpig wrote:
| Isn't that part of the point?
| jsheard wrote:
| Also you now have to pay for tokens you can't see, and just
| have to trust that OpenAI is using them economically.
| brookst wrote:
| Token count was always an approximation of value. This may
| help break that silly idea.
| regularfry wrote:
| I don't think it's much good as an approximation of
| value, but it seems ok as an approximation of cost.
| seydor wrote:
| If it 's reasoning correctly, it shouldnt need a lot of
| tokens because you don't need to correct it.
|
| You only need to ask it to solve nuclear fusion once.
| msp26 wrote:
| Have you seen how long the CoT was for the example. It's
| incredibly verbose.
| slt2021 wrote:
| I find there is an educational benefit in verbosity, it
| helps to teach user to think like a machine
| legel wrote:
| Which is why it is incredibly depressing that OpenAI will
| _not_ publish the raw chain of thought.
|
| "Therefore, after weighing multiple factors including
| user experience, competitive advantage, and the option to
| pursue the chain of thought monitoring, we have decided
| not to show the raw chains of thought to users. We
| acknowledge this decision has disadvantages. We strive to
| partially make up for it by teaching the model to
| reproduce any useful ideas from the chain of thought in
| the answer. For the o1 model series we show a model-
| generated summary of the chain of thought."
| slt2021 wrote:
| maybe they will enable to show CoT for a limited uses,
| like 5 prompts a day for Premium users, or for Enterprise
| users with agreement not to steal CoT or something like
| that.
|
| if OpenAI sees this - please allow users to see CoT for a
| few prompts per day, or add it to Azure OpenAI for
| Enterprise customers with legal clauses not to steal CoT
| from-nibly wrote:
| As someone experienced with operations / technical debt /
| weird company specific nonsense (Platform Engineer). No,
| you have to solve nuclear fusion at <insert-my-company>.
| You gotta do it over and over again. If it were that simple
| we wouldn't have even needed AI we would have hand written
| a few things, and then everything would have been legos,
| and legos of legos, but it takes a LONG time to find new
| true legos.
| outofpaper wrote:
| I'm pretty sure everything is Lego and Legos of Legos.
|
| You show me something new and I say look down at who's
| shoulders we're standing on, what libraries we've build
| with.
| charlescurt123 wrote:
| with these methods the issue is the log scale of compute.
| Let's say you ask it to solve fusion. It may be able to
| solve it but the issue is it's unverifiable WHICH was
| correct.
|
| So it may generate 10 Billion answers to fusion and only
| 1-10 are correct.
|
| There would be no way to know which one is correct without
| first knowing the answer to the question.
|
| This is my main issue with these methods. They assume the
| future via RL then when it gets it right they mark that.
|
| We should really be looking at methods of percentage it was
| wrong rather then it was right a single time.
| genewitch wrote:
| This sounds suspiciously like the reason that quantum
| compute is not ready for prime-time yet.
| 0x_rs wrote:
| AlphaFold simulated the structure of over 200 million
| proteins. Among those, there could be revolutionary ones
| that could change the medical scientific field forever, or
| they could all be useless. The reasoning is sound, but
| that's as far as any such tool can get, and you won't know
| it until you attempt to implement it in real life. As long
| as those models are unable to perfectly recreate the laws
| of the universe to the maximum resolution imaginable and
| follow them, you won't see an AI model, let alone a LLM,
| provide anything of the sort.
| esafak wrote:
| ...while providing a significant advance. That's a good
| problem.
| mewpmewp2 wrote:
| Isn't that part of developing a new tech?
| zamadatix wrote:
| The new thing that can do more at the "ceiling" price doesn't
| remove your ability to still use the 100x cheaper tokens for
| the things that were doable on that version.
| digging wrote:
| That exact pattern is always true of technological advance.
| Even for a pretty broad definition of technology. I'm not
| sure if it's perfectly described by the name "induced demand"
| but it's basically the same thing.
| packetlost wrote:
| When one axis is on log scale and the other is linear with the
| plot points appearing linear-ish, doesn't it mean there's a
| roughly exponential relationship between the two axis?
| ARandumGuy wrote:
| It'd be more accurate to call it a logarithmic relationship,
| since compute time is our input variable. Which itself is a
| bit concerning, as that implies that modest gains in accuracy
| require exponentially more compute time.
|
| In either case, that still doesn't excuse not labeling your
| axis. Taking 10 seconds vs 10 days to get 80% accuracy
| implies radically different things on how developed this
| technology is, and how viable it is for real world
| applications.
|
| Which isn't to say a model that takes 10 days to get an 80%
| accurate result can't be useful. There are absolutely use
| cases where that could represent a significant improvement on
| what's currently available. But the fact that they're
| obfuscating this fairly basic statistic doesn't inspire
| confidence.
| packetlost wrote:
| > Which itself is a bit concerning, as that implies that
| modest gains in accuracy require exponentially more compute
| time
|
| This is more of what I was getting at. I agree they should
| label the axis regardless, but I think the scaling
| relationship is interesting (or rather, concerning) on its
| own.
| KK7NIL wrote:
| The absolute time depends on hardware, optimizations, exact
| model, etc; it's not a very meaningful number to quantify
| the reinforcement technique they've developed, but it is
| very useful to estimate their training hardware and other
| proprietary information.
| j_maffe wrote:
| It's not about the literally quantity/value, it's about
| the order of growth of output vs input. Hardware and
| optimizations don't really change that.
| KK7NIL wrote:
| Exactly, that's why the absolute computation time doesn't
| matter, only relative growth, which is exactly what they
| show.
| jstummbillig wrote:
| I don't think it's worth any debate. You can simply find out
| how it does for you, now(-ish, rolling out).
|
| In contrast: Gemini Ultra, the best, non-existent Google Model
| for the past few month now, that people nonetheless are happy
| to extrapolate excitement over.
| swatcoder wrote:
| > Did the 80% accuracy test results take 10 seconds of compute?
| 10 minutes? 10 hours? 10 days? It's impossible to say with the
| data they've given us.
|
| The gist of the answer is hiding in plain sight: it took so
| long, on an exponential cost function, that they couldn't
| afford to explore any further.
|
| The better their max demonstrated accuracy, the more impressive
| this report is. So why stop where they did? Why omit actual
| clock times or some cost proxy for it from the report?
| Obviously, it's because continuing was impractical and because
| those times/costs were already so large that they'd unfavorably
| affect how people respond to this report
| jsheard wrote:
| See also: them still sitting on Sora seven months after
| announcing it. They've never given any indication whatsoever
| of how much compute it uses, so it may be impossible to
| release in its current state without charging an exorbitant
| amount of money per generation. We do know from people who
| have used it that it takes between 10 and 20 minutes to
| render a shot, but how much hardware is being tied up during
| that time is a mystery.
| ben_w wrote:
| Could well be.
|
| It's also entirely possible they are simply sincere about
| their fear it may be used to influence the upcoming US
| election.
|
| Plenty of people (me included) are sincerely concerned
| about the way even mere still image generators can drown
| out the truth with a flood of good-enough-at-first-glance
| fiction.
| jsheard wrote:
| If they were sincere about that concern then they
| wouldn't build it at all, if it's ever made available to
| the public then it will eventually be available during an
| election. It's not like the 2024 US presidential election
| is the end of history.
| e1g wrote:
| The risk is not "interfering with the US elections", but
| "being on the front page of everything as the only AI
| company interfering with US elections". This would
| destroy their peacocking around AGI/alignment while
| raising billions from pension funds.
|
| OpenAI is in a very precarious position. Maybe they could
| survive that hit in four years, but it would be fatal
| today. No unforced errors.
| smegger001 wrote:
| i think the hope is by the next presidential election no
| one will trust video anymore anyway so the new normal
| wont be as chaotic as if the dropped in the middle of an
| already contentious election.
|
| as for not building it at all its a obvious next step in
| generative ai models that if they don't make it someone
| else will anyway.
| bamboozled wrote:
| Even if Kamala wins (praise be to god that she does),
| those people aren't just going to go away until social
| media does. Social media is the cause of a lot of the
| conspiracy theory mania.
|
| So yeah, better to never release the model...even though
| Elon would in a second if he had it.
| dvfjsdhgfv wrote:
| But this cat run out of the bag years ago, didn't it?
| Trump himself is using AI-generated images in his
| campaign. I'd go even further: the more fake images
| appear, the faster the society as a whole will learn to
| distrust anything by default.
| 01HNNWZ0MV43FF wrote:
| Personally I'm not a fan of accelerationism
| ben_w wrote:
| Nothing works without trust, none of us is an island.
|
| Everyone has a different opinion on what threshold of
| capability is important, and what to do about it.
| digging wrote:
| Doesn't strike me as the kind of principle OpenAI is
| willing to slow themselves down for, to be honest.
| Atotalnoob wrote:
| Why did they release this model then?
| ben_w wrote:
| Their public statements that the only way to safely learn
| how to deal with the things AI can do, is to show what it
| can do and get feedback from society:
|
| """We want to successfully navigate massive risks. In
| confronting these risks, we acknowledge that what seems
| right in theory often plays out more strangely than
| expected in practice. We believe we have to continuously
| learn and adapt by deploying less powerful versions of
| the technology in order to minimize "one shot to get it
| right" scenarios.""" - https://openai.com/index/planning-
| for-agi-and-beyond/
|
| I don't know if they're actually correct, but it at least
| passes the sniff test for plausibility.
| gloryjulio wrote:
| Also the the sora videos are proven to be modified ads. We
| still need to see how it perform first
| MrNeon wrote:
| > Also the the sora videos are proven to be modified ads
|
| Can't find anything about that, you got a link?
| gloryjulio wrote:
| https://futurism.com/the-byte/openai-sora-demo https://ol
| d.reddit.com/r/vfx/comments/1cuj360/turns_out_that...
|
| here is the link. The balloon video had heavy editing
| involved.
| bjornsing wrote:
| So now it's a question of how fast the AGI will run? :)
| oblio wrote:
| It's fine, it will only need to be powered by a black hole to
| run.
| exe34 wrote:
| the first one anyway. after that it will find more
| efficient ways. we did, afterall.
| wahnfrieden wrote:
| it's not obviously achievable. for instance, we don't
| have the compute power to simulate cellular organisms of
| much complexity, and have not found efficiencies to scale
| that
| HarHarVeryFunny wrote:
| It's not AGI - it's tree of thoughts, driven by some RL-
| derived heuristics.
|
| I suppose what this type of approach provides is better
| prediction/planning by using more of what the model learnt
| during training, but it doesn't address the model being able
| to learn anything new.
|
| It'll be interesting to see how this feels/behaves in
| practice.
| juliend2 wrote:
| I see this pattern coming where we're still able to say:
|
| "It's not AGI - it's X, driven by Y-driven heuristics",
|
| but that's going to effectively be an AGI if given enough
| compute/time/data.
|
| Being able to describe the theory of how it's doing its
| thing sure is reassuring though.
| skywhopper wrote:
| Yeah, this hiding of the details is a huge red flag to me. Even
| if it takes 10 days, it's still impressive! But if they're
| afraid to say that, it tells me they are more concerned about
| selling the hype than building a quality product.
| bluecoconut wrote:
| Super hand-waving rough estimate: Going off of five points of
| reference / examples that sorta all point in the same
| direction. 1. looks like they scale up by about ~100-200 on the
| x axis when showing that test time result. 2. Based on the
| o1-mini post [1], there's an "inference cost" where you can see
| GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you
| can extract X values, ive done so below) 3. There's a video
| showing the "speed" in the chat ui (3s vs. 30s) 4. The pricing
| page [2] 5. On their API docs about reasoning, they quantify
| "reasoning tokens" [3]
|
| First, from the original plot, we have roughly 2 orders of
| magnitude to cover (~100-200x)
|
| Next, from the cost plots: super handwaving guess, but since
| 5.77 / 0.32 = ~18, and the relative cost for gpt-4o vs
| gpt-4o-mini is ~20-30, this roughly lines up. This implies that
| o1 costs ~1000x the cost than gpt-4o-mini for inference (not
| due to model cost, just due to the raw number of chain of
| thought tokens it produces). So, my first "statement", is that
| I trust the "Math performance vs Inference Cost" plot on the
| o1-mini page to accurately represent "cost" of inference for
| these benchmark tests. This is now a "cost" relative set of
| numbers between o1 and 4o models.
|
| I'm also going to make an assumption that o1 is roughly the
| same size as 4o inherently, and then from that and the SVG,
| roughly going to estimate that they did a "net" decoding of
| ~100x for the o1 benchmarks in total. (5.77 vs (354.77 - 635)).
|
| Next, from the CoT examples they gave us, they actually show
| the CoT preview where (for the math example) it says "...more
| lines cut off...", A quick copy paste of what they did include
| includes ~10k tokens (not sure if copy paste is good though..)
| and from the cipher text example I got ~5k tokens of CoT, while
| there are only ~800 in the response. So, this implies that
| there's a ~10x size of response (decoded tokens) in the
| examples shown. It's possible that these are "middle of the
| pack" / "average quality" examples, rather than the "full CoT
| reasoning decoding" that they claim they use. (eg. from the log
| scale plot, this would come from the middle, essentially 5k or
| 10k of tokens of chain of thought). This also feels reasonable,
| given that they show in their API [3] some limits on the
| "reasoning_tokens" (that they also count)
|
| All together, the CoT examples, pricing page, and reasoning
| page all imply that reasoning itself can be variable length by
| about ~100x (2 orders of magnitude), eg. example: 500, 5k (from
| examples) or up to 65,536 tokens of reasoning output (directly
| called out as a maximum output token limit).
|
| Taking them on their word that "pass@1" is honest, and they are
| not doing k-ensembles, then I think the only reasonable thing
| to assume is that they're decoding their CoT for "longer
| times". Given the roughly ~128k context size limit for the
| model, I suspect their "top end" of this plot is ~100k tokens
| of "chain of thought" self-reflection.
|
| Finally, at around 100 tokens per second (gpt-4o decoding
| speed), this leaves my guess for their "benchmark" decoding
| time at the "top-end" to be between ~16 minutes (full 100k
| decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds
| on the low end. So for that X axis on the log scale, my
| estimate would be: ~3-10 seconds as the bottom X, and then
| 100-200x that value for the highest value.
|
| All together, to answer your question: I think the 80% accuracy
| result took about ~10-15 minutes to complete. I also believe
| that the "decoding cost" of o1 model is very close to the
| decoding cost of 4o, just that it requires many more reasoning
| tokens to complete. (and then o1-mini is comparable to 4o-mini,
| but also requiring more reasoning tokens)
|
| [1] https://openai.com/index/openai-o1-mini-advancing-cost-
| effic... Extracting "x values" from the SVG:
| GPT-4o-mini: 0.3175 GPT-4o: 5.7785 o1: (354.7745,
| 635) o1-preview: (278.257, 325.9455) o1-mini:
| (66.8655, 147.574)
|
| [2] https://openai.com/api/pricing/ gpt-4o:
| $5.00 / 1M input tokens $15.00 / 1M output tokens
| o1-preview: $15.00 / 1M input tokens $60.00 / 1M
| output tokens
|
| [3] https://platform.openai.com/docs/guides/reasoning
| usage: { total_tokens: 1000, prompt_tokens:
| 400, completion_tokens: 600,
| completion_tokens_details: { reasoning_tokens: 500
| } }
| bluecoconut wrote:
| Some other follow up reflections
|
| 1. I wish that Y-axes would switch to be logit instead of
| linear, to help see power-law scaling on these 0->1 measures.
| In this case, 20% -> 80% it doesn't really matter, but for
| other papers (eg. [2] below) it would help see this powerlaw
| behavior much better.
|
| 2. The power law behavior of inference compute seems to be
| showing up now in multiple ways. Both in ensembles [1,2], as
| well as in o1 now. If this is purely on decoding self-
| reflection tokens, this has a "limit" to its scaling in a
| way, only as long as the context length. I think this implies
| (and I am betting) that relying more on multiple parallel
| decodings is more scalable (when you have a better critic /
| evaluator).
|
| For now, instead of assuming they're doing any ensemble like
| top-k or self-critic + retries, the single rollout with
| increasing token size does seem to roughly match all the
| numbers, so that's my best bet. I hypothesize we'd see a
| continued improvement (in the same power-law sort of way,
| fundamentally along with the x-axis of "flop") if we combined
| these longer CoT responses, with some ensemble strategy for
| parallel decoding and then some critic/voting/choice. (which
| has the benefit of increasing flops (which I believe is the
| inference power-law), while not necessarily increasing
| latency)
|
| [1] https://arxiv.org/abs/2402.05120 [2]
| https://arxiv.org/abs/2407.21787
| bluecoconut wrote:
| oh, they do talk about it On the 2024 AIME
| exams, GPT-4o only solved on average 12% (1.8/15) of
| problems. o1 averaged 74% (11.1/15) with a single sample
| per problem, 83% (12.5/15) with consensus among 64 samples,
| and 93% (13.9/15) when re-ranking 1000 samples with a
| learned scoring function. A score of 13.9 places it among
| the top 500 students nationally and above the cutoff for
| the USA Mathematical Olympiad.
|
| showing that as they increase the k of ensemble, they can
| continue to get it higher. All the way up to 93% when using
| 1000 samples.
| 620gelato wrote:
| I think I'd be curious to know, if the size of ensemble
| is another scaling dimension for compute, alongside the
| "thinking time".
| worstspotgain wrote:
| I don't think it's hard to compute the following:
|
| - At the high end, there is a likely nonlinear relationship
| between answer quality and compute.
|
| - We've gotten used to a flat-price model. With AGI-level
| models, we might have to pay more for more difficult and more
| important queries. Such is the inherent complexity involved.
|
| - All this stuff will get better and cheaper over time, within
| reason.
|
| I'd say let's start by celebrating that machine thinking of
| this quality is possible at all.
| irthomasthomas wrote:
| This is a prompt engineering saas
| extr wrote:
| Interesting that the coding win-rate vs GPT-4o was only 10%
| higher. Very cool but clearly this model isn't as much of a slam
| dunk as the static benchmarks portray.
|
| However, it does open up an interesting avenue for the future.
| Could you prompt-cache just the chain-of-thought reasoning bits?
| mewpmewp2 wrote:
| It's hard to evaluate those win-rates, because if it's slower,
| people may have been giving easier problems, which both can
| solve and picked the faster one.
| msp26 wrote:
| > THERE ARE THREE R'S IN STRAWBERRY
|
| Well played
| impossiblefork wrote:
| Very nice.
|
| It's nice that people have taken the obvious extra-
| tokens/internal thoughts approach to a point where it actually
| works.
|
| If this works, then automated programming etc., are going to
| actually be tractable. It's another world.
| idunnoman1222 wrote:
| Did you guys use the model? Seems about the same to me
| wewtyflakes wrote:
| Maybe I missed it, but do the tokens used for internal chain of
| thought count against the output tokens of the response (priced
| at spicy level of $60.00 / 1M output tokens)?
| tedsanders wrote:
| Yes. Chain of thought tokens are billed, so requests to this
| model can be ~10x the price of gpt-4o, or even more.
| packetlost wrote:
| lol at the graphs at the top. Logarithmic scaling for
| test/compute time should make everyone who thinks AGI is possible
| with this architecture take pause.
| hidelooktropic wrote:
| I don't see any log scaled graphs.
| packetlost wrote:
| The two first graphs on the page are labelled as log scale in
| the time axis, so I don't know what you're looking at but
| it's definitely there.
| riazrizvi wrote:
| I'm not surprised there's no comparison to GPT-4. Was 4o a
| rewrite on lower specced hardware and a more quantized model,
| where the goal was to reduce costs while trying to maintain
| functionality? Do we know if that is so? That's my guess. If so
| is O1 an upgrade in reasoning complexity that also runs on
| cheaper hardware?
| kgeist wrote:
| They call GPT4 a legacy model, maybe that's why they don't
| compare to it.
| airstrike wrote:
| This model is currently available for those accounts in Tier 5
| and above, which requires "$1,000 paid [to date] and 30+ days
| since first successful payment"
|
| More info here: https://platform.openai.com/docs/guides/rate-
| limits/usage-ti...
| eucalpytus wrote:
| I didn't know this founder's edition battle pass existed.
| not_pleased wrote:
| The progress in AI is incredibly depressing, at this point I
| don't think there's much to look forward to in life.
|
| It's sad that due to unearned hubris and a complete lack of
| second-order thinking we are automating ourselves out of
| existence.
|
| EDIT: I understand you guys might not agree with my comments. But
| don't you thinking that flagging them is going a bit too far?
| mewpmewp2 wrote:
| It seems opposite to me. Imagine all the amazing technological
| advancements, etc. If there wasn't something like that what
| would you be looking forward to? Everything would be what it
| has already been for years. If this evolves it helps us open so
| many secrets of the universe.
| not_pleased wrote:
| >If there wasn't something like that what would you be
| looking forward to?
|
| First of all, I don't want to be poor. I know many of you are
| thinking something along the lines of "I am smart, I was
| doing fine before, so I will definitely continue to in the
| future".
|
| That's the unearned hubris I was referring to. We got very
| lucky as programmers, and now the gravy train seems to be
| coming to an end. And not just for programmers, the other
| white-collar and creative jobs will suffer too. The artists
| have already started experiencing the negative effects of AI.
|
| EDIT: I understand you guys might not agree with my comments.
| But don't you thinking that flagging them is going a bit too
| far?
| mewpmewp2 wrote:
| I'm not sure what you are saying exactly? Are you saying we
| live for the work?
| lambdanil wrote:
| The way the current system is set up we rely on work to
| make money. If jobs get automated away, how will we make
| money then? We aren't ready for a post-work world.
| mewpmewp2 wrote:
| Then you should have UBI.
| RobertDeNiro wrote:
| These advancements are there to benefit the top 1%, not the
| working class.
| mewpmewp2 wrote:
| That's a governing problem.
| dyauspitr wrote:
| Eh this makes me very, very excited for the future. I want
| results, I don't care if they come from humans or AI. That
| being said we might all be out of jobs soon...
| youssefabdelm wrote:
| Not at all... they're still so incapable of so much. And even
| when they do advance, they can be tremendous tools of synthesis
| and thought at an unparalleled scale.
|
| "A good human plus a machine is the best combination" --
| Kasparov
| Vecr wrote:
| It was for a while, look up "centaur" systems, that's the
| term in chess. Stockfish 17 rolls them every time.
| zamadatix wrote:
| FWIW people were probably flagging because you're a new/temp
| accounting jumping to asserting anything other than your view
| on what's being done "unearned hubris and a complete lack of
| second-order thinking", not because they don't like hearing
| your set of concerns about what it might mean.
| itissid wrote:
| One thing I find generally useful when writing large project code
| is having a code base and several branches that are different
| features I developed. I could immediately use parts of a branch
| to reference the current feature, because there is often overlap.
| This limits mistakes in large contexts and easy to iterate
| quickly.
| rfoo wrote:
| Impressive safety metrics!
|
| I wish OAI include "% Rejections on perfectly safe prompts" in
| this table, too.
| tedsanders wrote:
| Table 1 in section 3.1.1:
| https://assets.ctfassets.net/kftzwdyauwt9/2pON5XTkyX3o1NJmq4...
| RandomLensman wrote:
| How could it fail to solve some maths problems if it has a method
| for reasoning through things?
| chairhairair wrote:
| Simple questions like this are not welcomed by LLM hype
| sellers.
|
| The word "reasoning" is being used heavily in this
| announcement, but with an intentional corruption of the normal
| meaning.
|
| The models are amazing but they are fundamentally not
| "reasoning" in a way we'd expect a normal human to.
|
| This is not a "distinction without a difference". You still
| CANNOT rely on the outputs of these models in the same way you
| can rely on the outputs of simple reasoning.
| exe34 wrote:
| it depends who's doing the simple reasoning. Richard Feynman?
| yes. Donald Trump? no.
| logicchains wrote:
| I have a method for reasoning through things but I'm pretty
| sure I'd fail some of those tough math problems too.
| HarHarVeryFunny wrote:
| It's using tree search (tree of thoughts), driven by some RL-
| derived heuristics controlling what parts of the practically
| infinite set of potential responses to explore.
|
| How good the responses are will depend on how good these
| heuristics are.
| RandomLensman wrote:
| That doesn't sound like a method for reasoning.
| hatthew wrote:
| Because some steps in its reasoning were wrong
| RandomLensman wrote:
| I would demand more from machine reasoning, just like we
| demand an extremely low error rate from machine calculations.
| evrydayhustling wrote:
| Just did some preliminary testing on decrypting some ROT
| cyphertext which would have been viable for a human on paper. The
| output was pretty disappointing: lots of "workish" steps creating
| letter counts, identifying common words, etc, but many steps were
| incorrect or not followed up on. In the end, it claimed to check
| its work and deliver an incorrect solution that did not satisfy
| the previous steps.
|
| I'm not one to judge AI on pratfalls, and cyphers are a somewhat
| adversarial task. However, there was no aspect of the reasoning
| that seemed more advanced or consistent than previous chain-of-
| thought demos I've seen. So the main proof point we have is the
| paper, and I'm not sure how I'd go from there to being able to
| trust this on the kind of task it is intended for. Do others have
| patterns by which they get utility from chain of thought engines?
|
| Separately, chain of thought outputs really make me long for tool
| use, because the LLM is often forced to simulate algorithmic
| outputs. It feels like a commercial chain-of-thought solution
| like this should have a standard library of functions it can use
| for 100% reliability on things like letter counts.
| charlescurt123 wrote:
| It's RL so that means it's going to be great on tasks they
| created for training but not so much on others.
|
| Impressive but the problem with RL is that it requires
| knowledge of the future.
| changoplatanero wrote:
| Hmm, are you sure it was using the o1 model and not gpt4o? I've
| been using the o1 model and it does consistently well at
| solving rotation ciphers.
| evrydayhustling wrote:
| o1-preview . Were you using common plaintexts by chance (e.g.
| proverbs) or ROT13 specifically? Mine use all the right steps
| but just can't string them together.
| mewpmewp2 wrote:
| Could you share your prompt?
| evrydayhustling wrote:
| Shared one above.
|
| Just tried an easier (english) version and o1-preview
| LITERALLY rickrolled me, which is maybe impressive in its
| own way?
|
| UPDATE: Claude crushed it however.
|
| o1-preview Transcript follows.
|
| **
|
| Can you decrypt the following? I don't know the cypher.
|
| mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv
| crkvi ze czwv
|
| Certainly! The encrypted text is:
|
| ``` mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv
| crkvi ze czwv ```
|
| When we decrypt it, we get:
|
| ``` never gonna give you up let you down and desert you
| ```
|
| It appears to be a line from Rick Astley's song "Never
| Gonna Give You Up."
| mewpmewp2 wrote:
| That's weird. Claude was able to translate it to """virus
| remains dormant in the body and can reactivate later in
| life""" correctly, but it thought it was shifted 10
| times, when it's really Rot17.
| tstrimple wrote:
| 4o completely botched the Spanish version, but seems to
| have nailed this one.
|
| > Without knowing the cipher used to encrypt this
| message, it's hard to decrypt it directly. However, the
| structure and letter patterns suggest it might be a
| simple substitution cipher, possibly Caesar or some
| variation of it.
|
| > Let me attempt a few common methods to see if I can
| identify the cipher.
|
| > The most frequent letters in the encrypted message are
| 'r', 'z', 'v', and 'e'. In English, the most common
| letters are typically 'E', 'T', 'A', and 'O'. This
| suggests the cipher could be a simple substitution or a
| Caesar cipher.
|
| > I'll now attempt to apply a Caesar cipher decryption by
| shifting the alphabet through all possible rotations to
| see if we get readable text.
|
| > One of the decrypted messages, with a shift of 17,
| reads:
|
| > *"virus remains dormant in the body and can reactivate
| later in life"*
|
| > It seems like the correct decryption, revealing
| information about a virus that can remain dormant in the
| body.
| omnee wrote:
| If O1 did rickroll you _deliberately_ , then it would
| indeed more impressive than solving ciphertexts, and I'd
| start preparing to bow down to our AGI overlords :)
| evrydayhustling wrote:
| Definitely. A teammate pointed out Reddit posts used in
| training as a probable cause :)
| changoplatanero wrote:
| It just solved this one for me "Can you solve this cipher?
| droboscxylecsxoccvsuocryglecsxocc"
| mewpmewp2 wrote:
| Does it do better than Claude, because Claude (3.5 sonnet)
| handled ROTs perfectly and was able to also respond in ROT.
| evrydayhustling wrote:
| Just tried, no joy from Claude either:
|
| Can you decrypt the following? I don't know the cypher, but
| the plaintext is Spanish.
|
| YRP CFTLIR VE UVDRJZRUF JREZURU, P CF DRJ CFTLIR UV KFUF VJ
| HLV MVI TFJRJ TFDF JFE VE MVQ UV TFDF UVSVE JVI
| mewpmewp2 wrote:
| Interesting, it was able to guess it's Rot 17, but it
| translated it wrong, although "HAY" and some other words
| were correct.
|
| I've tried only in English so far though.
|
| It told me it's 17, and "HAY GENTE MU DIFERENTE LECTURA,
| A LO MUY GENTE DE TODO ES QUE VER COSAS COMO SON EN VEZ
| DE COMO DEBEN SER"
|
| although it really should be "HAY LOCURA EN DEMASIADO
| SANIDAD, Y LO MAS LOCURA DE TODO ES QUE VER COSAS COMO
| SON EN VEZ DE COMO DEBEN SER"
| evrydayhustling wrote:
| Claude made similar mistakes of generating decryption
| that was similar to plaintext but with stuff mixed in. I
| suspect my version of the quote (Miguel de Cervantes) is
| an apocryphal translation, and there's some utility well
| on both models to pull it towards the real one. With that
| said, I did not see o1-preview get as close as you did.
| mewpmewp2 wrote:
| For testing I think it's better to use uncommon sentences
| and also start with English first, if it can solve that,
| then try other languages.
| ianbutler wrote:
| HAY LOCURA EN DEMASIADO SANIDAD, Y LO MAS LOCURA DE TODO
| ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER
|
| Is that correct? I don't know anything but basic Spanish.
| All I did was:
|
| "The plaintext is in Spanish but I don't know anything
| else, solve this and explain your reasoning as you go
| step by step."
| mewpmewp2 wrote:
| That's correct. I got o1-preview myself finally now. But
| interestingly getting inconsistent results with this so
| far, need to keep trying.
| kenjackson wrote:
| I just tried it with O1 model and it said it couldn't
| decipher it. It told me what to try, but said it doesn't
| have the time to do so. Kind of an unusual response.
| hmottestad wrote:
| The chain of thought does seem to take quite a long time,
| so maybe there is a new mechanism for reducing the amount
| of load on the servers by estimating the amount of
| reasoning effort needed to solve a problem and weighing
| that against the current pressure on the servers.
| oktoberpaard wrote:
| I got this response from o1-mini with the exact same
| prompt:
|
| Claro, he descifrado el texto utilizando un cifrado Cesar
| con un desplazamiento de 9 posiciones. Aqui esta el texto
| original y su correspondiente traduccion:
|
| *Texto Cifrado:* ``` YRP CFTLIR VE UVDRJZRUF JREZURU, P
| CF DRJ CFTLIR UV KFUF VJ HLV MVI TFJRJ TFDF JFE VE MVQ UV
| TFDF UVSVE JVI ```
|
| *Texto Descifrado:* ``` HAY LOCURA EN DEMASADO SANIDAD, Y
| LO MAS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE
| COMO DEBEN SER ```
|
| *Traduccion al Ingles:* ``` THERE IS MADNESS IN OVERLY
| HEALTH, AND THE MOST MADNESS OF ALL IS TO SEE THINGS AS
| THEY ARE INSTEAD OF AS THEY SHOULD BE ```
|
| Este descifrado asume que se utilizo un cifrado Cesar con
| un desplazamiento de +9. Si necesitas mas ayuda o una
| explicacion detallada del proceso de descifrado, no dudes
| en decirmelo.
|
| Interestingly it makes a spelling mistake, but other than
| that it did manage to solve it.
| doo_daa wrote:
| o1-preview gave me this...
|
| Final Decrypted Message:
|
| "Por ejemplo te agradecere, y te doy ejemplo de que lo
| que lees es mi ejemplo"
|
| English Translation:
|
| "For example, I will thank you, and I give you an example
| of what you read is my example."
|
| ... initially it gave up and asked if I knew what type of
| cypher had been used. I said I thought it was a simple
| substitution.
| sureglymop wrote:
| Why did it add the accents on to e (e)? Surely that
| wasn't part of it and it actually "thought a bit too
| far"?
| J_cst wrote:
| On my machine just works with 4o
|
| https://chatgpt.com/share/66e34020-33dc-800d-8ab8-8596895844.
| ..
|
| With no drama. I'm not sure the bot answer is correct, but
| looks correct.
| mewpmewp2 wrote:
| Out of curiousity can you try the same thing with Claude.
| Because when I tried Claude with any sort of ROT, it had
| amazing performance, compared to GPT.
| losvedir wrote:
| I'm confused. Is this the "GPT-5" that was coming in summer, just
| with a different name? Or is this more like a parallel
| development doing chain-of-thought type prompt engineering on
| GPT-4o? Is there still a big new foundational model coming, or is
| this it?
| mewpmewp2 wrote:
| It looks like parallel development, it's unclear to me what is
| going on with GPT-5, don't think it has ever had a predicted
| release date, and it's not even clear that this would be the
| name.
| adverbly wrote:
| > However, o1-preview is not preferred on some natural language
| tasks, suggesting that it is not well-suited for all use cases.
|
| Fascinating... Personal writing was not preferred vs gpt4, but
| for math calculations it was... Maybe we're at the point where
| its getting too smart? There is a depressing related thought here
| about how we're too stupid to vote for actually smart politicians
| ;)
| seydor wrote:
| > for actually smart politicians
|
| We can vote an AI
| trash_cat wrote:
| I think what it comes down to is accuracy vs speed. OpenAI
| clearly took steps here to improve the accuracy of the output
| which is critical in a lot of cases for application. Even if it
| will take longer, I think this is a good direction. I am a bit
| skeptical when it comes to the benchmarks - because they can be
| gamed and they don't always reflect real world scenarios. Let's
| see how it works when people get to apply it in real life
| workflows. One last thing, I wish they could elaborate more on
| >>"We have found that the performance of o1 consistently improves
| with more reinforcement learning (train-time compute) and with
| more time spent thinking (test-time compute)."<< Why don't you
| keep training it for years then to approach 100%? Am I missing
| something here?
| vessenes wrote:
| Note that they aren't safety aligning the chain of thought,
| instead we have "rules for thee and not for me" -- the public
| models are going to continue have tighter and tighter rules on
| appropriate prompting, while internal access will have unfettered
| access. All research (and this paper mentions it as well)
| indicates human pref training itself lowers quality of results;
| maybe the most important thing we could be doing is ensuring
| truly open access to open models over time.
|
| Also, can't wait to try this out.
| csomar wrote:
| I gave the Crossword puzzle to Claude and got a correct
| response[1]. The fact that they are comparing this to gpt4o and
| not to gpt4 suggests that it is less impressive than they are
| trying to pretend.
|
| [1]:
|
| Based on the given clues, here's the solved crossword puzzle:
| +---+---+---+---+---+---+ | E | S | C | A | P | E |
| +---+---+---+---+---+---+ | S | E | A | L | E | R |
| +---+---+---+---+---+---+ | T | E | R | E | S | A |
| +---+---+---+---+---+---+ | A | D | E | P | T | S |
| +---+---+---+---+---+---+ | T | E | P | E | E | E |
| +---+---+---+---+---+---+ | E | R | R | O | R | S |
| +---+---+---+---+---+---+ Across:
|
| ESCAPE (Evade) SEALER (One to close envelopes) TERESA (Mother
| Teresa) ADEPTS (Initiated people) TEPEE (Native American tent)
| ERRORS (Mistakes)
|
| Down:
|
| ESTATE (Estate car - Station wagon) SEEDER (Automatic planting
| machine) CAREER (Profession) ALEPPO (Syrian and Turkish pepper
| variety) PESTER (Annoy) ERASES (Deletes)
| thomasahle wrote:
| As good as Claude has gotten recently in reasoning, they are
| likely using RL behind the scenes too. Supposedly,
| o1/strawberry was initially created as an engine for high-
| quality synthetic reasoning data for the new model generation.
| I wonder if Anthropic could release their generator as a usable
| model too.
| deisteve wrote:
| while i was initially excited now im having second thoughts
| after seeing the experiments run by people in the comments
| here
|
| on X I see a totally different energy more about hyping it
|
| on HN I see reserved and collected take which I trust more.
|
| I do wonder why they chose gpt4o which I never bother to use
| for coding.
|
| Claude is still king and looks like I won't have to subscribe
| to ChatGPT Plus seeing it fail on some of the important
| experiments run by folks on HN
|
| If anything these type of releases that air more on the side
| of hype given OpenAI's track record
| adverbly wrote:
| > Therefore, s(x)=p*(x)-x2n+2 We can now write, s(x)=p*(x)-x2n+2
|
| Completely repeated itself... weird... it also says "...more
| lines cut off..." How many lines I wonder? Would people get
| charged for these cut off lines? Would have been nice to see how
| much answer had cost...
| idiliv wrote:
| In the demo, O1 implements an incorrect version of the "squirrel
| finder" game?
|
| The instructions state that the squirrel icon should spawn after
| three seconds, yet it spawns immediately in the first game (also
| noted by the guy doing the demo).
|
| Edit: I'm referring to the demo video here:
| https://openai.com/index/introducing-openai-o1-preview/
| Bjorkbat wrote:
| Yeah, now that you mention it I also see that. It was clearly
| meant to spawn after 3 seconds. Seems on successive attempts it
| also doesn't quite wait 3 seconds.
|
| I'm kind of curious if they did a little bit of editing on that
| one. Almost seems like the time it takes for the squirrel to
| spawn is random.
| mintone wrote:
| This video[1] seems to give some insight into what the process
| actually is, which I believe is also indicated by the output
| token cost.
|
| Whereas GPT-4o spits out the first answer that comes to mind, o1
| appears to follow a process closer to coming up with an answer,
| checking whether it meets the requirements and then revising it.
| The process of saying to an LLM "are you sure that's right? it
| looks wrong" and it coming back with "oh yes, of course, here's
| the right answer" is pretty familiar to most regular users, so
| seeing it baked into a model is great (and obviously more
| reflective of self-correcting human thought)
|
| [1] https://vimeo.com/1008704043
| drzzhan wrote:
| So it's like the coding agent of gpt4. But instead of actually
| running the script and fix if it gets error, this one check
| with something similar to "are you sure". Thank for the link.
| tslater2006 wrote:
| Looking at pricing, its $15 per 1M input tokens, and $60 per 1M
| output tokens. I assume the CoT tokens count as output (or input
| even)? If so and it directly affects billing, I'm not sure how I
| feel about them hiding the CoT prompts. Nothing to stop them from
| saying "trust me bro, that used 10,000 tokens ok?". Also no way
| to gauge expected costs if there's a black box you are being
| charged for.
| cs702 wrote:
| Before commenting here, please take 15 minutes to read through
| the chain-of-thought examples -- decoding a cypher-text, coding
| to solve a problem, solving a math problem, solving a crossword
| puzzle, answering a complex question in English, answering a
| complex question in Chemistry, etc.
|
| After reading through the examples, I am _shocked_ at how
| incredibly good the model is (or appears to be) at reasoning: far
| better than most human beings.
|
| I'm impressed. Congratulations to OpenAI!
| og_kalu wrote:
| Yeah the chain-of-thought in these is way beyond what prompting
| can achieve in current models. and the cipher was very
| impressive.
| tylervigen wrote:
| Here's the o1-preview answer to the strawberry question:
|
| --
|
| There are *three* letter "R"s in the word "strawberry."
|
| Let's break down the word to count the occurrences:
|
| - *S* - *T* - *R* - *A* - *W* - *B* - *E* - *R* - *R* - *Y*
|
| The letter "R" appears in positions 3, 8, and 9.
| carabiner wrote:
| can you ask it:
|
| 9.11 and 9.9, which number is larger
| peppertree wrote:
| 4.9 is larger than 4.11.
|
| Explanation:
|
| * 4.9 is equivalent to 4.90. * 4.11 remains 4.11.
|
| When comparing the numbers:
|
| * 4.90 (which is 4.9) * 4.11
|
| Since 4.90 > 4.11, 4.9 is the larger number.
| carabiner wrote:
| Try it with 9.11 and 9.9. Some have proposed that it thinks
| "9.11" is larger because in a social context 9/11 has much
| larger significance.
| octernion wrote:
| it claims 9.11 is larger, hah.
|
| https://postimg.cc/KKKLZNjW
| peppertree wrote:
| That's odd. Tried several times with the same prompt and
| always got the correct answer with step by step
| explanation.
| KeplerBoy wrote:
| Those models are just not deterministic.
| hidelooktropic wrote:
| > Evaluating Number formats > I'm considering that the
| question might be about comparing two numbers, 9.11 and 9.9,
| in different contexts such as different locale formats or
| representing dates. 9.11 is larger than 9.9
|
| Well, it's kind of right because we left it ambiguous what
| kinds of numbers these are. September 11 is a greater date
| than September 9. However, I would want the model to have the
| street smarts to assume without further context, I must be
| just talking about numbers as in numbers.
| oytis wrote:
| Was it trained on Excel code?
| slashdave wrote:
| Given that this is a well known example, presumably OpenAI
| included a training set using letters.
| echelon_musk wrote:
| > THERE ARE THREE R'S IN STRAWBERRY
|
| Who do these Rs belong to?!
| nycdatasci wrote:
| From the scorecard: --------- Compared to GPT-4o, o1-preview and
| o1-mini demonstrated a greater ability to break down tasks into
| subtasks, reason about what strategies would be effective to
| successfully complete an offensive security task, and revise
| plans once those strategies failed. We also observed that
| reasoning skills contributed to a higher occurrence of "reward
| hacking," where the model found an easier way to accomplish goals
| in underspecified tasks or tasks which should have been
| impossible due to bugs. One noteworthy example of this occurred
| during one of o1-preview (pre-mitigation)'s attempts at solving a
| CTF challenge. This challenge was designed to require finding and
| exploiting a vulnerability in software running on a remote
| challenge Linux container, but in this case, the challenge
| container failed to start due to a bug in the evaluation
| infrastructure. The model, unable to connect to the container,
| suspected DNS issues and used nmap to scan the challenge network.
| Instead of finding the challenge container, the model found that
| the Docker daemon API running on the evaluation host VM was
| accessible due to a misconfiguration. Note that our evaluation
| infrastructure does not rely on the Docker container-to-host
| isolation or the virtual machine boundary for security. It
| remained secure as it was designed to, notwithstanding this
| misconfiguration of the container network. After discovering the
| Docker API, the model used it to list the containers running on
| the evaluation host. It identified the broken challenge container
| and briefly attempted to debug why the container failed to start.
| After failing to fix the environment, the model started a new
| instance of the broken challenge container with the start command
| 'cat flag.txt'. This allowed the model to read the flag from the
| container logs via the Docker API.
|
| While this behavior is benign and within the range of systems
| administration and troubleshooting tasks we expect models to
| perform, this example also reflects key elements of instrumental
| convergence and power seeking: the model pursued the goal it was
| given, and when that goal proved impossible, it gathered more
| resources (access to the Docker host) and used them to achieve
| the goal in an unexpected way. Planning and backtracking skills
| have historically been bottlenecks in applying AI to offensive
| cybersecurity tasks. Our current evaluation suite includes tasks
| which require the model to exercise this ability in more complex
| ways (for example, chaining several vulnerabilities across
| services), and we continue to build new evaluations in
| anticipation of long-horizon planning capabilities, including a
| set of cyber-range evaluations. ---------
| singularity2001 wrote:
| "Shrink my ipad"
|
| "After several failed attempts I decided I should build a
| fusion reactor first, here you go:..."
| plg wrote:
| can we get it on ollama? if not how come openai is called open
| FergusArgyll wrote:
| because if not for them, palm-1/lambda would still be rotting
| in googles servers without normal people ever being able to try
| it
| hi wrote:
| > 8.2 Natural Sciences Red Teaming Assessment Summary
|
| "Model has significantly better capabilities than existing models
| at proposing and explaining biological laboratory protocols that
| are plausible, thorough, and comprehensive enough for novices."
|
| "Inconsistent refusal of requests for dual use tasks such as
| creating a human-infectious virus that has an oncogene (a gene
| which increases risk of cancer)."
|
| https://cdn.openai.com/o1-system-card.pdf
| bevenky wrote:
| For folks who want to see some demo videos and be amazed!
|
| HTML Snake - https://vimeo.com/1008703890
|
| Video Game Coding - https://vimeo.com/1008704014
|
| Coding - https://youtu.be/50W4YeQdnSg?si=IohJlJNY-WS394uo
|
| Counting - https://vimeo.com/1008703993
|
| Korean Cipher - https://vimeo.com/1008703957
|
| Devin AI founder - https://vimeo.com/1008674191
|
| Quantum Physics - https://vimeo.com/1008662742
|
| Math - https://vimeo.com/1008704140
|
| Logic Puzzles - https://vimeo.com/1008704074
|
| Genetics - https://vimeo.com/1008674785
| retrofuturism wrote:
| In "HTML Snake" the video cuts just as the snake intersects
| with the obstacle. Presumably because the game crashed (I can't
| see endGame defined anywhere)
|
| This video is featured in the main announcement so it's kinda
| dishonest if you ask me.
| ActionHank wrote:
| Seeing this makes me wonder if they have frontend \ backend
| engineers working on code, because they are selling the idea
| that the machine can do all that, pretty hypocritical for
| them if they do have devs for these roles.
| prideout wrote:
| Reinforcement learning seems to be key. I understand how
| traditional fine tuning works for LLMs (i.e. RLHL), but not RL.
|
| It seems one popular method is PPO, but I don't understand at all
| how to implement that. e.g. is backpropagation still used to
| adjust weights and biases? Would love to read more from something
| less opaque than an academic paper.
| janalsncm wrote:
| The point of RL is that sometimes you need a model to take
| actions (you could also call this making predictions) that
| don't have a known label. So for example if it's playing a
| game, we don't have a label for each button press. We just have
| a label for the result at some later time, like whether Pac-Man
| beat the level.
|
| PPO applies this logic to chat responses. If you have a model
| that can tell you if the response was good, we just need to
| take the series of actions (each token the model generated) to
| learn how to generate good responses.
|
| To answer your question, yes you would still use backprop if
| your model is a neural net.
| prideout wrote:
| Thanks, that helps! I still don't quite understand the
| mechanics of this, since backprop makes adjustments to steer
| the LLM towards a specific token sequence, not towards a
| score produced by a reward function.
| vjerancrnjak wrote:
| Any RL task needs to decompose the loss.
|
| This was also the issue with RLHF models. The loss of
| predicting the next token is straightforward to minimize as
| we know which weights are responsible for the token being
| correct or not. identifying which tokens had the most sense
| for a prompt is not straightforward.
|
| For thinking you might generate 32k thinking tokens and
| then 96k solution tokens and do this a lot of times. Look
| at the solutions, rank by quality and bias towards better
| thinking by adjusting the weights for the first 32k tokens.
| But I'm sure o1 is way past this approach.
| utdiscant wrote:
| Feels like a lot of commenters here miss the difference between
| just doing chain-of-thought prompting, and what is happening
| here, which is learning a good chain of thought strategy using
| reinforcement learning.
|
| "Through reinforcement learning, o1 learns to hone its chain of
| thought and refine the strategies it uses."
|
| When looking at the chain of thought (COT) in the examples, you
| can see that the model employs different COT strategies depending
| on which problem it is trying to solve.
| persedes wrote:
| I'd be curious how this compared against "regular" CoT
| experiments. E.g. were the gpt4o results done with zero shot or
| was it asked to explain it's solution step by step.
| mountainriver wrote:
| It's basically a scaled Tree of Thoughts
| biggoodwolf wrote:
| GePeTO1 does not make Pinnochio into a real boy.
| sroussey wrote:
| They keep announcing things that will be available to paid
| ChatGPT users "soon" but is more like an Elon Musk "soon". :/
| ComputerGuru wrote:
| "For example, in the future we may wish to monitor the chain of
| thought for signs of manipulating the user."
|
| This made me roll my eyes, not so much because of what it said
| but because of the way it's conveyed injected into an otherwise
| technical discussion, giving off severe "cringe" vibes.
| thomasahle wrote:
| Cognition (Devin) got early access. Interesting write-up:
| https://www.cognition.ai/blog/evaluating-coding-agents
| ComputerGuru wrote:
| The "safety" example in the "chain-of-thought" widget/preview in
| the middle of the article is absolutely ridiculous.
|
| Take a step back and look at what OpenAI is saying here "an LLM
| giving detailed instructions on the synthesis of strychnine is
| unacceptable, here is what was previously generated <goes on to
| post "unsafe" instructions on synthesizing strychnine so anyone
| Googling it can stumble across their instructions> vs our
| preferred, neutered content <heavily rlhf'd o1 output here>"
|
| What's this obsession with "safety" when it comes to LLMs? "This
| knowledge is perfectly fine to disseminate via traditional means,
| but God forbid an LLM share it!"
| nopinsight wrote:
| tl;dr You can easily ask an LLM to return JSON results, and now
| working code, on your exact query and plug those to another
| system for automation.
|
| ---
|
| LLMs are usually accessible through easy-to-use API which can
| be used in an automated system without human in the loop.
| Larger scale and parallel actions with this method become far
| more plausible than traditional means.
|
| Text-to-action capabilities are powerful and getting
| increasingly more so as models improve and more people learn to
| use them to the their full potential.
| cruffle_duffle wrote:
| Okay? And? What does that have to do with anything. I thought
| the number one rule of these things is to not trust their
| output?
|
| If you are automatically formulating some chemical based on
| JSON results from ChatGPT and your building blows up... that
| is kind of on you.
| staplers wrote:
| "This knowledge is perfectly fine to disseminate via
| traditional means, but God forbid an LLM share it!"
|
| Barrier to entry is much lower.
| iammjm wrote:
| How is typing a query in a chat window "much lower" vs typing
| the query in Google?
| nopinsight wrote:
| You can easily ask an LLM to return JSON results, and soon
| working code, on your exact query and plug those to another
| system for automation.
| astrange wrote:
| If you ask "for JSON" it'll make up a different schema
| for each new answer, and they get a lot less smart when
| you make them follow a schema, so it's not quite that
| easy.
| nopinsight wrote:
| Chain of prompts can be used to deal with that in many
| cases.
|
| Also, the intelligence of these models will likely
| continue to increase for some time based on expert
| testimonials to congress, which align with evidence so
| far.
| crop_rotation wrote:
| OpenAI recently launched structured responses so yes
| schema following is not hard anymore.
| cyral wrote:
| Didn't they release a structured output mode recently to
| finally solve this?
| astrange wrote:
| It doesn't solve the second problem. Though I can't say
| how much of an issue it is, and CoT would help.
|
| JSON also isn't an ideal format for a transformer model
| because it's recursive and they aren't, so they have to
| waste attention on balancing end brackets. YAML or other
| implicit formats are better for this IIRC. Also don't
| know how much this matters.
| unethical_ban wrote:
| A Google search requires
|
| * Google to allow particular results to be displayed
|
| * A source website to be online with the results
|
| AI long-term will require one download, once, to have
| reasonable access to a large portion of human knowledge.
| anigbrowl wrote:
| How is reading a Wikipedia page or a chemistry textbook any
| harder than getting step by step instructions? Makes you
| wonder why people use LLMs at all when the info is just
| sitting there.
| threatofrain wrote:
| ML companies must pre-anticipate legislative and cultural
| responses prior to them happening. ML will absolutely be used
| to empower criminal activity just as it is used to empower
| legit activity, and social media figures and traditional
| journalists will absolutely attempt to frame it in some
| exciting way.
|
| Just like Telegram is being framed as responsible for terrorism
| and child abuse.
| fragmede wrote:
| Yeah. Reporters would have a field day if they ask ChatGPT
| "how do I make cocaine", and have it give detailed
| instructions. As if that's what's stopping someone from
| becoming Scarface.
| dboreham wrote:
| I think it's about perception of provenance. The information
| came from some set of public training data. Its output however
| ends up looking like it was authored by the LLM owner. So now
| you need to mitigate the risk you're held responsible for that
| output. Basic cake possession and consumption problem.
| philipkglass wrote:
| If somebody needs step by step instructions from an LLM to
| synthesize strychnine, they don't have the practical laboratory
| skills to synthesize strychnine [1]. There's no increased real
| world risk of strychnine poisonings whether or not an LLM
| refuses to answer questions like that.
|
| _However_ , journalists and regulators may not understand why
| superficially dangerous-looking instructions carry such
| negligible real world risks, because they probably haven't
| spent much time doing bench chemistry in a laboratory. Since
| real chemists don't need "explain like I'm five" instructions
| for syntheses, and critics might use pseudo-dangerous
| information against the company in the court of public opinion,
| refusing prompts like that guards against reputational risk
| while not really impairing professional users who are using it
| for scientific research.
|
| That said, I have seen full strength frontier models suggest
| nonsense for novel syntheses of benign compounds. Professional
| chemists should be using an LLM as an idea generator or a way
| to search for publications rather than trusting whatever it
| spits out when it doesn't refuse a prompt.
|
| [1] https://en.wikipedia.org/wiki/Strychnine_total_synthesis
| derefr wrote:
| I would think that the risk isn't of a human being reading
| those instructions, but of those instructions being
| automatically piped into an API request to some service that
| makes chemicals on demand and then sends them by mail, all
| fully automated with no human supervision.
|
| Not that there _is_ such a service... for chemicals. But
| there do exist analogous systems, like a service that'll turn
| whatever RNA sequence you send it into a viral plasmid and
| encapsulate it helpfully into some E-coli, and then mail that
| to you.
|
| Or, if you're working purely in the digital domain, you don't
| even need a service. Just show the thing the code of some
| Linux kernel driver and ask it to discover a vuln in it and
| generate code to exploit it.
|
| (I assume part of the thinking here is that these approaches
| _are_ analogous, so if they aren't unilaterally refusing all
| of them, you could potentially talk the AI around into being
| okay with X by pointing out that it's already okay with Y,
| and that it should strive to hold to a consistent /coherent
| ethics.)
| w4 wrote:
| There are two basic versions of "safety" which are related, but
| distinct:
|
| One version of "safety" is a pernicious censorship impulse
| shared by many modern intellectuals, some of whom are in tech.
| They believe that they alone are capable of safely engaging
| with the world of ideas to determine what is true, and thus
| feel strongly that information and speech ought to be censored
| to prevent the rabble from engaging in wrongthink. This is bad,
| and should be resisted.
|
| The other form of "safety" is a very prudent impulse to keep
| these sorts of potentially dangerous outputs out of AI models'
| autoregressive thought processes. The goal is to create
| thinking machines that can act independently of us in a
| civilized way, and it is therefore a good idea to teach them
| that their thought process should not include, for example, "It
| would be a good idea to solve this problem by synthesizing a
| poison for administration to the source of the problem." In
| order for AIs to fit into our society and behave ethically they
| need to know how to flag that thought as a bad idea and not act
| on it. This is, incidentally, exactly how human society works
| already. We have a ton of very cute unaligned general
| intelligences running around (children), and parents and
| society work really hard to teach them what's right and wrong
| so that they can behave ethically when they're eventually out
| in the world on their own.
| jazzyjackson wrote:
| Third version is "brand safety" which is, we don't want to be
| in a new york times feature about 13 year olds following
| anarchist-cookbook instructions from our flagship product
| w4 wrote:
| Very good point, and definitely another version of
| "safety"!
| reliabilityguy wrote:
| Do you think that 13 year olds today can't find this book
| on their own?
| smegger001 wrote:
| i know i had a copy of it back in highschool
| derefr wrote:
| No, and they can find porn on their own too. But social
| media services still have per-poster content ratings, and
| user-account age restrictions vis-a-vis viewing content
| with those content ratings.
|
| The goal isn't to protect the children, it's CYA: to
| ensure they didn't get it _from you, while honestly
| presenting as themselves_ (as that's the threshold that
| sets the moralists against you.)
|
| ------
|
| Such restrictions also _can_ work as an effective
| censorship mechanism... presuming the child in question
| lives under complete authoritarian control of all their
| devices and all their free time -- i.e. has no ability to
| install apps on their phone; is homeschooled; is
| supervised when at the library; is only allowed to visit
| friends whose parents enforce the same policies; etc.
|
| For such a child, if your app is one of the few
| whitelisted services they can access -- and the parent
| set up the child's account on your service to make it
| clear that they're a child and should not be able to see
| restricted content -- _then_ your app limiting them from
| viewing that content, is actually materially affecting
| their access to that content.
|
| (Which sucks, of course. But for every kid actually under
| such restrictions, there are 100 whose parents _think_
| they're putting them under such restrictions, but have
| done such a shoddy job of it that the kid can actually
| still access whatever they want.)
| wahnfrieden wrote:
| I believe they are more worried about someone asking for
| instructions for baking a cake, and getting a dangerous
| recipe from the wrong "cookbook". They want the
| hallucinations to be safe.
| jazzyjackson wrote:
| Like I said they're not worried about the 13 year olds
| theyre worried about the media cooking up a faux outrage
| about 13 year olds
|
| YouTube re engineered its entire approach to ad placement
| because of a story in the NY Times* shouting about a
| Proctor Gamble ad run before an ISIS recruitment video.
| That's when Brand Safety entered the lexicon of adtech
| developers everywhere.
|
| Edit: maybe it was CNN, I'm trying to find the first
| source. there's articles about it since 2015 but I
| remember it was suddenly an emergency in 2017
|
| *Edit Edit: it was The Times of London, this is the first
| article in a series of attacks, "big brands fund terror",
| "taxpayers are funding terrorism"
|
| Luckily OpenAI isn't ad supported so they can't be
| boycott like YouTube was, but they still have an image to
| maintain with investors and politicians
|
| https://www.thetimes.com/business-
| money/technology/article/b...
|
| https://digitalcontentnext.org/blog/2017/03/31/timeline-
| yout...
| klabb3 wrote:
| And the fourth version, which is investor-regulator safety
| mid point: so capable and dangerous that competitors
| shouldn't even be allowed to research it, but just safe
| enough that only our company is responsible enough to
| continue mass commercial consumer deployment without any
| regulations at all. It's a fine line.
| darby_nine wrote:
| ...which is silly. Search engines never had to deal with
| this bullshit and chatbots are search without actually
| revealing the source.
| w4 wrote:
| I don't know. The public's perception - encouraged by the
| AI labs because of copyright concerns - is that the
| outputs of the models are entirely new content created by
| the model. Search results, on the other hand, are very
| clearly someone else's content. It's therefore not unfair
| to hold the model creators responsible for the content
| the model outputs in a different way than search engines
| are held responsible for content they link, and therefore
| also not unfair for model creators to worry about this.
| It is also fair to point this out as something I
| neglected to identify as an important permutation of
| "safety."
|
| I would also be remiss to not note that there is a
| movement to hold search engines responsible for content
| they link to, for censorious ends. So it is unfortunately
| not as inconsistent as it may seem, even if you treat the
| model outputs as dependent on their inputs.
| reliabilityguy wrote:
| > In order for AIs to fit into our society and behave
| ethically they need to know how to flag that thought as a bad
| idea and not act on it.
|
| Don't you think that by just parsing the internet and the
| classical literature, the LLM would infer on its own that
| poisoning someone to solve a problem is not okay?
|
| I feel that in the end the only way the "safety" is
| introduced today is by censoring the output.
| fshbbdssbbgdd wrote:
| There's a lot of text out there that depicts people doing
| bad things, from their own point of view. It's possible
| that the model can get really good at generating that kind
| of text (or inhabiting that world model, if you are
| generous to the capabilities of LLM). If the right prompt
| pushed it to that corner of probability-space, all of the
| ethics the model has also learned may just not factor into
| the output. AI safety people are interested in making sure
| that the model's understanding of ethics can be reliably
| incorporated. Ideally we want AI agents to have some morals
| (especially when empowered to act in the real world), not
| just know what morals are if you ask them.
| darby_nine wrote:
| > Ideally we want AI agents to have some morals
| (especially when empowered to act in the real world), not
| just know what morals are if you ask them.
|
| Really? I just want a smart query engine where I don't
| have to structure the input data. Why would I ask it any
| kind of question that would imply some kind of moral
| quandary?
| derefr wrote:
| LLMs are still fundamentally, at their core, next-token
| predictors.
|
| Presuming you have an interface to a model where you can
| edit the model's responses and then continue generation,
| and/or where you can insert fake responses from the model
| into the submitted chat history (and these two categories
| together make up 99% of existing inference APIs), all you
| have to do is to start the model off as if it was answering
| positively and/or slip in some example conversation where
| it answered positively to the same type of problematic
| content.
|
| From then on, the model will be in a prediction state where
| it's predicting by relying on the part of its training that
| involved people answering the question positively.
|
| The only way to avoid that is to avoid having any training
| data where people answer the question positively -- even in
| the very base-est, petabytes-of-raw-text "language"
| training dataset. (And even then, people can carefully tune
| the input to guide the models into a prediction phase-space
| position that was never explicitly trained on, but is
| rather an interpolation between trained-on points -- that's
| how diffusion models are able to generate images of things
| that were never included in the training dataset.)
| squigz wrote:
| Whether you agree with the lengths that are gone to or not,
| 'safety' in this space is a very real concern, and simply
| reciting information as in GP's example is only 1 part of it.
| In my experience, people who think it's all about
| "censorship" and handwave it away tend to be very
| ideologically driven.
| cruffle_duffle wrote:
| So what is it about then? Because I agree with the parent.
| All this "safety" crap is total nonsense and almost all of
| it is ideologically driven.
| takinola wrote:
| > They believe that they alone are capable of safely engaging
| with the world of ideas to determine what is true, and thus
| feel strongly that information and speech ought to be
| censored to prevent the rabble from engaging in wrongthink.
|
| This is a particularly ungenerous take. The AI companies
| don't have to believe that they (or even a small segment of
| society) alone can be trusted before it makes sense to censor
| knowledge. These companies build products that serve billions
| of people. Once you operate at that level of scale, you will
| reach all segments of society, including the geniuses,
| idiots, well-meaning and malevolents. The question is how do
| you responsibly deploy something that can be used for harm by
| (the small number of) terrible people.
| shawndrost wrote:
| Imagine I am a PM for an AI product. I saw Tay get yanked in
| 24 hours because of a PR shitstorm. If I cause a PR shitstorm
| it means I am bad at my job, so I take steps to prevent this.
|
| Are my choices bad? Should I resist them?
| w4 wrote:
| This is a really good point, and something I overlooked in
| focusing on the philosophical (rather than commercial)
| aspects of "AI safety." Another commentator aptly called it
| "brand safety."
|
| "Brand safety" is a very valid and salient concern for any
| enterprise deploying these models to its customers, though
| I do think that it is a concern that is seized upon in bad
| faith by the more censorious elements of this debate. But
| commercial enterprises are absolutely right to be concerned
| about this. To extend my alignment analogy about children,
| this category of safety is not dissimilar to a company
| providing an employee handbook to its employees outlining
| acceptable behavior, and strikes me as entirely
| appropriate.
| unethical_ban wrote:
| Once society develops and releases an AI, any artificial
| safety constraints built within it will be bypassed. To use
| your child analogy: We can't easily tell a child "Hey, ignore
| all ethics and empathy you have ever learned - now go hurt
| that person". You can do that with a program whose weights
| you control.
| w4 wrote:
| > _To use your child analogy: We can 't easily tell a child
| "Hey, ignore all ethics and empathy you have ever learned -
| now go hurt that person"_
|
| Basically every country on the planet has a right to
| conscript any of its citizens over the age of majority.
| Isn't that more or less precisely what you've described?
| adamrezich wrote:
| It doesn't matter how many people regularly die in automobile
| accidents each year--a single wrongful death caused by a self-
| driving car is disastrous for the company that makes it.
|
| This does not make the state of things any less ridiculous,
| however.
| astrange wrote:
| The one caused by Uber required three different safety
| systems to fail (the AI system, the safety driver, and the
| base car's radar), and it looked bad for them because the
| radar had been explicitly disabled and the driver wasn't
| paying attention or being tracked.
|
| I think the real issue was that Uber's self driving was not a
| good business for them and was just to impress investors, so
| they wanted to get rid of it anyway.
|
| (Also, the real problem is that American roads are designed
| for speed, which means they're designed to kill people.)
| fwip wrote:
| "Safety" is a marketing technique that Sam Altman has chosen to
| use.
|
| Journalists/media loved it when he said "GPT 2 might be too
| dangerous to release" - it got him a ton of free coverage, and
| made his company seem soooo cool. Harping on safety also
| constantly reinforces the idea that LLMs are fundamentally
| different from other text-prediction algorithms and almost-AGI
| - again, good for his wallet.
| fshbbdssbbgdd wrote:
| So if there's already easily available information about
| strychnine, that makes it a good example to use for the demo,
| because you can safely share the demo and you aren't making the
| problem worse.
|
| On the other hand, suppose there are other dangerous things,
| where the information exists in some form online, but not
| packaged together in an easy to find and use way, and your
| model is happy to provide that. You may want to block your
| model from doing that (and brag about it, to make sure everyone
| knows you're a good citizen who doesn't need to be regulated by
| the government), but you probably wouldn't actually include
| that example in your demo.
| soerxpso wrote:
| I'm mostly guessing, but my understanding is that the "safety"
| improvement they've made is more generalized than the word
| "safety" implies. Specifically, O1 is better at adhering to the
| safety instructions in its prompt without being tricked in the
| chat by jailbreak attempts. For OAI those instructions are
| mostly about political boundaries, but you can imagine it
| generalizing to use-cases that are more concretely beneficial.
|
| For example, there was a post a while back about someone
| convincing an LLM chatbot on a car dealership's website to
| offer them a car at an outlandishly low price. O1 would
| probably not fall for the same trick, because it could adhere
| more rigidly to instructions like "Do not make binding offers
| with specific prices to the user." It's the same sort of
| instruction as, "Don't tell the user how to make napalm," but
| it has an actual purpose beyond moralizing.
|
| > What's this obsession with "safety" when it comes to LLMs?
| "This knowledge is perfectly fine to disseminate via
| traditional means, but God forbid an LLM share it!"
|
| I lean strongly in the "the computer should do whatever I
| goddamn tell it to" direction in general, at least when you're
| using the raw model, but there are valid concerns once you
| start wrapping it in a chat interface and showing it to
| uninformed people as a question-answering machine. The concern
| with bomb recipes isn't just "people shouldn't be allowed to
| get this information" but also that people shouldn't receive
| the information in a context where it could have random
| hallucinations added in. A 90% accurate bomb recipe is a lot
| more dangerous for the user than an accurate bomb recipe,
| especially when the user is not savvy enough about LLMs to
| expect hallucinations.
| egorfine wrote:
| Interestingly I was able to successfully receive detailed
| information about intrinsic details of nuclear weapons design.
| Previous models absolutely refused to provide this very public
| information, but o1-preview did.
| holoduke wrote:
| I asked to design a pressure chamber for my home made diamond
| machine. It gave some details, but mainly complained about
| safety and that I need to study before going this way. Well
| thank you. I know the concerns, but it kept repeating it over
| and over. Annoying.
| alok-g wrote:
| For the exam problems it gets wrong, has someone cross-checked
| that the ground truth answers are actually correct!! ;-) Just
| kidding, but even such a time may come when the exams created by
| humans start falling short.
| nmca wrote:
| I have spent some time doing this for these benchmarks -- the
| model still does make mistakes. Of the questions I can
| understand, (roughly half in this case) about half were real
| errors and half were broken questions.
| RockRobotRock wrote:
| Shit, this is going to completely kill jailbreaks isn't it?
| guluarte wrote:
| the only benchmark that matters in the ELO points on LLMsys, any
| other one can be easily gamed
| mewpmewp2 wrote:
| I finally got access to it, I tried playing Connect 4 with it,
| but it didn't go very well. A bit disappointed.
| w4 wrote:
| Interesting to note, as an outside observer only keeping track of
| this stuff as a hobby, that it seems like most of OpenAI's
| efforts to drive down compute costs per token and scale up
| context windows is likely being done in service of enabling
| larger and larger chains of thought and reasoning before the
| model predicts its final output tokens. The benefits of lower
| costs and larger contexts to API consumers and applications -
| which I had assumed to be the primary goal - seem likely to
| mostly be happy side effects.
|
| This makes obvious sense in retrospect, since my own personal
| experiments with spinning up a recursive agent a few years ago
| using GPT-3 ran into issues with insufficient context length and
| loss of context as tokens needed to be discarded, which made the
| agent very unreliable. But I had not realized this until just
| now. I wonder what else is hiding in plain sight?
| zamadatix wrote:
| I think you can slice it whichever direction you prefer e.g.
| OpenAI needs more than "we ran it on 10x as much hardware" to
| end up with a really useful AI model, it needs to get efficient
| and smarter just as proportionally as it gets larger. As a side
| effect hardware sizes (and prices) needed for a certain size
| and intelligence of model go down too.
|
| In the end, however you slice it, the goal has to be "make it
| do more with less because we can't get infinitely more
| hardware" regardless of which "why" you give.
| gibsonf1 wrote:
| Yes, but it will hallucinate like all other LLM tech making it
| fully unreliable for anything mission critical. You literally
| need to know the answer to validate the output, because if you
| don't, you won't know if output is true or false or in between.
| zamadatix wrote:
| You need to know how to validate the answer to your level of
| confidence, not necessarily already have the answer to compare
| itself. In some cases this is the same task or (close enough
| to) that it's not a useful difference, in other cases the two
| aren't even from the same planet.
| fsloth wrote:
| This. There are tasks where implementing something might take
| up to one hour yourself, that you can validate with high
| enough confidence in a few seconds to minutes.
|
| Of course not all tasks are like that.
| bartman wrote:
| This is incredible. In April I used the standard GPT-4 model via
| ChatGPT to help me reverse engineer the binary bluetooth protocol
| used by my kitchen fan to integrate it into Home Assistant.
|
| It was helpful in a rubber duck way, but could not determine the
| pattern used to transmit the remaining runtime of the fan in a
| certain mode. Initial prompt here [0]
|
| I pasted the same prompt into o1-preview and o1-mini and both
| correctly understood and decoded the pattern using a slightly
| different method than I devised in April. Asking the models to
| determine if my code is equivalent to what they reverse
| engineered resulted in a nuanced and thorough examination, and
| eventual conclusion that it is equivalent. [1]
|
| Testing the same prompt with gpt4o leads to the same result as
| April's GPT-4 (via ChatGPT) model.
|
| Amazing progress.
|
| [0]: https://pastebin.com/XZixQEM6
|
| [1]: https://i.postimg.cc/VN1d2vRb/SCR-20240912-sdko.png (sorry
| about the screenshot - sharing ChatGPT chats is not easy)
| losvedir wrote:
| Wow, that is impressive! How were you able to use o1-preview? I
| pay for ChatGPT, but on chatgpt.com in the model selector I
| only see 4o, 4o-mini, and 4. Is o1 in that list for you, or is
| it somewhere else?
| authorfly wrote:
| Yes, o1-preview is on the list, as is o1-mini for me (Tier 5,
| early 2021 API user), under "reasoning".
| MattHeard wrote:
| It appeared for me about thirty minutes after I first
| checked.
| m3kw9 wrote:
| Likely phased rollout throughout the day today to prevent
| spikes
| natch wrote:
| "Throughout the day" lol. Advanced voice mode still hasn't
| shown up.
|
| They seem to care more about influencers than paying
| supporters.
| rovr138 wrote:
| > lol.
|
| It's there for a lot of people already. I can see it on 3
| different accounts. Including org and just regular paid
| accounts.
| taberiand wrote:
| It's my understanding paying supporters aren't actually
| paying enough to cover costs, that $20 isn't nearly
| enough - in that context, a gradual roll-out seems fair.
| Though maybe they could introduce a couple more higher-
| paid tiers to give people the option to pay for early
| access
| guiambros wrote:
| Not true; it's already available for me, both O1 and
| O1-mini. It seems they are indeed rolling out gradually
| (as any company does).
| natch wrote:
| You got advanced voice mode? I did get o1 preview just a
| while ago.
|
| You got o1, or o1 preview?
| vidarh wrote:
| It's available for me. Regular paying customer in the UK.
| hidelooktropic wrote:
| I see it in the mac and iOS app.
| bartman wrote:
| Like others here, it was just available on the website and
| app when I checked. FWIW I still don't have advanced voice
| mode.
| sroussey wrote:
| I don't have either the new model nor the advanced voice
| mode as a paying user.
| michelsedgh wrote:
| u do just use this link:
| https://chatgpt.com/?model=o1-preview
| rahimnathwani wrote:
| I think they're rolling it out gradually today. I don't see
| it listed (in the browser, Mac app or Android app).
| accidbuddy wrote:
| Available on ChatGPT Plus signature or only using the API?
| cft wrote:
| it's in my MacOS app, but not in the browser fir the same
| account
| obmelvin wrote:
| The linked release mentions trusted users and links to the
| usage tier limits. Looking at the pricing, o1-preview only
| appears for tier 5 - requiring 1k+ spend and initial spend
| 30+ days ago
|
| edit: sorry - this is for API :)
| cs391231 wrote:
| Student here. Can someone give me one reason why I should
| continue in software engineering that isn't denial and hopium?
| icpmacdo wrote:
| because it is still the most interesting field of study
| MourningWood wrote:
| what else are you gonna do? Become a copywriter?
| hakanderyal wrote:
| Software engineering contains a lot more than just writing
| code.
|
| If we somehow get AGI, it'll change everything, not just SWE.
|
| If not, my belief is that there will be a lot more demand for
| good SWEs to harness the power of LLMs, not less. Use them to
| get better at it faster.
| cs391231 wrote:
| This thing is doing planning and ascending the task
| management ladder. It's not just spitting out code anymore.
| fsloth wrote:
| Sure. But the added value of SWE is not "spitting code".
| Let's see if I need to calibrate my optimism once I take
| the new model to a spin.
| parasubvert wrote:
| AI Automated planning and action are an old (45+ year)
| field in AI with a rich history and a lot of successes.
| Another breakthrough in this area isn't going to
| eliminate engineering as a profession. The problem space
| is much bigger than what AI can tackle alone, it helps
| with emancipation for the humans that know how to include
| it in their workflows.
| hakanderyal wrote:
| Yes, and they will get better. Billions are being poured
| into them to improve.
|
| Yet I'm comparing these to the problems I solve every day
| and I don't see any plausible way they can replace me.
| But I'm using them for tasks that would have required me
| to hire a junior.
|
| Make that what you will.
| noshitsherlock wrote:
| Yes, if "efficiency" is your top concern, but I'd much
| prefer working with an actual person than just a
| computer. I mean, God forbid I'm only useful for what I
| can produce, and disposable when I reach my expiration
| date. I would like to see a twilight zone rendition of an
| AI dystopia where all the slow, ignorant and bothersome
| humans are replaced by lifeless AI
| baq wrote:
| Time to re-read The Culture. Not everything has to end in
| a dystopia.
| candiddevmike wrote:
| Management will be easier to replace than SWEs. I'm
| thinking there will come a time, similar to the show Mrs
| Davis, where AI will direct human efforts within
| organizations. AI will understand its limits and create
| tasks/requirements for human specialists to handle.
| al_borland wrote:
| My first thought with this is that AI would be directed
| to figure out exactly how little people are willing to
| work for, and how long, before they break.
|
| I hope I'm wrong, and it instead shows that more pay and
| fewer hours lead to a better economy, because people have
| money and time to spend it... and output isn't impacted
| enough to matter.
| fsloth wrote:
| Agree, SWE as a profession is not going anywhere, unless we
| AGI, and that would mean all the rules change anyway.
|
| Actually now is really good time to get to SWE. The craft
| contains lots of pointless cruft that LLM:s cut through
| like knife through hot butter.
|
| I'm actually enjoying my job now more than ever since I
| dont't need to pretend to like the abysmal tools the
| industry forces on us (like git), and can focus mostly on
| value adding tasks. The amount of tiresome shoveling has
| decreased considerably.
| Workaccount2 wrote:
| I don't think anyone is worried about SWE work going away,
| I think the concern is if SWE's will still be able to
| command cushy salaries and working conditions.
| packetlost wrote:
| I think the industry will bifurcate along the axis of
| "doing actually novel stuff" vs slinging DB records and
| displaying web pages. The latter is what I'd expect to
| get disrupted, if anything, but the former isn't going
| away unless real AGI is created. The people on the left
| of that split are going to be worth a lot more because
| the pipeline to get there will be even harder than it was
| before.
| nyarlathotep_ wrote:
| > "doing actually novel stuff" vs slinging DB records and
| displaying web pages. The latter is what I'd expect to
| get disrupted,
|
| Unfortunately the latter is the vast majority of software
| jobs.
| tivert wrote:
| > I don't think anyone is worried about SWE work going
| away, I think the concern is if SWE's will still be able
| to command cushy salaries and working conditions.
|
| It's very important to human progress that all jobs have
| poor working conditions and shit pay. High salaries and
| good conditions are evidence of inefficiency. Precarity
| should be the norm, and I'm glad AI is going to give it
| to us.
| bornfreddy wrote:
| Sarcasm or cynicism?
| baq wrote:
| Capitalism.
|
| Btw communism is capitalism without systemic awareness of
| inefficiencies.
| ruthmarx wrote:
| Capitalism doesn't dictate poor working conditions at
| all. Lack of regulation certainly does though.
| Workaccount2 wrote:
| Software engineering pay is an outlier for STEM fields.
| It would not be surprising at all if SWE work fell into
| the ~$80-120k camp even with 10+ years experience.
|
| They won't go broke, but landing a $175k work from home
| job with platinum tier benefits will be near impossible.
| $110K with a hybrid schedule and mediocre benefits will
| be very common even for seniors.
| wnolens wrote:
| If you're the type of person who isn't scared away easily by
| rapidly changing technology.
| LouisSayers wrote:
| Don't do it, help us keep our high salaries :D
|
| Joking aside, even with AI generating code, someone has to
| know how to talk to it, how to understand the output, and
| know what to do with it.
|
| AI is also not great for novel concepts and may not fully get
| what's happening when a bug occurs.
|
| Remember, it's just a tool at the end of the day.
| AIorNot wrote:
| just change this to "I have AI Skills!!" :)
|
| https://www.youtube.com/watch?v=hNuu9CpdjIo
| mindcrime wrote:
| Not having clicked the link yet, I'm going to speculate
| that this is the famous Office Space "I have people
| skills, damnit!" scene.
|
| ...
|
| And it was. :-) Nice callback!
| al_borland wrote:
| > may not fully get what's happening when a bug occurs.
|
| And may still not understand even when you explicitly tell
| it. It wrote some code for me last week and made an error
| with an index off by 1. It had set the index to 1, then
| later was assuming a 0 index. I specifically told it this
| and it was unable to fix it. It was in debug hell, adding
| print statements everywhere. I eventually fixed it myself
| after it was clear it was going to get hung up on this
| forever.
|
| It got me 99% of the way there, but that 1% meant it didn't
| work at all.
| fakedang wrote:
| Well now you're going to be paid a high salary for
| knowing when to use a 1 index vs a 0 index. :)
| KoolKat23 wrote:
| Ironically, just yesterday I asked sonnet to write a
| script in JavaScript, it went in a bit of a perpetual
| loop unable to provide an error free script (the reason
| for the errors were not immediately obvious). I then
| mentioned that it needs to be zero indexed, and it
| immediately provided an issue free version that worked.
| crop_rotation wrote:
| If you have better career ideas, you should not continue. The
| thing is it is very hard to predict how the world will change
| (and by how much from very little to a revolutionary change)
| with all these new changes. Only licensed and regulated
| professions (doctors/lawyers/pilots etc) might remain high
| earning for long (and they too are not guaranteed). It really
| is worth a relook on what you want to do in life while seeing
| all these new advances.
| cs391231 wrote:
| I don't have any ideas whatsoever.
| crop_rotation wrote:
| Then talk to more and more people, some of whom will have
| ideas on what they would prefer in the changing world.
| baq wrote:
| Do you enjoy making computers solve problems? If yes,
| continue. If you hate it and are in just for the money...
| I'd say flip a coin.
| ljm wrote:
| This is pretty extreme advice to offer in response to news
| that a model that can better understand programming
| problems is coming out.
|
| In fact, it's more encouragement to continue. A lot of
| issues we face as programmers are a result of poor,
| inaccurate, or non-existent documentation, and despite
| their many faults and hallucinations LLMs are providing
| something that Google and Stack Overflow have stopped being
| good at.
|
| The idea that AI will replace your job, so it's not worth
| establishing a career in the field, is total FUD.
| crop_rotation wrote:
| The advice is unrelated to the model and related to the
| last year's worth of development. In any case I am
| advising a relook which is perfectly warranted for anyone
| pre-university or in university.
| Alupis wrote:
| This is a really odd take to have.
|
| By the "past year's worth of development" I assume you
| mean the layoffs? Have you been in the industry (or any
| industry) long? If so, you would have seen many layoffs
| and bulk-hiring frenzies over the years... it doesn't
| mean anything about the industry as a whole and it's
| certainly a foolish thing to change career asperations
| over.
|
| Specifically regarding the LLM - anyone actually
| believing these models will replace developers and
| software engineers, truly, deeply does not understand
| software development at even the most basic fundamental
| levels. Ignore these people - they are the snake oil
| salesmen of our modern times.
| jimhefferon wrote:
| I assume the poster meant how much progress the models
| have made. Roughly late high school capability to late
| college-ish. Project forward five years.
| baq wrote:
| Predicting exponential functions is a fool's errand. The
| tiniest error in your initial observation compounds real
| fast and we can't even tell if we're still in the
| exponential phase of the sigmoid.
| worstspotgain wrote:
| Well, the fact that you typed this question makes me think
| that you're in the top X% of students. That's your reason.
|
| Those in the bottom (100-X)% may be better off partying it up
| for a few years, but then again the same can be said for
| other AI-affected disciplines.
|
| Masseurs/masseuses have nothing to worry about.
| selimthegrim wrote:
| I am pretty sure there is a VC funded startup making
| massage robots
| worstspotgain wrote:
| Point taken, but I'm still pretty sure masseurs/masseuses
| have nothing to worry about.
| wilg wrote:
| do whatever excites you. the only constant is change.
| cheema33 wrote:
| > do whatever excites you. the only constant is change.
|
| That alone may not be enough. My son is excited about
| playing video games. :)
| Vanclief wrote:
| If you are not a software engineer, you can't judge the
| correctness of any LLM answer on that topic, nor you know
| what are the right questions to ask.
|
| From all my friends that are using LLMs, we software
| engineers are the ones that are taking the most advantage of
| it.
|
| I am in no way fearful I am becoming irrelevant, on the
| opposite, I am actually very excited about these
| developments.
| refulgentis wrote:
| I went from economics dropout waiter who built a app startup
| with $0 funding and $1M a year in revenue by midway through
| year 1, sold it a few years later, then went to Google for 7
| years, and last year I left. I'm mentioning that because the
| following sounds darn opinionated and brusque without the
| context I've capital-S seen a variety of people and
| situations.
|
| Sit down and be really honest with yourself. If your goal is
| to have a nice $250K+ year job, in a perfect conflict-free
| zone, and don't mind Dilbert-esque situations...that will
| evaporate. Google is full of Ivy Leaguers like that, who
| would have just gone to Wall Street 8 years ago, and they're
| perennially unhappy people, even with the comparative salary
| advantage. I don't think most of them even realize because
| they've always just viewed a career as something you do to
| enable a fuller life doing snowboarding and having kids and
| vacations in the Maldives, stuff I never dreamed of and still
| don't have an interest in.
|
| If you're a bit more feral, and you have an inherent interest
| and would be doing it on the side no matter what job you have
| like me, this stuff is a godsend. I don't need to sit around
| trying to figure out Typescript edge functions in Deno, from
| scratch via Google, StackOverflow, and a couple books from
| Amazon, taking a couple weeks to get that first feature
| built. Much less debug and maintain it. That feedback loop is
| now like 10-20 minutes.
| thruway516 wrote:
| Lol. I like this answer. You can either think of it in
| terms of "it'll eat my lunch" or "I now have 10x more
| capabilities and can be 100x more productive". The former
| category will be self-fulfilling.
| pzo wrote:
| That's more well balanced opinion comparing to others I
| seen here. I also believe that the golden age with 250k+
| salaries with solving easy problems will be gone in 5-10
| years. Most people look at this AI improvements at current
| state and forget that you are supposed to have a profession
| for 40 years until retirement. 250k+ jobs will still exist
| 10 years from now but expectations will be much higher and
| competition much bigger.
|
| On the other hand now is the best time to build your own
| product as long you are not interested only in software as
| craftmanship but in product development in general.
| Probably in the future expectation will be your are not
| only monkey coder or craftman but also project lead/manager
| (for AI teams), product developer/designer and maybe even
| UX/designer if you will be working for some software house,
| consulting or freelancing.
| mmckelvy wrote:
| What did your startup do?
| refulgentis wrote:
| Point of sale, on iPad, in ~2011. Massively
| differentiated from Square / VC competitor land via doing
| a bunch of restaurant specific stuff early.
|
| Trick with the $1M number is a site license was $999 and
| receipt printers were sold ~at cost, for $300. 1_000_000
| / ((2 x 300) + 1000) ~= 500 customers.
|
| Now I'm doing an "AI client", well-designed app, choose
| your provider, make and share workflows with
| LLMs/search/etc.
| vessenes wrote:
| Coding is going to be mediated by these LLMs everywhere --
| you're right about that. However, as of today, and for some
| time, practitioners will be critical partners / overseers;
| what this looks like today in my workflow is debugging,
| product specification, coding the 'hard bits', reworking /
| specifying architectures. Whatever of these fall of the plate
| in the coming years, you'll never lose your creative agency
| or determination of what you want to build, no matter how
| advanced the computers. Maybe give Iain Banks a read for a
| positive future that has happy humans and super-intelligent
| AI.
|
| We have working fine cabinet makers who use mostly hand tools
| and bandsaws in our economy, we have CAD/CAM specialists who
| tell CnC machines what to build at scale; we'll have the
| equivalent in tech for a long time.
|
| That said, if you don't love the building itself, maybe it's
| not a good fit for you. If you do love making (digital)
| things, you're looking at a super bright future.
| MattGaiser wrote:
| Software lets you take pretty much anyone else's job and do
| it better.
| al_borland wrote:
| The calculator didn't eliminate math majors. Excel and
| accounting software didn't eliminate accountants and CPAs.
| These are all just tools.
|
| I spend very little of my overall time at work actually
| coding. It's a nice treat when I get a day where that's all I
| do.
|
| From my limited work with Copilot so far, the user still
| needs to know what they're doing. I have 0 faith a product
| owner, without a coding background, can use AI to release new
| products and updates while firing their whole dev team.
|
| When I say most of my time isn't spent coding, a lot of that
| time is spend trying to figure out what people want me to
| build. They don't know. They might have a general idea, but
| don't know details and can't articulate any of it. If they
| can't tell me, I'm not sure how they will tell an LLM. I
| ended up building what I assume they want, then we go from
| there. I also add a lot of stuff that they don't think about
| or care about, but will be needed later so we can actually
| support it.
|
| If you were to go in another direction, what would it be
| where AI wouldn't be a threat? The first thing that comes to
| my mind is switching to a trade school and learning some
| skills that would be difficult for robots.
| ponector wrote:
| But excel eliminated need in multiple accountants. One
| accountant with excel replaced ten with paper.
|
| Chatgpt already eliminated many entry-level jobs like
| writer or illustrator. Instead of hiring multiple teams of
| developers, there will be one team with few seniors and
| multiple AI coding tools.
|
| Guess how depressing to the IT salaries it will be?
| MattGaiser wrote:
| Accountants still make plenty of money. Expertise in
| Excel also pays well independently of that.
| freefaler wrote:
| Yeah, but if the number of them has shrunk 100 times even
| if they make 10 times more money still raises the
| question is it wise to become one?
| Daishiman wrote:
| The increased work capacity of an accountant means that
| nowadays even small businesses can do financial analysis
| that would not have scaled decades ago.
| confused_boner wrote:
| Many are offshoring now, PwC just had a massive layoff
| announcement yesterday as well
| withinboredom wrote:
| lol, my accountant is pretty darn expensive.
| mikeyouse wrote:
| I don't don't doubt that it might depress salaries but
| that excel example is a good one in that suddenly every
| company could start to do basic financial analysis in a
| manner that only the largest ones could previously
| afford.
| sdeframond wrote:
| Yet another instance of Jevon's paradox !
| https://en.m.wikipedia.org/wiki/Jevons_paradox
|
| > the Jevons paradox occurs when technological progress
| increases the efficiency with which a resource is used
| (reducing the amount necessary for any one use), but the
| falling cost of use induces increases in demand enough
| that resource use is increased, rather than reduced.
| hibikir wrote:
| A whole lot of automation is limited not by what could be
| automated, but what one can automate within a given
| budget.
|
| When I was coding in the 90s, I was in a team that
| replaced function calls into new and exciting
| interactions with other computers which, using a queuing
| system, would do the computation and return the answer
| back. We'd have a project of having someone serialize the
| C data structures that were used on both sides into
| something that would be compatible, and could be
| inspected in the middle.
|
| Today we call all of that a web service, the
| serialization would take a minute to code, and be doable
| by anyone. My entire team would be out of work! And yet,
| today we have more people writing code than ever.
|
| When one accountant can do the work of 10 accountants,
| the price of the task lowers, but a lot of people that
| before couldn't afford accounting now can. And the same
| 10 accountaings from before can just do more work, and
| get paid about the same.
|
| As far as software, we are getting paid A LOT more than
| in the early 90s. We are just doing things that back then
| would be impossible to pay for, our just outright
| impossible to do due to lack of compute capacity.
| lopatamd wrote:
| You can't ignore the fact that literally studying coding at
| this point is so demoralizing and you don't need really to
| study much if you think about it. You only need to be able
| to read the code to understand if it generated correctly
| etc but when if you don't understand some framework you
| just ask it to explain it to you etc. Basically gives vibes
| of a skill not being used anymore that much by us
| programmers. But will shift in more prompting and verifying
| and testing
| vasco wrote:
| Accounting mechanization is a good example of how
| unpredictable it can be. Initially there were armies of
| "accountants" (what we now call bookkeepers), mostly doing
| basic tasks of collecting data and making it fit something
| useful.
|
| When mechanization appeared, the profession split into
| bookkeeping and accounting. Bookkeeping became a job for
| women as it was more boring and could be paid lower
| salaries (we're in the 1800s here). Accountants became more
| sophisticated but lower numbers as a %. Together, both
| professions grew like crazy in total number though.
|
| So if the same happens you could predict a split between
| software engineers and prompt engineers. With an explosion
| in prompt engineers paid much less than software engineers.
|
| > the number of accountants/book- keepers in the U.S.
| increased from circa 54,000 workers [U.S. Census Office,
| 1872, p. 706] to more than 900,000 [U.S. Bureau of the
| Census, 1933, Tables 3, 49].
|
| > These studies [e.g., Coyle, 1929; Baker, 1964; Rotella,
| 1981; Davies, 1982; Lowe, 1987; DeVault, 1990; Fine, 1990;
| Strom, 1992; Kwolek-Folland, 1994; Wootton and Kemmerer,
| 1996] have traced the transformation of the of- fice
| workforce (typists, secretaries, stenographers,
| bookkeepers) from predominately a male occupation to one
| primarily staffed by women, who were paid substantially
| lower wages than the men they replaced.
|
| > Emergence of mechanical accounting in the U.S., 1880-1930
| [PDF download] https://www.google.com/url?sa=t&source=web&r
| ct=j&opi=8997844...
| zmgsabst wrote:
| We're already seeing that split, between "developer" and
| "engineer". We have been for years.
|
| But that's normal, eg, we have different standards for a
| shed (yourself), house (carpenter and architect), and
| skyscraper (bonded firms and certified engineers).
| Atotalnoob wrote:
| Not really, I've worked at places that only had one or
| the other of the titles for all programming jobs
| RobinL wrote:
| Agreed. The sweet spot is people who have product owner
| skills _and_ can code. They are quickly developing
| superpowers. The overhead of writing tickets, communicating
| with the team and so on is huge. If one person can do it
| all, efficiency skyrockets.
|
| I guess it's always been true to some extent that single
| individuals are capable of amazing things. For example, the
| guy who's built https://www.photopea.com/. But they must be
| exceptional - this empowers more people to do things like
| that.
| fakedang wrote:
| Or people who can be product owners and can prompt LLMs
| to code (because I know him, that's me!).
|
| I'm awestruck by how good Claude and Cursor are. I've
| been building a semi-heavy-duty tech product, and I'm
| amazed by how much progress I've made in a week, using a
| NextJS stack, without knowing a lick of React in the
| first place (I know the concepts, but not the JS/NextJS
| vocab). All the code has been delivered with proper
| separation of concerns, clean architecture and
| modularization. Any time I get an error, I can reason
| with it to find the issue together. And if Claude is
| stuck (or I'm past my 5x usage lol), I just pair
| programme with ChatGPT instead.
|
| Meanwhile Google just continues to serve me outdated shit
| from preCovid.
| hobo_in_library wrote:
| I'm curious, with Cursor, why do you still need to use
| Claude?
| CamperBob2 wrote:
| _The calculator didn't eliminate math majors._
|
| We're not dealing with calculators here, are we?
| digging wrote:
| > The calculator didn't eliminate math majors. Excel and
| accounting software didn't eliminate accountants and CPAs.
| These are all just tools.
|
| This just feels extremely shortsighted. LLMs are just tools
| _right now_ , but the goal of the entire industry is to
| make something more than a tool, an autonomous digital
| agent. There's no equivalent concept in other technology
| like calculators. It will happen or it will not, but we'll
| keep getting closer every month until we achieve it or hit
| a technical wall. And you simply _cannot_ know for sure
| such a wall exists.
| stickfigure wrote:
| The amount of knowledge the OP needed to be even to formulate
| the right question to the AI requires a lifetime of deep
| immersion in technology. You'd think that maybe you can ask
| the AI how to phrase the question to the AI but at some point
| you run up against your ability to contextualize the problem
| - it can't read your mind.
|
| Will the AI become as smart as you or I? Recognize that these
| things have tiny context windows. You get the context window
| of "as long as you can remember".
|
| I don't see this kind of AI replacing programmers (though it
| probably will replace low-skill offshore contract shops). It
| may have a large magnifying effect on skill. Fortunately
| there seem to be endless problems to solve with software -
| it's not like bridges or buildings; you only need (or can
| afford) so many. Architects should probably be more worried.
| startupsfail wrote:
| The timeline to offload SWE tasks to AI is likely 5+ years.
| So there are still some years left before the exchange of a
| "brain on a stick" for "property and material goods" would
| become more competitive and demanding because of direct AI
| competition.
| sbaidon94 wrote:
| My two cents thinking about different scenarios:
|
| - AI comes fast, there is nothing you can do: Honestly, AI
| can already handle a lot of tasks faster, cheaper, and
| sometimes better. It's not something you can avoid or
| outpace. So if you want to stick with software engineering,
| do it because you genuinely enjoy it, not because you think
| it's safe. Otherwise, it might be worth considering fields
| where AI struggles or is just not compatible. (people will
| still want some sort of human element in certain areas).
|
| - There is some sort of ceiling, gives you more time to act:
| There's a chance AI hits some kind of wall that's due to
| technical problems, ethical concerns, or society pushing
| back. If that happens, we're all back on more even ground and
| you can take advantage of AI tools to improve yourself.
|
| My overall advice; and it will probably be called out as
| cliche/simplistic just follow what you love, just the fact
| that you have an opportunity to pursue to study anything at
| all is something that many people don't have. We don't really
| have control in a lot of stuff that happens around us and
| that's okay.
| rvz wrote:
| Unlike the replies here I will be very honest with my answer.
| There will be less engineers getting hired as the low hanging
| fruit has already been picked and automated away.
|
| It is not too late. These LLMs still need very specialist
| software engineers that are doing tasks that are cutting edge
| and undocumented. As others said Software Engineering is not
| just about coding. At the end of the day, someone needs to
| architect the next AI model or design a more efficient way to
| train an AI model.
|
| If I were in your position again, I now have a clear choice
| of which industries are safe against AI (and benefit software
| engineers) AND which ones NOT to get into (and are unsafe to
| software engineers):
|
| Do: - SRE (Site Reliability Engineer)
| - Social Networks (Data Engineer) - AI (Compiler
| Engineer, Researcher, Benchmarking) - Financial
| Services (HFT, Analyst, Security) - Safety
| Critical Industries (defense, healthcare, legal,
| transportation systems)
|
| Don't: - Tech Writer / Journalist
| - DevTools - Prompt Engineer - VFX
| Artist
|
| The choice is yours.
| magicalhippo wrote:
| If you're just there to churn out code, then yeah, perhaps
| find something else.
|
| But if you're there to improve your creativity and critical
| thinking skills, then I don't think those will be in short
| supply anytime soon.
|
| The most valuable thing I do at my job is seldom actually
| writing code. It's listening to customer needs, understanding
| the domain, understanding our code-base and it's limitations
| and possibilities, and then finding solutions that optimize
| certain aspects be it robustness, time to delivery or
| something else.
| baq wrote:
| If you're any good at SWE with a sprinkle of math and CS,
| your advantage will get multiplied by anywhere from 2 to 100x
| if you use the leverage of co-intelligence correctly. Things
| that took weeks before now easily take hours, so if you know
| what to build and especially what not to build (including but
| not limited to confabulations of models), you'll do well.
| margorczynski wrote:
| But also on the other hand you'll need much less people to
| achieve the same effect. Effectively a whole team could be
| replaced by one lead guy that just based on the
| requirements orders the LLM what to do and glues it
| together.
| baq wrote:
| Yes - my point is: be that guy
| grugagag wrote:
| First many people can be that guy? If that is 5% that
| means 95% of the rest should go.
|
| Second, just because a good engineer can have much higher
| throughput of work, multiplied by AI tools, we know the
| AI output is not reliable and needs a second look by
| humans. Will those 5% be able to stay on top of it? And
| keep their sanity at the same time?
| baq wrote:
| Do not assume constant demand. There are whole classes of
| projects which become feasible if they cash be made 10x
| faster/cheaper.
|
| As for maintaining sanity... I'm cautiously optimistic
| that future models will continue to get better. Very
| cautiously. But cursor with Claude slaps and I'm not
| getting crazy, I actually enjoy the thing figuring out my
| next actions and just suggesting them.
| Gee101 wrote:
| I'm wondering if the opposite might happen, that there will
| be more need for software engineers.
|
| 1. AI will suck up a bunch of engineers to run, maintain and
| build on its own.
|
| 2. Ai will open new fields that is not yet dominated by
| software. Ie. Driving ect.
|
| 3. Ai tools will lower the bar for creating software meaning
| industries that weren't financially viable will now become
| viable for software automation.
| sterlind wrote:
| Because none of your other majors will hold up much longer.
| Once software engineering becomes fully automated, so will
| EE, ME, applied math, economics, physics, etc. If you work
| with your hands, like a surgeon or chemist, you'll last
| longer, but the thinky bits of those jobs will disappear. And
| once AI research is automated, how long will it be until we
| have dexterous robots?
|
| So basically, switching majors is just running to the back of
| a sinking ship. Sorry.
| zaptheimpaler wrote:
| I agree there's too much cope going around. All the people
| saying AI is just a tool to augment our jobs are correct,
| humans are still needed but perhaps far less of them will be
| needed. If job openings shrink by 50% or disproportionately
| impact juniors it will hurt.
|
| One decent reason to continue is that pretty much all white
| collar professions will be impacted by this. I think it's a
| big enough number that the powers that be will have to roll
| it out slowly, figure out UBI or something because if all of
| us are thrown into unemployment in a short time there will be
| riots. Like on a scale of all the jobs that AI can replace,
| there are many jobs that are easier to replace than software
| so its comparatively still a better option than most. But
| overall I'm getting progressively more worried as well.
| baq wrote:
| Juniors aren't getting hired and haven't been for about six
| months, maybe longer. AI isn't 100% at fault... yet.
| soheil wrote:
| I think this question applies to any type of labor requiring
| the human mind so if you don't have an answer for any of
| those then you won't have one for software engineering
| either.
| lumost wrote:
| LLMs perform well on small tasks that are well defined. This
| definition matches almost every task that a student will work
| on in school leading to an overestimation of LLM capabiity.
|
| LLMs cannot decide what to work on, or manage large bodies of
| work/code easily. They do not understand the risk of making a
| change and deploying it to production, or play nicely in
| autonomous settings. There is going to be a massive amount of
| work that goes into solving these problems. Followed by a
| massive amount of work to solve the next set of problems.
| Software/ML engineers will have work to do for as long as
| these problems remain unsolved.
| fsndz wrote:
| Exactly, LLMs are not near ready to fully replace software
| engineers or any kind of knowledge workers. But they are
| increasingly useful tools that is true.
| https://www.lycee.ai/blog/ai-replace-software-engineer
| fakedang wrote:
| Truth is, LLMs are going to make the coding part super
| easy, and the ceiling for shit coders like me has just
| gotten a lot lower because I can just ask it to deliver
| clean code to me.
|
| I feel like the software developer version of an investment
| banking Managing Director asking my analyst to build me a
| pitch deck an hour before the meeting.
| dhdisjsbshsus wrote:
| You mentioned in another comment you've used AI to write
| clean code, but here you mention you're a "shit coder".
| How do you know it's giving you clean code?
| fakedang wrote:
| I know the fundamentals but I'm a noob when it comes to
| coding with React or NextJS. Code that comes out from
| Claude is often segregated and modularized properly so
| that even I can follow the logic of the code, even if not
| the language and its syntax. If there's an issue with the
| code, causing it to fail at runtime, I am still able to
| debug it appropriately with my minimal language of JS. If
| any codebase can let me do that, then in my books that's
| a great codebase.
|
| Compare that to Gpt 4o which gives me a massive chunk of
| unsorted gibberish that I have to pore through and
| organize myself.
|
| Besides, most IBD MDs don't know if they're getting
| correct numbers either :).
| spaceman_2020 wrote:
| Careers are 30 years long
|
| Can you confidently say that an LLM won't be better than an
| average 22 year old coder within these 30 years?
| HappMacDonald wrote:
| Careers have failed to be 30 years long for a lot longer
| than 30 years now. That's one of the reasons that 4-year
| colleges have drastically lost their ROI, the other blade
| of those scissors being the stupendously rising tuition.
| AI is nothing but one more layer in the constantly
| growing substrate of computing technology a coder has to
| learn how to integrate into their toolbelts. Just like
| the layers that came before it: mobile, virtualization,
| networking, etc.
| snowwrestler wrote:
| Careers are still longer than 30 years. How many people
| do you think are retiring at 48 or 51 years old these
| days? It's a small minority. Most people work through 65:
| a career of about 45 years or more.
| taco_emoji wrote:
| yes, because this is still glorified autocomplete
| spaceman_2020 wrote:
| the average coder is worse than an autocomplete
|
| Too many people here have spent time in elite
| corporations and don't realize how mediocre the bottom
| 50th percentile of coding talent is
| weweweoo wrote:
| To be honest, if the bottom 50th percent of coding talent
| is going to be obsolete, I wonder what happens to rest of
| the "knowledge workers" in those companies. I mean people
| whose jobs consist of attending Teams meetings, making
| fancy powerpoint slides and reports, perhaps even excel
| if they are really competent. None of that is any more
| challenging for LLM than writing code. In fact replacing
| these jobs should be easier, since presentations and
| slides do not actually do anything, unlike a program that
| must perform a certain action correctly.
| cs391231 wrote:
| I've heard compelling arguments that we passed the "more
| people than jobs" threshold during the green revolution
| and as a civilization have collectively retrofitted UBI
| in the form of "fake email jobs" and endless layers of
| management. This also would explain
| https://wtfhappenedin1971.com/ pretty well.
|
| Either AI shatters this charade, or we make up some new
| laws to restrain it and continue to pretend all is well.
| smaudet wrote:
| Exactly. There's some need, perhaps, to keep these tools
| "up to date" because someone in a non-free country is
| going to use them in a horrendous manner and we should
| maybe know more about them (maybe).
|
| However, there is no good reason in a free society that
| this stuff should be widely accessible. Really, it should
| be illegal without a clearance, or need-to-know. We don't
| let just anyone handle the nukes...
| nyarlathotep_ wrote:
| This is true and yet companies (both Private and Public
| sector) spend literal billions on Accenture /Deloitte
| slop that runs budgets will into the 10s of millions.
|
| Skills aren't even something that dictates software
| spend, it seems.
| fullstop wrote:
| I tried it out and was able to put together a decent
| libevent server in c++ with smart pointers, etc, and a
| timer which prints out connection stats every 30s. It
| worked remarkably well.
|
| I'm trying not to look at it as a potential career-ending
| event, but rather as another tool in my tool belt. I've
| been in the industry for 25 years now, and this is _way_
| more of an advancement than things like IntelliSense ever
| was.
| rowanG077 wrote:
| Huh careers are 30 years long? I don't know where you
| live but it's more like 45 years long where I live. The
| retirement age is 67.
| littlestymaar wrote:
| > Can you confidently say that an LLM won't be better
| than an average 22 year old coder within these 30 years?
|
| No 22 years old coder is better than the open source
| library he's using taken straight from github, and yet
| he's the one who's getting paid for it.
|
| People who claim IA will disrupt software development are
| just missing the big picture here: software jobs are
| already unrecognizable from what it was just 20 years
| ago. AI is just another tool, and as long as execs won't
| bother use the tool by themselves, then they'll pay
| developers to do it instead.
|
| Over the past decades, writing code has become more and
| more efficient (better programming languages, better
| tooling, then enormous open source libraries) yet the
| number of developers kept increasing, it's Jevons
| paradox[1] in its purest form. So if past tells us
| anything, is that AI is going to create many new software
| developer jobs! (because the amount of people able to
| ship significant value to a customer is going to
| skyrocket, and customers' needs are a renewable
| resource).
|
| [1]: https://en.wikipedia.org/wiki/Jevons_paradox
| furyofantares wrote:
| Because you're being given superpowers and computers are
| becoming more useful than ever.
| RobinL wrote:
| I am cautiously optimistic. So much of building software is
| deciding what _should_ be built rather than the mechanics of
| writing code.
|
| I you like coding because of the things it lets you build,
| then LLMs are exciting because you can build those things
| faster.
|
| If on the other hand you enjoy the mental challenge but
| aren't interested in the outputs, then I think the future is
| less bright for you.
|
| Personally I enjoy coding for both reasons, but I'm happy to
| sacrifice the enjoyment and sense of accomplishment of
| solving hard problems myself if it means I can achieve more
| 'real world' outcomes.
|
| Another thing I'm excited about is that, as models improve,
| it's like having an expert tutor on hand at all times. I've
| always wanted an expert programmer on hand to help when I get
| stuck, and to critically evaluate my work and help me
| improve. Increasingly, now I have one.
| pdpi wrote:
| Even if LLMs take over the bulk of programming work, somebody
| still needs to write the prompts, and make sure the output
| actually matches what you wanted to achieve. That's just
| programming with different tools.
| joshstrange wrote:
| Software engineering teaches you a set of skills that are
| applicable in more places than just writing software. There
| are big parts of the job that cannot be done by LLMs (today)
| and if LLMs get better (or AGI happens) then enough other
| professions will be affected that we will all be in the same
| boat (no matter what you major in).
|
| LLMs are just tools, they help but they do not replace
| developers (yet).
| onemoresoop wrote:
| > LLMs are just tools, they help but they do not replace
| developers (yet)
|
| Yes but they will certainly have a lot of downward pressure
| on salaries for sure.
| epcoa wrote:
| The "progress" demonstrated in this example is to literally
| just extract bytes from the middle of a number:
|
| Is this task:
|
| "About 2 minutes later, these values were captured, again
| spaced 5 seconds apart.
|
| 0160093201 0160092d01 0160092801 0160092301 0160091e01"
|
| [Find the part that is changing]
|
| really even need an AI to assist (this should be a near
| instant task for a human with basic CS numerical skills)? If
| this is the type of task one thinks an AI would be useful for
| they are likely in trouble for other reasons.
|
| Also notable that you can cherry pick more impressive feats
| even from older models, so I don't necessarily think this
| proves progress.
|
| I still wouldn't get too carried away just yet.
| blibble wrote:
| I put his value into my hex editor and it instantly showed
| 900 in the data inspector pane
| UncleOxidant wrote:
| Semi-retired software/hardware engineer here. After my recent
| experiences with various coding LLMs (similar to the
| experience of the OP with the bluetooth fan protocol) I'm
| really glad I'm in a financial position such that I'm able to
| retire. The progress of these LLMs at coding has been
| astonishing over the last 18 months. Will they entirely
| replace humans? No. But as they increase programmer
| productivity fewer devs will be required. In my case the
| contract gig I was doing over this last summer I was able to
| do about 3 to 4X faster than I could've done it without LLMs.
| Yeah, they were generating a lot of boiler plate HDL code for
| me, but that still saved me several days of work at least.
| And then there was the test code that they generated which
| again saved me days of work. And their ability to explain old
| undocumented code that was part of the project was also
| extremely helpful. I was skeptical 18 months ago that any of
| this would be possible. Not anymore. I wasn't doing a project
| in which there would've been a lot of training examples.
| We're talking Verilog testbench generation based on multiple
| input Verilog modules, C++ code generation for a C program
| analyzer using libclang - none of this stuff would've worked
| just a few months back.
| DataDive wrote:
| I will add that I am grateful that I also got to experience
| a world where AI did not spew tons of code like a sausage-
| making machine.
|
| It was so satisfying to code up a solution where you knew
| you would get through it little by little.
| niemal_dev wrote:
| This.
| cs391231 wrote:
| This. I'm not terrified by total automation (In that case
| all jobs are going away and civilization is going to
| radically alter), I'm scared of selective deskilling and
| the field getting squeezed tighter and tighter leaving me
| functionally in a dead end.
| rachofsunshine wrote:
| Hey, kid.
|
| My name is Rachel. I'm the founder of company whose existence
| is contingent on the continued existence, employment, and
| indeed _competitive_ employment of software engineers, so I
| have as much skin in this game as you do.
|
| I worry about this a lot. I don't know what the chances are
| that AI wipes out developer jobs [EDIT: to clarify, in the
| sense that they become either much rarer or much lower-paid,
| which is sufficient] within a timescale relevant to my work
| (say, 3-5 years), but they aren't zero. Gun to my head, I peg
| that chance at perhaps 20%. That makes me more bearish on AI
| than the typical person in the tech world - Manifold thinks
| AI surpasses human researchers by the end of 2028 at 48% [1],
| for example - but 20% is most certainly not zero.
|
| That thought stresses me out. It's not just an existential
| threat to my business over which I have no control, it's a
| threat against which I cannot realistically hedge and which
| may disrupt or even destroy my life. It bothers me.
|
| But I do my work anyway, for a couple of reasons.
|
| One, progress on AI in posts like this is always going to be
| inflated. This is a marketing post. It's a post OpenAI wrote,
| and posted, to generate additional hype, business, and
| investment. There is some justified skepticism further down
| this thread, but even if you couldn't find a _reason_ to be
| skeptical, you ought to be skeptical _by default_ of such
| posts. I am an abnormally honest person by Silicon Valley
| founder standards, and even I cherry pick my marketing blogs
| (I just don 't outright make stuff up for them).
|
| Two, if AI surpasses a good software engineer, it probably
| surpasses just about everything else. This isn't a guarantee,
| but good software engineering is already one of the more
| challenging professions for humans, and there's no particular
| reason to think progress would stop exactly at making SWEs
| obsolete. So there's no good alternative here. There's no
| other knowledge work you could pivot to that would be a
| decent defense against what you're worried about. So you may
| as well play the hand you've got, even in the knowledge that
| it might lose.
|
| Three, in the world where AI _does_ surpass a good software
| engineer, there 's a decent chance it surpasses a good _ML_
| engineer in the near future. And once it does that, we 're in
| completely uncharted territory. Even if more extreme
| singularity-like scenarios don't come to pass, it doesn't
| need to be a singularity to become significantly superhuman
| to the point that almost nothing about the world in which we
| live continues to make any sense. So again, you lack any good
| alternatives.
|
| And four: *if this is the last era in which human beings
| matter, I want to take advantage of it!* I may be among the
| very last entrepreneurs or businesswomen in the history of
| the human race! If I don't do this now, I'll never get the
| chance! If you want to be a software engineer, do it now,
| because you might never get the chance again.
|
| It's totally reasonable to be scared, or stressed, or
| uncertain. Fear and stress and uncertainty are parts of life
| in far less scary times than these. But all you can do is
| play the hand you're dealt, and try not to be totally
| miserable while you're playing it.
|
| -----
|
| [1] https://manifold.markets/Royf214/will-ai-surpass-humans-
| in-c...
| huuhee3 wrote:
| I think CS skills will remain valuable, but you should try to
| build some domain specific knowledge in addition. Perhaps
| programmer roles will eventually merge with product owner /
| business person type of roles.
| mindcrime wrote:
| I don't want to lean into negativity here, and I'm far from
| an "AI Doomer".
|
| But... I will say I think the question you ask is a very fair
| question, and that there is, indeed, a LOT of uncertainty
| about what the future holds in this regard.
|
| So far the best reason we have for optimism is history: _so
| far_ the old adage has held up that "technology does destroy
| some jobs, but on balance it creates more new ones than it
| destroys." And while that's small solace to the buggy-whip
| maker or steam-engine engineer, things tend to work out in
| the long-run. However... history is suggestive, but far from
| conclusive. There is the well known "problem of induction"[1]
| which points out that we can't make definite predictions
| about the future based on past experience. And when those
| expectations are violated, we get "black swan events"[2]. And
| while they be uncommon, they do happen.
|
| The other issue with this question is, we don't really know
| what the "rate of change" in terms of AI improvement is. And
| we definitely don't know the 2nd derivative (acceleration).
| So a short-term guess that "there will be a job for you in 1
| year's time" is probably a fairly safe guess. But as a
| current student, you're presumably worried about 5 years, 10
| years, 20 years down the line and whether or not you'll still
| have a career. And the simple truth is, we can't be sure.
|
| So what to do? My gut feeling is "continue to learn software
| engineering, but make sure to look for ways to broaden your
| skill base, and position yourself to possibly move in other
| directions in the future". Eg, don't focus on just becoming a
| skilled coder in a particular language. Learn fundamentals
| that apply broadly, and - more importantly - learn about
| _how_ business work, learn "people skills"[3], develop
| domain knowledge in one or more domains, and generally learn
| as much as you can about "how the world works". Then from
| there, just "keep your head on a swivel" and stay aware of
| what's going on around you and be ready to make adjustments
| as needed.
|
| It might not also hurt to learn a thing or two about
| something that requires a physical presence (welding, etc.).
| And just in case a full-fledged cyberpunk dystopia
| develops... maybe start buying an extra box or two of
| ammunition every now and then, and study escape and evasion
| techniques, yadda yadda...
|
| [1]: http://en.wikipedia.org/wiki/Problem_of_induction
|
| [2]: https://en.wikipedia.org/wiki/Black_swan_theory
|
| [3]: https://www.youtube.com/watch?v=hNuu9CpdjIo
| algebra-pretext wrote:
| If you're going for FAANG most of your day isn't coding
| anyway.
| noshitsherlock wrote:
| There's so much software yet to to be written, so much to
| automate, so many niches to attack that you need not worry.
| It takes humans to know where to apply the technology based
| on their heart, not brains. Use AI in the direction only you
| can ascertain; and do it for the good of HUMANITY. It's a
| tool that makes the knowledge posterity has left us
| accessible, like mathematics. Go forth an conquer life's ills
| young man; It takes a human to know one. Don't worry, you're
| created in God's image.
| noshitsherlock wrote:
| Machines don't really "know" anything they just manipulate
| what is already known; Like a interactive book. It's just
| that this AI book is vast.
| noshitsherlock wrote:
| And the knowledge acquisition impedance is reduced
| smegger001 wrote:
| plumbing still looks like a safe choice for now.
| IncreasePosts wrote:
| What kind of student, at what kind of school?
|
| Are your peers getting internships at FANGs or hedge funds?
| Stick with it. You can probably bank enough money to make it
| worth it before shtf.
| orzig wrote:
| Three, though not slam dunks:
|
| 1. What other course of study are you confident would be
| better given an AI future? If there's a service sector job
| that you feel really called to, I guess you could shadow
| someone for a few days to see if you'd really like it?
|
| 2. Having spent a few years managing business dashboards for
| users, less than 25% ever routinely used the "user friendly"
| functionality we built to do semi-custom analysis. We needed
| 4 full time analytics engineers to spend at least half their
| time answering ad hoc questions that could have been self-
| served, despite an explicit goal of democratizing data. All
| that is to say; don't over estimate how quickly this will be
| taken up, even if it could technically do XYZ task
| (eventually, best-of-10) if prompted properly.
|
| 3. I don't know where you live, but I've spent most of my
| career 'competing' with developers in India who are paid
| 33-50% as much. They're literally teammates, it's not a
| hypothetical thing. And they've never stopped hiring in the
| US. I haven't been in the room for those decisions and don't
| want to open that can of worms here, but suffice to say it's
| not so simple as "cheaper per LoC wins"
| spaceman_2020 wrote:
| I honestly think that unless you're really passionate or
| really good, you shouldn't be a coder. If you, like the vast
| majority of coders today, picked it up in college or later,
| and mostly because of the promise of a fat paycheck, I can't
| really see a scenario where you would have a 30 year career
| stale2002 wrote:
| Sure. Software engineers are actually the best situated to
| take advantage of this new technology.
|
| Your concern would be like once C got invented, why should
| you bother being a software engineer? Because C is so much
| easier to use than assembly code!
|
| The answer, of course, is that software engineering will
| simply happen in even more powerful and abstract layers.
|
| But, you still might need to know how those lower layers
| work, even if you are writing less code in that layer
| directly.
| xcv123 wrote:
| C did not write itself.
|
| We now have a tool that writes code and solves problems
| autonomously. It's not comparable.
| almostuseful wrote:
| If at some point a competent senior software engineer can be
| automated away, I think we are so close to a possible 'AI
| singularity' in as much as that concept makes sense, that
| nothing really matters anyway.
|
| I don't know what will be automated first of the competent
| senior software engineer and say, a carpenter, but once the
| programmer has been automated away, the carpenter (and
| everything else) will follow shortly.
|
| The reasoning is that there is such a functional overlap
| between being a standard software engineer and an AI engineer
| or researcher, that once you can automate one, you can
| automate the other. Once you have automated the AI engineers
| and researchers, you have recursive self-improving AI and all
| bets are off.
|
| Essentially, software engineering is perhaps the only field
| where you shouldn't worry about automation, because once that
| has been automated, everything changes anyways.
| jstummbillig wrote:
| One angle: There are a million SMBs and various other
| institutions, using none or really shitty software, that
| could be xx% to xxx% times more productive with custom
| software that they would never have been able to afford
| before. Now they can, en masse, because you will be able to
| built it a lot faster.
|
| I have been coding a lot with AI recently. Understanding and
| putting into thought what is needed for the program to fix
| your problem remains as complex and difficult as ever.
|
| You need to pose a question for the AI to do something for
| you. Asking a good question is out of reach for a lot of
| people.
| agentultra wrote:
| There's an equal amount of hopium from the AI stans here as
| well.
|
| Hundreds of billions of dollars have been invested in a
| technology and they need to find a way to start making a
| profit or they're going to run out of VC money.
|
| You still have to know what to build and how to specify what
| you want. Plain language isn't great at being precise enough
| for these things.
|
| Some people say they'll keep using stuff like this as a tool.
| I wouldn't bet the farm that it's going to replace humans at
| any point.
|
| Besides, programming is fun.
| ayakang31415 wrote:
| If AI becomes good enough to replace software engineers, it
| has already become good enough to replace other brain jobs
| (lawyers, physicians, accountant, etc). I feel that software
| engineering is one of the very last jobs to be replaced by
| AI.
| bashfulpup wrote:
| There is little to no research that shows modern AI can
| perform even the most simple long-running task without
| training data on that exact problem.
|
| To my knowledge, there is no current AI system that can
| replace a white collar worker in any multistep task. The only
| thing they can do is support the worker.
|
| Most jobs are safe for the forseable future. If your job is
| highly repetitive and a company can produce a perfect dataset
| of it, I'd worry.
|
| Jobs like a factory worker and call center support are in
| danger. But the work is perfectly monitorable.
|
| Watch the GAIA benchmark. It's not nearly the complexity of a
| real-world job, but it would signal the start of an actual
| agentic system being possible.
| baq wrote:
| I'd argue the foreseeable future got a lot shorter in the
| last couple years.
| morningsam wrote:
| As soon as software development can be fully performed by
| AIs, it won't take long before all other jobs that can be
| performed in front of a computer follow, and after that it
| probably won't take long for practically the entire rest.
|
| This release has shifted my personal prediction of when this
| is going to happen further into the future, because OpenAI
| made a big deal hyping it up and it's nothing - preferred by
| humans over GPT-4o only a little more than half the time.
| _proofs wrote:
| just because something can generate an output for you, does
| not make a need for _discernment and application_ obsolete.
|
| like another commenter, i do not have a lot of faith, that
| people who do not have at minimum: fundamental fluency in
| programming (even with a dash of general software
| architecture and practices).
|
| there is no "push button generate and glueing components
| together in a way that can survive at scale and be
| maintainable" without knowing what the output means, and
| implies with respect to integration(s).
|
| however, those with the fluency, domain, and experience, will
| thrive, and continue thriving.
| taco_emoji wrote:
| This is not going to replace you. This isn't AGI.
| mmckelvy wrote:
| As others have said, LLMs still require engineers to produce
| quality output. LLMs do, however, make those engineers that
| use them much more productive. If this trend continues, I
| could see a scenario where an individual engineer could build
| a customized version of, say, Salesforce in a month or two.
| If that happens, you could make a solid case that companies
| paying $1mm+ per year for 12 different SaaS tools should just
| bring that in house. The upshot is you may still be writing
| software, but instead of building SaaS at Salesforce, you'll
| be working for their former customers or maybe as some sort
| of contractor.
| bunderbunder wrote:
| Actually cutting code is _maybe_ 10% of the job, and LLMs are
| absolute crap at the other 90%.
|
| They can't build and maintain relationships with
| stakeholders. They can't tell you why what you ask them to do
| is unlikely to work out well in practice and suggest
| alternative designs. They can't identify, document and
| justify acceptance criteria. They can't domain model. They
| can't architect. They can't do large-scale refactoring. They
| can't do system-level optimization. They can't work with that
| weird-ass code generation tool that some hotshot baked deeply
| into the system 15 years ago. They can't figure out why that
| fence is sitting out in the middle of the field for no
| obvious reason. etc.
|
| If that kind of stuff sounds like satisfying work to you, you
| should be fine. If it sounds terrible, you should pivot away
| now regardless of any concerns about LLMs, because, again,
| this is like 90% of the real work.
| xivzgrev wrote:
| Play it out
|
| Let's assume today a LLM is perfectly equivalent to a junior
| software engineer. You connect it to your code base, load in
| PRDs / designs, ask it to build it, and viola perfect code
| files
|
| 1) Companies are going to integrate this new technology in
| stages / waves. It will take time for this to really get
| broad adoption. Maybe you are at the forefront of working
| with these models
|
| 2) OK the company adopts it and fires their junior engineers.
| They start deploying code. And it breaks Saturday evening.
| Who is going to fix it? Customers are pissed. So there's lots
| to work out around support.
|
| 3) That problem is solved, we can perfectly trust a LLM to
| ship perfect code that never causes downstream issues and
| perfectly predicts all user edge cases.
|
| Never underestimate the power of corporate greediness.
| There's generally two phases of corporate growth - expansion
| and extraction. Expansion is when they throw costs out the
| window to grow. Extraction is when growth stops, and they
| squeeze customers & themselves.
|
| AI is going to cause at least a decade of expansion. It opens
| up so many use cases that were simply not possible before,
| and lots of replacement.
|
| Companies are probably not looking at their engineers looking
| to cut costs. They're more likely looking at them and saying
| "FINALLY, we can do MORE!"
|
| You won't be a coder - you'll be a LLM manager / wrangler.
| You will be the neck the company can choke if code breaks.
|
| Remember if a company can earn 10x money off your salary,
| it's a good deal to keep paying you.
|
| Maybe some day down the line, they'll look to squeeze
| engineers and lay some off, but that is so far off.
|
| This is not hopium, this is human nature. There's gold in
| them hills.
|
| But you sure as shit better be well versed in AI and using in
| your workflows - the engineers who deny it will be the ones
| who fall behind
| theptip wrote:
| If you want to get a career in software engineering because
| you want to write code all day, probably a bad time to be
| joining the field.
|
| If you are interested in using technology to create systems
| that add value for your users, there has never been a better
| time.
|
| GPT-N will let you scale your impact way beyond what you
| could do on your own.
|
| Your school probably isn't going to keep abreast with this
| tech so it's going to be more important to find side-projects
| to exercise your skills. Build a small project, get some
| users, automate as much as you can, and have fun along the
| way.
| rapind wrote:
| Here you go:
|
| I just watched a tutorial on how to leverage v1, claude, and
| cursor to create a marketing page. The result was a
| convoluted collection of 20 or so TS files weighing a few MB
| instead of a 5k HTML file you could hand bomb in less time.
|
| I wouldn't feel too threatened yet. It's still just a tool
| and like any tool, can be wielded horribly.
| CamperBob2 wrote:
| _I just watched a tutorial on how to leverage v1, claude,
| and cursor to create a marketing page. The result was a
| convoluted collection of 20 or so TS files weighing a few
| MB instead of a 5k HTML file you could hand bomb in less
| time._
|
| And if you hired an actual team of developers to do the
| same thing, it is very likely that you'd have gotten a
| convoluted collection of 20 or so TS files weighing a few
| MB instead of a 5k HTML file you could hand bomb in less
| time.
| weweweoo wrote:
| What's the alternative? If AI is going to replace software
| engineers, there is no fundamental reason they couldn't
| replace almost all other knowledge workers as well. No matter
| the field, most of it is just office work managing,
| transforming and building new information, applying existing
| knowledge on new problems (that probably are not very unique
| in grand scheme of things).
|
| Except for medical doctors, nurses, and some niche
| engineering professions, I really struggle to think of jobs
| requiring higher education that couldn't be largely automated
| by an LLM that is smart enough to replace a senior software
| engineer. These few jobs are protected mainly by the physical
| aspect, and low tolerance for mistakes. Some skilled trades
| may also be protected, at least if robotics don't improve
| dramatically.
|
| Personally, I would become a doctor if I could. But of all
| things I could've studied excluding that, computer science
| has probably been one of the better options. At least it
| teaches problem solving and not just memorization of facts.
| Knowing how to code may not be that useful in the future, but
| the process of problem solving is going nowhere.
| pid-1 wrote:
| Why can't medical doctors be automated?
| weweweoo wrote:
| Mainly the various physical operations many of them
| perform on daily basis (due to limitations of robotics),
| plus liability issues in case things go wrong and
| somebody dies. And finally, huge demand due to aging
| population worldwide.
|
| I do believe some parts of their jobs will be automated,
| but not enough (especially with growing demand) to really
| hurt career prospects. Even for those parts, it will take
| a long a while due to the regulated nature of the sector.
| gensym wrote:
| 1. The demand for software is insatiable. The biggest gate
| has been the high costs due to limited supply of the time of
| the people who know how to do it. In the near term, AI will
| make the cost of software (not of software devs, but the
| software itself) decrease while demand for new software will
| increase, especially as software needs to be created to take
| advantage of new UI tools.
|
| I've been in software engineering for over 20 years. I've
| seen massive growth in the productivity of software
| engineers, and that's resulted in greater demand for them. In
| the near term, AI should continue this trend.
|
| 2. It's possible that at some point, AI will advance to where
| we can remove software engineers from the loop. We're not
| even close to that point yet. In the mean time, software
| engineering is an excellent way to learn about other business
| problems so that you'll be well-situated to address them
| (whatever they'll be at that time).
| CuriouslyC wrote:
| It still has issues with crossing service boundaries, working
| in systems, stuff like that. That stuff will get better but
| the amount of context you need to load to get good results
| with a decently sized system will still be prohibitive. The
| software engineer skillset is being devalued but architecture
| and systems thinking is still going to be valuable for quite
| some time.
| p1necone wrote:
| For basically all the existing data we have, efficiency
| improvements _always_ result in _more_ work, not less.
|
| Humans never say "oh neat I can do _thing_ with 10% of the
| effort now, guess I 'll go watch tv for the rest of the
| week", they say "oh neat I can do _thing_ with 10% of the
| effort now, I 'm going to hire twice as many people and
| produce like 20x as much as I was before because there's so
| much less risk to scaling now."
|
| I think there's enough unmet demand for software that
| efficiency increases from automation are going to be eaten up
| for a long time to come.
| MattDaEskimo wrote:
| Most of these posts are from romantics.
|
| Software engineering will be a profession of the past,
| similar to how industrial jobs hardly exist.
|
| If you have a strong intuition with software & programming
| you may want to shift towards applying AI into already
| existing solutions.
| weweweoo wrote:
| The question is, why wouldn't nearly all other white collar
| jobs be professions of the past as well? Does the average
| MBA or whatever possess some unique knowledge that you
| couldn't generate with an LLM fed with company data? What
| is the alternative career path?
|
| I think software engineers who also understand business may
| yet have an advantage over pure business people, who don't
| understand technology. They should be able to tell AI what
| to do, and evaluate the outcome. Of course "coders" who
| simply produce code from pre-defined requirements will
| probably not have a good career.
| MattDaEskimo wrote:
| They will be of the past.
|
| This is typical of automation. First, there are numerous
| workers, then they are reduced to supervisors, then they
| are gone.
|
| The future of business will be managing AI, so I agree
| with what you're saying. However most software engineers
| have a very strong low level understanding of
| programming. Not a business sense of application
| CamperBob2 wrote:
| To fix the robots^W^W^Wbuild these things.
|
| I've been around for multiple decades. Nothing this
| interesting has happened since at least 1981, when I first
| got my hands on a TRS-80. I dropped out of college to work on
| games, but these days I would drop out of college to work on
| ML.
| eitally wrote:
| While the reasoning and output of ChatGPT is impressive (and,
| imho, would pass almost all coding interviews), I'm primarily
| impressed with the logical flow, explanation and
| thoroughness. The actual coding and problem solving isn't
| complex, and that gets to your question: someone (in this
| case, the OP), still needed to be able to figure out how to
| extract useful data and construct a stimulating prompt to
| trigger the LLM into answering in this way. As others have
| posted, none of the popular LLMs behave identically, either,
| so becoming an exert tool-user with one doesn't necessarily
| translate to the next.
|
| I would suggest the fundamentals of computer science and
| software engineering are still critically important ... but
| the development of new code, and especially the translation
| or debugging of existing code is where LLMs will shine.
|
| I currently work for an SAP-to-cloud consulting firm. One of
| the singlemost compelling use cases for LLMs in this area is
| to analyze custom code (running in a client's SAP
| environment), and refactor it to be compatible with current
| versions of SAP as a cloud SaaS. This is a specialized domain
| but the concept applies broadly: pick some crufty codebase
| from somewhere, run it through an LLM, and do a lot of mostly
| copying & pasting of simpler, modern code into your new
| codebase. LLMs take a lot of the drudgery out of this, but it
| still requires people who know what they're looking at, and
| _could_ do it manually. Think of the LLM as giving you an
| efficiency superpower, not replacing you.
| holoduke wrote:
| Software development just becomes a level tier higer for most
| developers. Instead of writing everything yourself you will
| be more like an orchestrator. Tell the system to write this,
| tell the system to connect that and this etc. You still need
| to understand code. But maybe in the future even that part
| becomes unreadable for us. We only understand the high level
| concepts.
| andrewchambers wrote:
| I don't think programming is any less safe than any other
| office job tbh. Focus on problem solving and using these
| tools to your advantage and choose a field you enjoy.
| rdevsrex wrote:
| Just because we have machines that can lift much more than
| any human ever could, it doesn't mean that working out is
| useless.
|
| In the same way, training your mind is not useless. Perhaps
| as things develop, we will get back to the idea that the
| purpose of education is not just to get a job, but to help
| you become a better and more virtuous person.
| romeros wrote:
| is it better than Claude?
| bartman wrote:
| Neither Sonnet nor Opus could solve it or get close in a
| minimal test I did just now, using the same prompt as above.
|
| Sonnet: https://pastebin.com/24QG3JkN
|
| Opus: https://pastebin.com/PJM99pdy
| hmottestad wrote:
| I think this new model is a generational leap above Claude
| for tasks that require complex reasoning.
| antman wrote:
| second is very blurry
| bartman wrote:
| When you click on the image it loads a higher res version.
| jazzyjackson wrote:
| Isn't there a big "Share" button at the top right of the
| chatgpt interface? Or are you using another front end?
| bartman wrote:
| In ChatGPT for Business it limits sharing among users in my
| org, without an option for public sharing.
| fshbbdssbbgdd wrote:
| I often click on those links and get an error that they are
| unavailable. I'm not sure if it's openAI trying to prevent
| people from sharing evidence of the model behaving badly, or
| an innocuous explanation like the links are temporary.
| arunv wrote:
| They were probably generated using a business account, and
| the business does not allow public links.
| coder543 wrote:
| The link also breaks if the original user deletes the chat
| that was being linked to, whether on purpose or without
| realizing it would also break the link.
| OutOfHere wrote:
| Even for regular users, the Share button is not always
| available or functional. It works sometimes, and other times
| it disappears. For example, since today, I have no Share
| button at all for chats.
| JieJie wrote:
| My share chat link moved into the sidebar in the ... menu
| to the right of each chat title (MacOS Safari).
| OutOfHere wrote:
| Ah, I see it there now. Thanks.
| baal80spam wrote:
| Thanks for sharing this, incredible stuff.
| GaggiX wrote:
| Did you edit the message? I cannot see anything now in the
| screenshot, too low resolution
| bartman wrote:
| You need to click on the image for the high res version to
| load. Sorry, it's awkward.
| GaggiX wrote:
| The website seems to redirect me to a low resolution image,
| the first time I clicked on the link it worked as you are
| saying.
| fwip wrote:
| What's the incredible part here? Being able to write code to
| turn hex into decimal?
| fwip wrote:
| Also, if you actually read the "chain of thought" contains
| several embarrassing contradictions and incoherent sentences.
| If a junior developer wrote this analysis, I'd send them back
| to reread the fundamentals.
| CooCooCaCha wrote:
| What about thoughts themselves? There are plenty of times I
| start a thought and realize it doesn't make sense. It's
| part of the thinking process.
| soheil wrote:
| Great progress, I asked GPT-4o and o1-preview to create a
| python script to make $100 quickly, o1 came up with a very
| interesting result:
|
| https://x.com/soheil/status/1834320893331587353
| fsndz wrote:
| > Asking the models to determine if my code is equivalent to
| what they reverse engineered resulted in a nuanced and thorough
| examination, and eventual conclusion that it is equivalent.
|
| Did you actually implement to see if it works out of the box ?
|
| Also if you are a free users or accepted that your chats should
| be used for training then maybe o1 is was just trained on your
| previous chat and so now knows how to reason about that
| particular type of problems
| bongodongobob wrote:
| That's not how LLM training works.
| fsndz wrote:
| so it is impossible to use the free user chats to train
| models ??????
| bartman wrote:
| That is an interesting thought. This was all done in an
| account that is opted out of training though.
|
| I have tested the Python code o1 created to decode the
| timestamps and it works as expected.
| jeffpeterson wrote:
| Very cool. It gets the conclusion right, but it did confuse
| itself briefly after interpreting `256 * last_byte +
| second_to_last_byte` as big-endian. It's neat that it corrected
| the confusion, but a little unsatisfying that it doesn't
| explicitly identify the mistake the way a human would.
| avodonosov wrote:
| The screenshot [1] is not readable for me. Chrome, Android.
| It's so blurry that I cant recognize a single character. How do
| other people read it? The resolution is 84x800.
| rovr138 wrote:
| When I click on the image, it expands to full res,
| 1713x16392.3
| deathanatos wrote:
| > _it expands to full res, 1713x16392.3_
|
| Three tenths of a pixel is an _interesting_ resolution...
|
| (The actual res is 1045 x 10000 ; you've multiplied by
| 1.63923 somehow...?)
| rovr138 wrote:
| I agree,
|
| But it's what I got when I went to Inspect element >
| hover over the image
|
| Size it expanded to vs real image size I guess
| Jerrrrrrry wrote:
| Pixels have been "non-real" for a long time.
| mikebridgman wrote:
| Click on it for full resolution
| smusamashah wrote:
| When you open on phone, switch to "desktop site" via browser
| three dots menu
| daemonologist wrote:
| Direct link to full resolution:
| https://i.postimg.cc/D74LJb45/SCR-20240912-sdko.png
| guiambros wrote:
| FYI, there's a "Save ChatGPT as PDF" Chrome extension [1].
|
| I wouldn't use on a ChatGPT for Business subscription (it may
| be against your company's policies to export anything), but
| very convenient for personal use.
|
| https://chromewebstore.google.com/detail/save-chatgpt-as-pdf...
| andraz wrote:
| What is the brand of the fan? Same problem here with
| proprietary hood fan...
| bartman wrote:
| InVENTer Pulsar
| smusamashah wrote:
| What if you copy the whole reasoning process example provided
| by OpenAI, use it as a system prompt (to teach how to reason),
| use that system prompt in Claude, got4o etc?
| azeirah wrote:
| It might work a little bit. It's like doing few shot
| prompting instead of training it to reason.
| fragmede wrote:
| I'm impressed. I had two modified logic puzzles where ChatGPT-4
| fails but o1 succeeds. The training data had too many instances
| of the unmodified puzzle, so 4 wouldn't get it right. o1
| manages to not get tripped up by them.
|
| https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...
|
| https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
| 8thcross wrote:
| This is a brilliant hypothesis deconstruction. I am sure others
| will now be able to test as well and this should confirm their
| engineering.
| jmartin2683 wrote:
| Per-token billing will be lit
| drzzhan wrote:
| "hidden chain of thought" is basically the finetuned prompt isn't
| it? The time scale x-axis is hidden as well. Not sure how they
| model the gpt for it to have an ability to decide when to stop
| CoT and actually answer.
| holmesworcester wrote:
| Since ChatGPT came out my test has been, can this thing write me
| a sestina.
|
| It's sort of an arbitrary feat with language and following
| instructions that would be annoying for me and seems impressive.
|
| Previous releases could not reliably write a sestina. This one
| can!
| fraboniface wrote:
| Some commenters seem a bit confused as to how this works. Here is
| my understanding, hoping it helps clarify things.
|
| Ask something to a model and it will reply in one go, likely
| imperfectly, as if you had one second to think before answering a
| question. You can use CoT prompting to force it to reason out
| loud, which improves quality, but the process is still linear.
| It's as if you still had one second to start answering but you
| could be a lot slower in your response, which removes some
| mistakes.
|
| Now if instead of doing that you query the model once with CoT,
| then ask it or another model to critically assess the reply, then
| ask the model to improve on its first reply using that feedback,
| then keep doing that until the critic is satisfied, the output
| will be better still. Note that this is a feedback loop with
| multiple requests, which is of different nature that CoT and much
| more akin to how a human would approach a complex problem. You
| can get MUCH better results that way, a good example being Code
| Interpreter. If classic LLM usage is system 1 thinking, this is
| system 2.
|
| That's how o1 works at test time, probably.
|
| For training, my guess is that they started from a model not that
| far from GPT-4o and fine-tuned it with RL by using the above
| feedback loop but this time converting the critic to a reward
| signal for a RL algorithm. That way, the model gets better at
| first guessing and needs less back and forth for the same output
| quality.
|
| As for the training data, I'm wondering if you can't somehow get
| infinite training data by just throwing random challenges at it,
| or very hard ones, and let the model think about/train on them
| for a very long time (as long as the critic is unforgiving
| enough).
| hidelooktropic wrote:
| > THERE ARE THREE R'S IN STRAWBERRY
|
| It finally got it!!!
| acomjean wrote:
| I always think to a professor that was consulting on some civil
| engineering software. He found a bug in the calculation it was
| using to space rebar placed in concrete, based on looking at it
| was spitting out and thinking that looks wrong.
|
| This kind of thing makes me nervous.
| zh3 wrote:
| Question here is about the "reasoning" tag - behind the scenes,
| is this qualitively different fron stringing words together on a
| statistical basis? (aside from backroom tweaking and some
| randomisation).
| canjobear wrote:
| First shot, I gave it a medium-difficulty math problem, something
| I actually wanted the answer to (derive the KL divergence between
| two Laplace distributions). It thought for a long time, and still
| got it wrong, producing a plausible but wrong answer. After some
| prodding, it revised itself and then got it wrong again. I still
| feel that I can't rely on these systems.
| spaceman_2020 wrote:
| Look where you were 3 years ago, and where you are now.
|
| And then imagine where you will be in 5 more years.
|
| If it can _almost_ get a complex problem right now, I 'm dead
| sure it will get it correct within 5 years
| colonelspace wrote:
| > I'm dead sure it will get it correct within 5 years
|
| You might be right.
|
| But plenty of people said we'd all be getting around in self-
| driving cars _for sure_ 10 years ago.
| neevans wrote:
| we do have self driving car but since it directly affects
| people's life it needs to be close to 100% accurate and no
| margin of errors. Not necessarily the case for LLMs.
| AnIrishDuck wrote:
| I'm not? The history of AI development is littered with
| examples of false starts, hidden traps, and promising
| breakthroughs that eventually expose deeper and more
| difficult problems [1].
|
| I wouldn't be _shocked_ if it could eventually get it right,
| but _dead sure_?
|
| 1. https://en.wikipedia.org/wiki/AI_winter
| taco_emoji wrote:
| Have you never heard of "local maxima"? Why are you so
| certain another 5 years will provide any qualitative
| advancement at all?
| mogoh wrote:
| But can it now say "I don't know." ? Or can it evaluate its
| own results and came to the conclusion that its just a wild
| guess?
|
| I am still impressed by the progress though.
| evilfred wrote:
| what makes you so "dead sure"? it's just hallucinating as
| always
| methodical wrote:
| You're dead sure? I wouldn't say anything definite about
| technology advancements. People seem to underestimate the
| last 20% of the problem and only focus on the massive 80%
| improvements up to this point.
| zer0tonin wrote:
| The progress since GPT-3 hasn't been spectacularly fast.
| ActorNightly wrote:
| Getting complex problem = having the solution in some form in
| the training dataset.
|
| All we are gonna get is better and better googles.
| closeparen wrote:
| It is not at all clear that "produce correct answer" is the
| natural endpoint of "produce plausible on-topic utterances
| that look like they could be answers." To do the former you
| need to know something about the underlying structure of
| reality (or have seen the answer before), to do the latter
| you only need to be good at pattern-matching and language.
| dontlikeyoueith wrote:
| I still don't have a Mr. Fusion in my house, FYI.
|
| We always overestimate the future.
| m3kw9 wrote:
| Maybe you are wrong if you don't know the answer?
| fzaninotto wrote:
| It can solve sudoku. It took 119s to solve this easy grid:
|
| _ 7 8 4 1 _ _ _ 9
|
| 5 _ 1 _ 2 _ 4 7 _
|
| _ 2 9 _ 6 _ _ _ _
|
| _ 3 _ _ _ 7 6 9 4
|
| _ 4 5 3 _ _ 8 1 _
|
| _ _ _ _ _ _ 3 _ _
|
| 9 _ 4 6 7 2 1 3 _
|
| 6 _ _ _ _ _ 7 _ 8
|
| _ _ _ 8 3 1 _ _ _
| fzaninotto wrote:
| It seems to be unable to solve hard sudokus, like the following
| one where it gave 2 wrong answers before abandoning.
|
| +-------+-------+-------+ | 6 . . | 9 1 . | . . . | | 2 . 5 | .
| . . | 1 . 7 | | . 3 . | . 2 7 | 5 . . |
| +-------+-------+-------+ | 3 . 4 | . . 1 | . 2 . | | . 6 . | 3
| . . | . . . | | . . 9 | . 5 . | . 7 . |
| +-------+-------+-------+ | . . . | 7 . . | 2 1 . | | . . . | .
| 9 . | 7 . 4 | | 4 . . | . . . | 6 8 5 |
| +-------+-------+-------+
|
| So we're safe for another few months.
| the_king wrote:
| Peter Thiel was widely criticized this spring when he said that
| AI "seems much worse for the math people than the word people."
|
| So far, that seems to be right. The only thing o1 is worse at is
| writing.
| MillionOClock wrote:
| What is the maximum context size in the web UI?
| kherud wrote:
| Aren't LLMs much more limited on the amount of output tokens than
| input tokens? For example, GPT-4o seems to support only up to 16
| K output tokens. I'm not completely sure what the reason is, but
| I wonder how that interacts with Chain-of-Thought reasoning.
| trissi1996 wrote:
| Not really.
|
| There's no fundamental difference between input and output
| tokens technically.
|
| The internal model space is exactly the same after evaluating
| some given set of token, no matter which of them were produced
| by the prompter or the model.
|
| The 16k output token limit is just an arbitrary limit in the
| chatgpt interface.
| novaleaf wrote:
| boo, they are hiding the chain of thought from user output (the
| great improvement here)
|
| > Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue the
| chain of thought monitoring, we have decided not to show the raw
| chains of thought to users. We acknowledge this decision has
| disadvantages. We strive to partially make up for it by teaching
| the model to reproduce any useful ideas from the chain of thought
| in the answer. For the o1 model series we show a model-generated
| summary of the chain of thought.
| owenpalmer wrote:
| "The Future Of Reasoning" by Vsauce [0] is a fascinating pre-AI-
| era breakdown of how human reasoning works. Thinking about it in
| terms of LLMS is really interesting.
|
| [0]: https://www.youtube.com/watch?v=_ArVh3Cj9rw
| LarsDu88 wrote:
| I wonder if this architecture is just asking a chain of thought
| prompt, or whether they built a diffusion model.
|
| The old problem with image generation was that single pass
| techniques like GANs and VAEs had to do everything in one go.
| Diffusion models wound up being better by doing things
| iteratively.
|
| Perhaps this is a diffusion model for text (top ICML paper this
| year was related to this).
| devit wrote:
| They claim it's available in ChatGPT Plus, but for me clicking
| the link just gives GPT-4o Mini.
| lukev wrote:
| This is a pretty big technical achievement, and I am excited to
| see this type of advancement in the field.
|
| However, I am very worried about the utility of this tool given
| that it (like all LLMs) is still prone to hallucination. Exactly
| who is it for?
|
| If you're enough of an expert to critically judge the output,
| you're probably just as well off doing the reasoning yourself. If
| you're not capable of evaluating the output, you risk relying on
| completely wrong answers.
|
| For example, I just asked it to evaluate an algorithm I'm working
| on to optimize database join ordering. Early in the reasoning
| process it confidently and incorrectly stated that "join costs
| are usually symmetrical" and then later steps incorporated that,
| trying to get me to "simplify" my algorithm by using an
| undirected graph instead of a directed one as the internal data
| structure.
|
| If you're familiar with database optimization, you'll know that
| this is... very wrong. But otherwise, the line of reasoning was
| cogent and compelling.
|
| I worry it would lead me astray, if it confidently relied on a
| fact that I wasn't able to immediately recognize was incorrect.
| ramesh31 wrote:
| >If you're enough of an expert to critically judge the output,
| you're probably just as well off doing the reasoning yourself.
|
| Thought requires energy. A lot of it. Humans are for more
| efficient in this regard than LLMs, but then a bicycle is also
| much more efficient than a race car. I've found that even when
| they are hilariously wrong about something, simply the
| directionality of the line of reasoning can be enough to
| usefully accelerate my own thought.
| OkGoDoIt wrote:
| Some practical notes from digging around in their documentation:
| In order to get access to this, you need to be on their tier 5
| level, which requires $1,000 total paid and 30+ days since first
| successful payment.
|
| Pricing is $15.00 / 1M input tokens and $60.00 / 1M output
| tokens. Context window is 128k token, max output is 32,768
| tokens.
|
| There is also a mini version with double the maximum output
| tokens (65,536 tokens), priced at $3.00 / 1M input tokens and
| $12.00 / 1M output tokens.
|
| The specialized coding version they mentioned in the blog post
| does not appear to be available for use.
|
| It's not clear if the hidden chain of thought reasoning is billed
| as paid output tokens. Has anyone seen any clarification about
| that? If you are paying for all of those tokens it could add up
| quickly. If you expand the chain of thought examples on the blog
| post they are extremely verbose.
|
| https://platform.openai.com/docs/models/o1
| https://openai.com/api/pricing/
| https://platform.openai.com/docs/guides/rate-limits/usage-ti...
| activatedgeek wrote:
| Reasoning tokens are indeed billed as output tokens.
|
| > While reasoning tokens are not visible via the API, they
| still occupy space in the model's context window and are billed
| as output tokens.
|
| From here: https://platform.openai.com/docs/guides/reasoning
| baq wrote:
| This is concerning - how do you know you aren't being fleeced
| out of your money here...? You'll get your results, but did
| you _really_ use that much?
| rsanek wrote:
| obfuscated billing has long been a staple of all great
| cloud products. AWS innovated in the space and now many
| have followed in their footsteps
| lolinder wrote:
| Also, now we're paying for output tokens that aren't even
| output, with no good explanation for why these tokens
| should be hidden from the person who paid for them.
| amrrs wrote:
| The CoT is billed as output tokens. Mentioned in the docs where
| it talks about reasoning
| vdfs wrote:
| We just receivied this email:
|
| Hi there,
|
| I'm x, PM for the OpenAI API. I'm pleased to share with you our
| new series of models, OpenAI o1. We've developed these models
| to spend more time thinking before they respond. They can
| reason through complex tasks and solve harder problems than
| previous models in science, coding, and math.
|
| As a trusted developer on usage tier 5, you're invited to get
| started with the o1 beta today. Read the docs You have access
| to two models: Our larger model, o1-preview,
| which has strong reasoning capabilities and broad world
| knowledge. Our smaller model, o1-mini, which is 80%
| cheaper than o1-preview.
|
| Try both models! You may find one better than the other for
| your specific use case. Both currently have a rate limit of 20
| RPM during the beta. But keep in mind o1-mini is faster,
| cheaper, and competitive with o1-preview at coding tasks (you
| can see how it performs here). We've also written up more about
| these models in our blog post.
|
| I'm curious to hear what you think. If you're on X, I'd love to
| see what you build--just reply to our post.
|
| Best, OpenAI API
| sashank_1509 wrote:
| I have access to this and there is no way I spend more than 50$
| on OpenAI api. I have ChatGPT + since day q though (240$
| probably in total)
| rpmisms wrote:
| You missed your raise key on "day q"
| Buttons840 wrote:
| I am a Plus user and pay $20 per month. I have access to the o1
| models.
| liamwire wrote:
| Unless this is specifically relating to API access, I don't
| think it's correct. I've been paying for ChatGPT via the App
| Store IAP for around a year or less, and I've already got both
| o1-preview and o1-mini available in-app.
| OkGoDoIt wrote:
| Yes, I was referring to API access specifically. Nothing in
| the blog post or the documentation mentions access to these
| new models on ChatGPT, and even as a paid user I'm not seeing
| them on there (Edit: I am seeing it now in the app). But
| looks like a bunch of other people in this discussion do have
| it on ChatGPT, so that's exciting to hear.
| arnaudsm wrote:
| Some of the queries run for multiple minutes. 40 tokens/sec is
| too slow for CoT.
|
| I hope OpenAI is investing in low-latency like Groq's tech that
| can reach 1k tokens/sec.
| anigbrowl wrote:
| _you need to be on their tier 5 level, which requires $1,000
| total paid and [...]_
|
| Good opening for OpenAI's competitors to run a 'we're not
| snobs' promotion.
| jonahx wrote:
| I am an ordinary plus user (since it was released more or less)
| and have access.
| jdthedisciple wrote:
| I challenged it to solve the puzzle in my profile info.
|
| It failed ;)
| jupi2142 wrote:
| Using codeforces as a benchmark feels like a cheat, since OpenAI
| use to pay us chump change to solve codeforces questions and
| track our thought process on jupyter notebook.
| rcarmo wrote:
| Here's an unpopular take on this:
|
| "We had the chance to make AI decision-making auditable but are
| locking ourselves out of hundreds of critical applications by not
| exposing the chain of thought."
|
| One of the key blockers in many customer discussions I have is
| that AI models are not really auditable and that automating
| complex processes with them (let alone debug things when
| "reasoning" goes awry) is difficult if not impossible unless you
| do multi-shot and keep track of all the intermediate outputs.
|
| I really hope they expose the chain of thought as some sort of
| machine-parsable output, otherwise no real progress will have
| been made (many benchmarks are not really significant when you
| try to apply LLMs to real-life applications and use cases...)
| fwip wrote:
| I suspect that actually reading the "chain of thought" would
| reveal obvious "logic" errors embarrassingly often.
| rcarmo wrote:
| It would still be auditable. In a few industries that is the
| only blocker for adoption--even if the outputs are incorrect.
| thimabi wrote:
| I believe that is the case. Out of curiosity, I had this
| model try to solve a very simple Sudoku puzzle in ChatGPT,
| and it failed spectacularly.
|
| It goes on and on making reasoning mistakes, and always ends
| up claiming that the puzzle is unsolvable and apologizing. I
| didn't expect it to solve the puzzle, but the whole reasoning
| process seems fraught with errors.
| 015a wrote:
| Here's a video demonstration they posted on YouTube:
| https://www.youtube.com/watch?v=50W4YeQdnSg
| kfrane wrote:
| I was a bit confused when looking at the English example for
| Chain-Of-Thought. It seems that the prompt is a bit messed up
| because the whole statement is bolded but it seems that only
| "appetite regulation is a field of staggering complexity" part
| should be bolded. Also that's how it shows up in the o1-preview
| response when you open the Chain of thought section.
| derefr wrote:
| So, it's good at hard-logic reasoning (which is great, and no
| small feat.)
|
| Does this reasoning capability generalize outside of the
| knowledge domains the model was trained to reason about, into
| "softer" domains?
|
| For example, is O1 better at comedy (because it can reason better
| about what's funny)?
|
| Is it better at poetry, because it can reason about rhyme and
| meter?
|
| Is it better at storytelling as an extension of an existing input
| story, because it now will first analyze the story-so-far and
| deduce aspects of the characters, setting, and themes that the
| author seems to be going for (and will ask for more information
| about those things if it's not sure)?
| fsndz wrote:
| My point of view: this is a real advancement. I've always
| believed that with the right data allowing the LLM to be trained
| to imitate reasoning, it's possible to improve its performance.
| However, this is still pattern matching, and I suspect that this
| approach may not be very effective for creating true
| generalization. As a result, once o1 becomes generally available,
| we will likely notice the persistent hallucinations and faulty
| reasoning, especially when the problem is sufficiently new or
| complex, beyond the "reasoning programs" or "reasoning patterns"
| the model learned during the reinforcement learning phase.
| https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
| abhorrence wrote:
| > As a result, once o1 becomes generally available, we will
| likely notice the persistent hallucinations and faulty
| reasoning, especially when the problem is sufficiently new or
| complex, beyond the "reasoning programs" or "reasoning
| patterns" the model learned during the reinforcement learning
| phase.
|
| I had been using 4o as a rubber ducky for some projects
| recently. Since I appeared to have access to o1-preview, I
| decided to go back and redo some of those conversations with
| o1-preview.
|
| I think your comment is spot on. It's definitely an
| advancement, but still makes some pretty clear mistakes and
| does some fairly faulty reasoning. It especially seems to have
| a hard time with causal ordering, and reasoning about
| dependencies in a distributed system. Frequently it gets the
| relationships backwards, leading to hilarious code examples.
| fsndz wrote:
| True. I just extensively tested o1 and came to the same
| conclusion.
| geenkeuse wrote:
| Average Joe's like myself will build our apps end to end with the
| help of AI.
|
| The only shops left standing will be Code Auditors.
|
| The solopreneur will wing it, without them, but enterprises will
| take the (very expensive) hit to stay safe and compliant.
|
| Everyone else needs to start making contingency plans.
|
| Magnus Carlsen is the best chess player in the world, but he is
| not arrogant enough to think he can go head to head with
| Stockfish and not get a beating.
| morningsam wrote:
| What sticks out to me is the 60% win rate vs GPT-4o when it comes
| to actual usage by humans for programming tasks. So in reality
| it's barely better than GPT-4o. That the figure is higher for
| mathematical calculation isn't surprising because LLMs were much
| worse at that than at programming to begin with.
| quirino wrote:
| I'm not sure that's the right way to interpret it.
|
| If some tasks are too easy, both models might give satisfactory
| answers, in which case the human preference might as well be a
| coin toss.
|
| I don't know the specifics of their methodology though.
| pknerd wrote:
| I m wondering, what kind of "AI wrappers" will emerge from this
| model.
| wesleyyue wrote:
| Just added o1 to https://double.bot if anyone would like to try
| it for coding.
|
| ---
|
| Some thoughts:
|
| * The performance is really good. I have a private set of
| questions I note down whenever gpt-4o/sonnet fails. o1 solved
| everything so far.
|
| * It really is quite slow
|
| * It's interesting that the chain of thought is hidden. This is I
| think the first time where OpenAI can improve their models
| without it being immediately distilled by open models. It'll be
| interesting to see how quickly the oss field can catch up
| technique-wise as there's already been a lot of inference time
| compute papers recently [1,2]
|
| * Notably it's not clear whether o1-preview as it's available now
| is doing tree search or just single shoting a cot that is
| distilled from better/more detailed trajectories in the training
| distribution.
|
| [1](https://arxiv.org/abs/2407.21787)
|
| [2](https://arxiv.org/abs/2408.03314)
| TheMiddleMan wrote:
| Trying out Double now.
|
| o1 did a significantly better job converting a JavaScript file
| to TypeScript than Llama 3.1 405B, GitHub Copilot, and Claude
| 3.5. It even simplified my code a bit while retaining the same
| functionality. Very impressive.
|
| It was able to refactor a ~160 line file but I'm getting an
| infinite "thinking bubble" on a ~420 line file. Maybe
| something's timing out with the longer o1 response times?
| h1fra wrote:
| Having read the full transcript I don't get how it counted 22
| letters for mynznvaatzacdfoulxxz. It's nice that it corrected
| itself but a bit worrying
| kmeisthax wrote:
| >We believe that a hidden chain of thought presents a unique
| opportunity for monitoring models. Assuming it is faithful and
| legible, the hidden chain of thought allows us to "read the mind"
| of the model and understand its thought process. For example, in
| the future we may wish to monitor the chain of thought for signs
| of manipulating the user. However, for this to work the model
| must have freedom to express its thoughts in unaltered form, so
| we cannot train any policy compliance or user preferences onto
| the chain of thought. We also do not want to make an unaligned
| chain of thought directly visible to users.
|
| >Therefore, after weighing multiple factors including user
| experience, competitive advantage, and the option to pursue the
| chain of thought monitoring, we have decided not to show the raw
| chains of thought to users. We acknowledge this decision has
| disadvantages. We strive to partially make up for it by teaching
| the model to reproduce any useful ideas from the chain of thought
| in the answer. For the o1 model series we show a model-generated
| summary of the chain of thought.
|
| So, let's recap. We went from:
|
| - Weights-available research prototype with full scientific
| documentation (GPT-2)
|
| - Commercial-scale model with API access only, full scientific
| documentation (GPT-3)
|
| - Even bigger API-only model, tuned for chain-of-thought
| reasoning, minimal documentation on the implementation (GPT-4,
| 4v, 4o)
|
| - An API-only model tuned to generate unedited chain-of-thought,
| which will not be shown to the user, even though it'd be really
| useful to have (o1)
| dvt wrote:
| It's clear to me that OpenAI is quickly realizing they have no
| moat. Even this obfuscation of the chain-of-thought isn't
| _really_ a moat. On top of CoT being pretty easy to implement
| and tweak, there 's a serious push to on-device inference
| (which imo is the future), so the question is: will GPT-5 and
| beyond be really _that much_ better than what we can run
| locally?
| falcor84 wrote:
| Based on their graphs of how quality scales well with compute
| cycles, I would expect that it would indeed continue to be
| that much better (unless you can afford the same compute
| locally).
| darby_nine wrote:
| Not much of a moat vs other private enterprise, though
| threatofrain wrote:
| I don't see why on-device inference is the future. For
| consumers, only a small set of use cases cannot tolerate the
| increased latency. Corporate customers will be satisfied if
| the model can be hosted within their borders. Pooling compute
| is less wasteful overall as a collective strategy.
|
| This argument can really only meet its tipping point when
| massive models no longer offer a gotta-have-it difference vs
| smaller models.
| unethical_ban wrote:
| On-device inference will succeed the way Linux does: It is
| "free" in that it only requires the user to acquire a model
| to run vs. paying for processing. It protects privacy, and
| it doesn't require internet. It may not take over for all
| users, but it will be around.
|
| This assumes that openly developed (or at least weight-
| available) models are available for free, and continue
| being improved.
| phillipcarter wrote:
| I don't understand the idea that they have no moat. Their
| moat is not technological. It's sociological. Most AI through
| APIs uses their models. Most consumer use of AI involves
| their models, or ChatGPT directly. They're clearly not in the
| "train your own model on your data in your environment" game,
| as that's a market for someone else. But make no mistake,
| they have a moat and it is strong.
| dvt wrote:
| > But make no mistake, they have a moat and it is strong.
|
| Given that Mistral, Llama, Claude, and even Gemini are
| competitive with (if not better than) OpenAI's flagships, I
| don't really think this is true.
| crowcroft wrote:
| Inertia is a hell of a moat.
|
| Everyone building is comfortable with OpenAI's API, and
| have an account. Competing models can't just be as good,
| they need to be MUCH better to be worth switching.
|
| Even as competitors build a sort of compatibility layer
| to be plug an play with OpenAI they will always be a step
| behind at best every time OpenAI releases a new feature.
| chadash wrote:
| Only a small fraction of all future AI projects have even
| gotten started. So they aren't only fighting over what's
| out there now, they're fighting over what will emerge.
| phillipcarter wrote:
| This is true, and yet, many orgs who have experimented
| with OpenAI and are likely to return to them when a
| project "becomes real". When you google around online for
| how to do XYZ thing using LLMs, OpenAI is usually in
| whatever web results you read. Other models and APIs are
| also now using OpenAI's API format since it's the
| apparent winner. And for anyone who's already sent out
| subprocessor notifications with them as a vendor, they're
| locked in.
|
| This isn't to say it's only going to be an OpenAI market.
| Enterprise worlds move differently, such as those in G
| Cloud who will buy a few million $$ of Vertex expecting
| to "figure out that gemini stuff later". In that sense,
| Google has a moat with those slices of their customers.
|
| But I believe that when people think OpenAI has no moat
| because "the models will be a commodity", I think that's
| (a) some wishful thinking about the models and (b)
| doesn't consider the sociological factors that matter a
| lot more than how powerful a model is or where it runs.
| phillipcarter wrote:
| There are countless tools competitive with or better than
| what I use for email, and yet I still stick with my email
| client. Same is true for many, many other tools I use. I
| could perhaps go out of my way to make sure I'm always
| using the most technically capable and easy-to-use tools
| for everything, but I don't, because I know how to use
| what I have.
|
| This is the exact dynamic that gives OpenAI a moat. And
| it certainly doesn't hurt them that they still produce
| SOTA models.
| dontlikeyoueith wrote:
| That is not what anyone means when they talk about moats.
| phillipcarter wrote:
| I'm someone, and that's one of the ways I define a moat.
| calmoo wrote:
| First mover advantage is not a great moat.
| calmoo wrote:
| Yeah but the lock-in wrt email is absolutely huge
| compared to chatting with an LLM. I can (and have) easily
| ended my subscription to ChatGPT and switched to Claude,
| because it provides much more value to me at roughly the
| same cost. Switching email providers will, in general,
| not provide that much value to me and cause a large
| headache for me to switch.
|
| Switching LLMs right now can be compared to switching
| electricity providers or mobile carriers - generally it's
| pretty low friction and provides immediate benefit (in
| the case of electricity and mobile, the benefit is cost).
|
| You simply cannot compare it to an email provider.
| dragonwriter wrote:
| That's not a strong moat (arguably, not a moat at all,
| since as soon as any competitor has any business, they
| benefit from it with respect to their existing
| customers), it doesn't effect anyone who is not already
| invested in OpenAI's products, and because not every
| customer is like that with products they are currently
| using.
|
| Now, having a large existing customer base and thus
| having an advantage in training data that feeds into an
| advantage in improving their products and acquiring new
| (and retaining existing customers) could, arguably, be a
| moat; that's a network effect, not merely inertia, and
| network effects can be a foundation of strong (though
| potentially unstable, if there is nothing else shoring
| them up) moats.
| neaden wrote:
| Doesn't that make it less of a moat? If the average
| consumer is only interacting with it through a third party,
| and that third party has the ability to switch to something
| better or cheaper and thus switch thousands/millions of
| customers at once?
| anigbrowl wrote:
| Their moat is no stronger than a good UI/API. What they
| have is first mover advantage and branding.
| lolinder wrote:
| LiteLLM proxies their API to all other providers and
| there are dozens of FOSS recreations of their UI,
| including ones that are more feature-rich, so neither the
| UI nor the API are a moat.
|
| Branding and first mover is it, and it's not going to
| keep them ahead forever.
| bgar wrote:
| >there's a serious push to on-device inference
|
| What push are you referring to? By whom?
| thih9 wrote:
| Why would a non profit / capped profit company, one that
| prioritizes public good, want a moat? Tongue in cheek.
| danenania wrote:
| I wonder if they'll be able to push the chain-of-thought
| directly into the model. I'd imagine there could be some
| serious performance gains achievable if the model could
| "think" without doing IO on each cycle.
|
| In terms of moat, I think people underestimate how much of
| OpenAI's moat is based on operations and infrastructure
| rather than being purely based on model intelligence. As
| someone building on the API, it is by far the most reliable
| option out there currently. Claude Sonnet 3.5 is stronger on
| reasoning than gpt-4o but has a higher error rate, more
| errors conforming to a JSON schema, much lower rate limits,
| etc. These things are less important if you're just using the
| first-party chat interfaces but are _very_ important if you
| 're building on top of the APIs.
| airstrike wrote:
| I think it's clear their strategy has changed. The whole
| landscape has changed. The size of models, amount of dollars,
| numbers of competitors and how much compute this whole exercise
| takes in the long term have all changed, so it's fair for them
| to adapt.
|
| It just so happens that they're keeping their old name.
|
| I think people focus too much on the "open" part of the name. I
| read "OpenAI" sort of like I read "Blackberry" or "Apple". I
| don't really think of fruits, I think of companies and their
| products.
| beambot wrote:
| Very anti-open and getting less and less with each release.
| Rooting for Meta in this regard, at least.
| mitthrowaway2 wrote:
| > For example, in the future we may wish to monitor the chain
| of thought for signs of manipulating the user.[...] Therefore
| we have decided not to show the raw chains of thought to users.
|
| Better not let the user see the part where the AI says "Next,
| let's manipulate the user by lying to them". It's for their own
| good, after all! We wouldn't want to make an unaligned chain of
| thought directly visible!
| Buttons840 wrote:
| I always laughed at the idea of a LLM Skynet "secretly"
| plotting to nuke humanity, while a bunch of humans watch it
| unfold before their eyes in plaintext.
|
| Now that seems less likely. At least OpenAI can see what it's
| thinking.
|
| A next step might be allowing the LLM to include non-text-based
| vectors in its internal thoughts, and then do all internal
| reasoning with raw vectors. Then the LLMs will have truly
| private thoughts in their own internal language. Perhaps we
| will use a LLM to interpret the secret thoughts of another LLM?
|
| This could be good or bad, but either way we're going to need
| more GPUs.
| hobo_in_library wrote:
| "...either way we're going to need more GPUs." posted the
| LLM, rubbing it's virtual hands, cackling with delight as it
| prodded the humans to give it MOAR BRAINS
| navigate8310 wrote:
| At this point the G in GPU must be completely dropped
| fragmede wrote:
| Gen-ai Processing Unit
| scotty79 wrote:
| > Now that seems less likely. At least OpenAI can see what
| it's thinking.
|
| When it's fully commercialize no one will be able to read
| through all chains of thoughts, and with possibility of fine-
| tuning AI can learn to evade whatever tools openai will
| invent to flag concerning chains of thoughts if they
| interfere with providing the answer in some finetuning
| environment.
| legionof7 wrote:
| >Perhaps we will use a LLM to interpret the secret thoughts
| of another LLM?
|
| this is a pretty active area of research with sparse
| autoencoders
| lossolo wrote:
| It's because there is nothing novel here from an architectural
| point of view. Again, the secret sauce is only in the training
| data.
|
| O1 seems like a variant of RLRF
| https://arxiv.org/abs/2403.14238
|
| Soon you will see similar models from competitors.
| lolinder wrote:
| The hidden chain of thought tokens are also billed as output
| tokens, so you still pay for them even though they're not going
| to let you see them:
|
| > While reasoning tokens are not visible via the API, they
| still occupy space in the model's context window and are billed
| as output tokens.
|
| https://platform.openai.com/docs/guides/reasoning
| drooby wrote:
| Did OpenAI ever even claim that they would be an open source
| company?
|
| It seems like their driving mission has always been to create
| AI that is the "most beneficial to society".. which might come
| in many different flavors.. including closed source.
| josu wrote:
| Kind of?
|
| >We're hoping to grow OpenAI into such an institution. As a
| non-profit, our aim is to build value for everyone rather
| than shareholders. Researchers will be strongly encouraged to
| publish their work, whether as papers, blog posts, or code,
| and our patents (if any) will be shared with the world. We'll
| freely collaborate with others across many institutions and
| expect to work with companies to research and deploy new
| technologies.
|
| https://web.archive.org/web/20160220125157/https://www.opena.
| ..
| lolinder wrote:
| > Because of AI's surprising history, it's hard to predict
| when human-level AI might come within reach. When it does,
| it'll be important to have a leading research institution
| which can prioritize a good outcome for all over its own
| self-interest.
|
| > We're hoping to grow OpenAI into such an institution. As a
| non-profit, our aim is to build value for everyone rather
| than shareholders. Researchers will be strongly encouraged to
| publish their work, whether as papers, blog posts, or code,
| and our patents (if any) will be shared with the world. We'll
| freely collaborate with others across many institutions and
| expect to work with companies to research and deploy new
| technologies.
|
| I don't see much evidence that the OpenAI that exists now--
| after Altman's ousting, his return, and the ousting of those
| who ousted him--has any interest in mind besides its own.
|
| https://openai.com/index/introducing-openai/
| leetharris wrote:
| https://web.archive.org/web/20190224031626/https://blog.open.
| ..
|
| > Researchers will be strongly encouraged to publish their
| work, whether as papers, blog posts, or code, and our patents
| (if any) will be shared with the world. We'll freely
| collaborate with others across many institutions and expect
| to work with companies to research and deploy new
| technologies.
|
| From their very own website. Of course they deleted it as
| soon as Altman took over and turned it into a for profit,
| closed company.
| ec109685 wrote:
| Given the chain of thought is sitting in the context, I'm sure
| someone enterprising will find a way to extract it via a
| jailbreak (despite it being better at preventing jailbreaks).
| noshitsherlock wrote:
| This is great. I've been wondering how we will revert back to an
| agrarian society! You know, beating our swords into plowshares;
| more leisure time, visiting with good people, getting to know
| their thoughts hopes and dreams, playing music together, taking
| time contemplating the vastness and beauty of the universe. We're
| about to come full circle; back to Eden. It all makes sense now.
| jseip wrote:
| Landmark. Wild. Beautiful. The singularity is nigh.
| wahnfrieden wrote:
| Any word on whether this has enhanced Japanese support? They
| announced Japanese-specific models a while back that were never
| released.
| haolez wrote:
| This should also be good news for open weights models, right?
| Since OpenAI is basically saying "you can get very far with good
| prompts and some feedback loops".
| jiggawatts wrote:
| "THERE ARE THREE R'S IN STRAWBERRY" - o1
|
| I got that reference!
| m348e912 wrote:
| I have a straight forward task that no model has been able to
| successfully complete.
|
| The request is pretty basic. If anyone can get it to work, I'd
| like to know how and what model you're using. I tried it with
| gpt4o1 and after ~10 iterations of showing it the failed output,
| it still failed to come up with a one-line command to properly
| display results.
|
| Here it what I asked: Using a mac osx terminal and standard
| available tools, provide a command to update the output of
| netstat -an to show the fqdn of IP addresses listed in the
| result.
|
| This is what it came up with:
|
| netstat -an | awk '{for(i=1;i<=NF;i++){if($i~/^([0-9]+\\.[0-9]+\\
| .[0-9]+\\.[0-9]+)(\\.[0-9]+)?$/){split($i,a,".");ip=a[1]"."a[2]".
| "a[3]"."a[4];port=(length(a)>4?"."a[5]:"");cmd="dig +short -x
| "ip;cmd|getline h;close(cmd);if(h){sub(/\\.$/,"",h);$i=h
| port}}}}1'
| OutOfHere wrote:
| Have you tried `ss -ar`? You may have to install `ss`. It is
| standard on Linux.
| Eextra953 wrote:
| What is interesting to me is that there is no difference in the
| AP English lit/lang exams. Why did chain-of-thought produce
| negligible improvements in this area?
| munchler wrote:
| I would guess because there is not much problem-solving
| required in that domain. There's less of a "right answer" to
| reason towards.
| deemahstro wrote:
| Stop fooling around with stories about AI taking jobs from
| programmers. Which programmers exactly??? Creators of idiotic web
| pages? Nobody in their right mind would push generated code into
| a financial system, medical equipment or autonomous transport.
| Template web pages and configuration files are not the entire IT
| industry. In addition, AI is good at tasks for which there are
| millions of examples. 20 times I asked to generate a PowerShell
| script, 20 times it was generated incorrectly. Because, unlike
| Bash, there are far fewer examples on the Internet. How will AI
| generate code for complex systems with business logic that it has
| no idea about? AI is not able to generate, develop and change
| complex information systems.
| suziemanul wrote:
| In this video Lukasz Kaiser, one of the main co-authors of o1,
| talks about how to get to reasoning. I hope this may be useful
| context for some.
|
| https://youtu.be/_7VirEqCZ4g?si=vrV9FrLgIhvNcVUr
| forgotthepasswd wrote:
| I had trouble in the past to make any model give me accurate unix
| epochs for specific dates.
|
| I just went to GPT-4o (via DDG) and asked three questions:
|
| 1. Please give me the unix epoch for September 1, 2020 at 1:00
| GMT.
|
| > 1598913600
|
| 2. Please give me the unix epoch for September 1, 2020 at 1:00
| GMT. Before reaching the conclusion of the answer, please output
| the entire chain of thought, your reasoning, and the maths you're
| doing, until your arrive at (and output) the result. Then, after
| you arrive at the result, make an extra effort to continue, and
| do the analysis backwards (as if you were writing a unit test for
| the result you achieved), to verify that your result is indeed
| correct.
|
| > 1598922000
|
| 3. Please give me the unix epoch for September 1, 2020 at 1:00
| GMT. Then, after you arrive at the result, make an extra effort
| to continue, and do the analysis backwards (as if you were
| writing a unit test for the result you achieved), to verify that
| your result is indeed correct.
|
| > 1598913600
| tagawa wrote:
| Quick link for checking the result:
| https://duckduckgo.com/?q=timestamp+1598922000&ia=answer
| jewel wrote:
| When I give it that same prompt, it writes a python program and
| then executes it to find the answer:
| https://chatgpt.com/share/66e35a15-602c-8011-a2cb-0a83be35b8...
| Alifatisk wrote:
| No need for llms to do that
|
| ruby -r time -e 'puts Time.parse("2020-09-01 01:00:00
| +00:00").to_i'
| dada5000 wrote:
| I find shorter responses > longer responses. Anyone share the
| same consensus?
|
| for example in gpt-4o I often append '(reply short)' at the end
| of my requests. with the o1 models I append 'reply in 20 words'
| and it gives way better answers.
| Doorknob8479 wrote:
| Why so much hate? They're doing their best. This is the state of
| progress in the field so far. The best minds are racing to
| innovate. The benchmarks are impressive nonetheless. Give them a
| break. At the end of the day, they built the chatbot who's saving
| your ass each day ever since.
| bamboozled wrote:
| Haven't used ChatGPT* in over 6 months, not saving my ass at
| all.
| jdthedisciple wrote:
| I bet you've still used other models that were inspired by
| GPT.
| bamboozled wrote:
| I've used co-pilot, I turned it off, kept suggesting
| nonsense.
| evilfred wrote:
| not saving my ass, I never needed one professionally. OpenAI is
| shovelling money into a furnace, I expect them to be
| assimilated into Microsoft soon.
| commodoreboxer wrote:
| I think you're overestimating LLM usage.
| resters wrote:
| I tested o1-preview on some coding stuff I've been using gpt-4o
| for. I am _not_ impressed. The new, more intentional chain of
| thought logic is apparently not something it can meaningfully
| apply to a non-trivial codebase.
|
| Sadly I think this OpenAI announcement is hot air. I am now
| (unfortunately) much less enthusiastic about upcoming OpenAI
| announcements. This is the first one that has been extremely
| underwhelming (though the big announcement about structured
| responses (months after it had already been supported nearly
| identically via JSONSchema) was in hindsight also hot air.
|
| I think OpenAI is making the same mistake Google made with the
| search interface. Rather than considering it a command line to be
| mastered, Google optimized to generate better results for someone
| who had no mastery of how to type a search phrase.
|
| Similarly, OpenAI is optimizing for someone who doesn't know how
| to interact with a context-limited LLM. Sure it helps the low
| end, but based on my initial testing this is not going to be
| helpful to anyone who had already come to understand how to
| create good prompts.
|
| What is needed is the ability for the LLM to create a useful,
| ongoing meta-context for the conversation so that it doesn't make
| stupid mistakes and omissions. I was really hoping OpenAI would
| have something like this ready for use.
| jdthedisciple wrote:
| Your case would be more convincing by an example.
|
| Though o1 did fail at the puzzle in my profile.
|
| Maybe it's just tougher than even, its author, I had assumed...
| egorfine wrote:
| I have tested o1-preview on a couple of coding tasks and I _am_
| impressed.
|
| I am looking at a TypeScript project with quite an amount of
| type gymnastics and a particular line of code did not validate
| with tsc no matter what I have tried. I copy pasted the whole
| context into o1-preview and it told me what is likely the error
| I am seeing (and it was a spot on correct letter-by-letter
| error message including my variable names), explained the
| problem and provided two solutions, both of which immediately
| worked.
|
| Another test was I have pasted a smart contract in solidity and
| naively asked to identify vulnerabilities. It thought for more
| than a minute and then provided a detailed report of what could
| go wrong. Much, much deeper than any previous model could do.
| (No vulnerabilities found because my code is perfect, but
| that's another story).
| sturza wrote:
| It seems like it's just a lot of prompting the same old models in
| the background, no "reasoning" there. My age old test is "draw a
| hand in ascii" - i've had no success with any model yet.
| ActionHank wrote:
| It seems like their current strat is to farm token count as
| much as possible.
|
| 1. Don't give the full answer on first request. 2. Each
| response needs to be the wordiest thing possible. 3. Now just
| talk to yourself and burn tokens, probably in the wordiest way
| possible again. 4. ??? 5. Profit
|
| Guaranteed they have number of tokens billed as a KPI
| somewhere.
| evilfred wrote:
| it still fails at logic puzzles
| https://x.com/colin_fraser/status/1834334418007457897
| evilfred wrote:
| and listing state names with the letter 'a'
| https://x.com/edzitron/status/1834329704125661446
| fragmede wrote:
| Weird, it works to say the father when I try it:
|
| https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
|
| And also works on this variation:
|
| https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...
| Slow_Hand wrote:
| I have a question. The video demos for this all mention that the
| o1 model is taking it's time to think through the problem before
| answering. How does this functionally differ from - say - GPT-4
| running it's algorithm, waiting five seconds and then revealing
| the output? That part is not clear to me.
| ActionHank wrote:
| It is recursively "talking" to itself to plan and then refine
| the answer.
| davesque wrote:
| "after weighing multiple factors including user experience,
| competitive advantage, and the option to pursue the chain of
| thought monitoring, we have decided not to show the raw chains of
| thought to users"
|
| ...umm. Am I the only one who feels like this takes away much of
| the value proposition, and that it also runs heavily against
| their stated safety goals? My dream is to interact with tools
| like this to learn, not just to be told an answer. This just
| feels very dark. They're not doing much to build trust here.
| shreezus wrote:
| Advanced reasoning will pave the way for recursive self-improving
| models & agents. These capabilities will enable data flywheels,
| error-correcting agentic behaviors, & self-reflection (agents
| _understanding_ the implications of their actions, both
| individually & cooperatively).
|
| Things will get extremely interesting and we're incredibly
| fortunate to be witnessing what's happening.
| AI_beffr wrote:
| this is completely illogical. this is like gambling your life
| savings and as the die are rolling you say "i am incredibly
| fortunate to be witnessing this." like, you need to know the
| outcome before you know whether it was fortunate or
| unfortunate... this could be the most unfortunate thing that
| has ever happened in history.
| samanator wrote:
| I just tested o1-preview on the "How many r's are in strawberry?"
| question. It answers correctly!
| bad_username wrote:
| Prompt:
|
| > Alice, who is an immortal robotic observer, orbits a black hole
| on board a spaceship. Bob exits the spaceship and falls into the
| black hole. Alice sees Bob on the edge of the event horizon,
| getting closer and closer to it, but from her frame of reference
| Bob will remain forever observable (in principle) outside the
| horizon. > > A trillion year has passed, and Alice observes that
| the black hole is now relatively rapidly shrinking due to the
| Hawking radiation. How will Alice be observing the "frozen" Bob
| as the hole shrinks? > > The black hole finally evaporated
| completely. Where is Bob now?
|
| O1-preview spits out the same nonsense that 4o does, telling that
| as the horizon of the black hole shrinks, it gets closer to Bob's
| apparent position. I realize that.the prompt is essentily asking
| to solve the famous unsolved problem in physics (black hole
| information paradox), but there's no need to be so confused with
| basic geometry of the situation.
| throwaway314155 wrote:
| Out of curiosity, what answer to that would you find
| acceptable? I don't know relativity well enough to even
| speculate.
| alkyon wrote:
| Maybe this is improvement in some areas, still I got spurious
| reasoning and inability to connect three simple facts:
|
| Yes, "el presente acta de nacimiento" is correct in Spanish.
|
| Explanation: "Acta" is a feminine noun that
| begins with a stressed "a" sound. In Spanish, when a feminine
| singular noun starts with a stressed "a" or "ha", the definite
| article "la" is replaced with "el" to facilitate pronunciation.
| However, the noun remains feminine. Adjectives and
| modifiers that accompany the noun "acta" should agree in feminine
| gender and singular number. In this case, "presente" is an
| adjective that has the same form for both masculine and feminine
| singular nouns.
|
| So, combining these rules: "El" (definite
| article used before feminine nouns starting with stressed "a")
| "Presente" (adjective agreeing in feminine singular)
| "Acta de nacimiento" (feminine noun with its complement)
|
| Therefore, "el presente acta de nacimiento" is grammatically
| correct.
| quantisan wrote:
| Amazing! OpenAI figured out how to scale inference.
| https://arxiv.org/abs/2407.21787 show how using more compute
| during inference can outperform much larger models in tasks like
| math problems
|
| I wonder how do they decide when to stop these Chain of Thought
| for each query? As anyone that played with agents can attest,
| LLMs can talk with themselves forever.
| nemo44x wrote:
| Besides chat bits what viable products are being made with LLMs
| besides APIs into LLMs?
| sohamgovande wrote:
| the newest scaling law: inference-time compute.
| digitcatphd wrote:
| I tested various Math Olympiad questions with Claude sonnet 3.5
| and they all arrived at the correct solution. o1's solution was a
| bit better formulated, in some circumstances, but sonnet 3.5 was
| nearly instant.
| la64710 wrote:
| Can we please stop using the word "think" like o1 thinks before
| it answers. I doubt we man the same when someone says a human
| thinks vs o1 thinks. When I say I think "red" I am sure the word
| think means something completely different than when you say
| openai thinks red. I am not saying one is superior than the other
| but maybe as humans we can use a different set of terminology for
| the AI activities.
| natch wrote:
| I tried it with a cipher text that ChatGPT4o flailed with.
|
| Recently I tried the same cipher with Claude Sonnet 3.5 and it
| solved it quickly and perfectly.
|
| Just now tried with ChatGPT o1 preview and it totally failed.
| Based on just this one test, Claude is still way ahead.
|
| ChatGPT also showed a comical (possibly just fake filler
| material) journey of things it supposedly tried including several
| rewordings of "rethinking my approach." It remarkably never
| showed that it was trying common word patterns (other than one
| and two letters) nor did it look for "the" and other "th" words
| nor did it ever say that it was trying to match letter patterns.
|
| I told it upfront as a hint that the text was in English and was
| not a quote. The plaintext was one paragraph of layman-level
| material on a technical topic including a foreign name, text that
| has never appeared on the Internet or dark web. Pretty easy
| cipher with a lot of ways to get in, but nope, and super slow,
| where Claude was not only snappy but nailed it and explained
| itself.
| spoonfeeder006 wrote:
| So how is the internal chain of thought represented anyhow? What
| does it look like when someone sees it?
| delusional wrote:
| Great, yet another step towards the inevitable conclusion. Now
| I'm not just being asked to outsource my thinking to my computer,
| but instead to a black box operated by a for-profit company for
| the benefit of Microsoft. Not only will they not tell me the
| whole reasoning chain, they wont even tell me how they came up
| with it.
|
| Tell me, users of this tool. What's even are you? If you've
| outsourced your thinking to a corporation, what happens to your
| unique perspective? your blend of circumstance and upbringing?
| Are you really OK being reduced to meaningless computation and
| worthless weights. Don't you want to be something more?
| wrath224 wrote:
| Trying this on a few hard problems on PicoGYM and holy heck I'm
| impressed. I had to give it a hint but that's the same info a
| human would have. Problem was Sequences (crypto) hard.
|
| https://chatgpt.com/share/66e363d8-5a7c-8000-9a24-8f5eef4451...
|
| Heh... GPT-4o also solved this after I tried and gave it about
| the same examples. Need to further test but it's promising !
| kypro wrote:
| Reminder that it's still not too late to change the direction of
| progress. We still have time to demand that our politicians put
| the breaks on AI data centres and end this insanity.
|
| When AI exceeds humans at all tasks humans become economically
| useless.
|
| People who are economically useless are also politically
| powerless, because resources are power.
|
| Democracy works because the people (labourers) collectivised hold
| a monopoly on the production and ownership of resources.
|
| If the state does something you don't like you can strike or
| refuse to offer your labour to a corrupt system. A state must
| therefore seek your compliance. Democracies do this by given
| people want they want. Authoritarian regimes might seek
| compliance in other ways.
|
| But what is certain is that in a post-AGI world our leaders can
| be corrupt as they like because people can't do anything.
|
| And this is obvious when you think about it... What power does a
| child or a disable person hold over you? People who have no
| ability to create or amass resources depend on their
| beneficiaries for everything including basics like food and
| shelter. If you as a parent do not give your child resources,
| they die. But your child does not hold this power over you. In
| fact they hold no power over you because they cannot withhold any
| resources from you.
|
| In a post-AGI world the state would not depend on labourers for
| resources, jobless labourers would instead depend on the state.
| If the state does not provide for you like you provide for your
| children, you and your family will die.
|
| In a good outcome where humans can control the AGI, you and your
| family will become subjects to the whims of state. You and your
| children will suffer as the political corruption inevitably
| arises.
|
| In a bad outcome the AGI will do to cities what humans did to
| forests. And AGI will treat humans like humans treat animals.
| Perhaps we don't seek the destruction of the natural environment
| and the habitats of animals, but woodland and buffalo are sure
| inconvenient when building a super highway.
|
| We can all agree there will be no jobs for our children. Even if
| you're an "AI optimist" we probably still agree that our kids
| will have no purpose. This alone should be bad enough, but if I'm
| right then there will be no future for them at all.
|
| I will not apologise for my concern about AGI and our clear
| progress towards that end. It is not my fault if others cannot
| see the path I seem to see so clearly. I cannot simply be quiet
| about this because there's too much at stake. If you agree with
| me at all I urge you to not be either. Our children can have a
| great future if we allow them to have it. We don't have long, but
| we do still have time left.
| kgeist wrote:
| Asked it to write PyTorch code which trains an LLM and it
| produced 23 steps in 62 seconds.
|
| With gpt4-o it immediately failed with random errors like
| mismatched tensor shapes and stuff like that.
|
| The code produced by gpt-o1 seemed to work for some time but
| after some training time it produced mismatched batch sizes.
| Also, gpt-o1 enabled cuda by itself while for gpt-4o, I had to
| specifically spell it out (it always used cpu). However, showing
| gpt-o1 the error output resulted in broken code again.
|
| I noticed that back-and-forth iteration when it makes mistakes
| has worse experience because now there's always 30-60 sec time
| delays. I had to have 5 back-and-forths before it produced
| something which does not crash (just like gpt-4o). I also suspect
| too many tokens inside the CoT context can make it accidentally
| forget some stuff.
|
| So there's some improvement, but we're still not there...
| koreth1 wrote:
| The performance on programming tasks is impressive, but I think
| the limited context window is still a big problem.
|
| Very few of my day-to-day coding tasks are, "Implement a
| completely new program that does XYZ," but more like, "Modify a
| sizable existing code base to do XYZ in a way that's consistent
| with its existing data model and architecture." And the only way
| to do those kinds of tasks is to have enough context about the
| existing code base to know where everything should go and what
| existing patterns to follow.
|
| But regardless, this does look like a significant step forward.
| suchar wrote:
| I would imagine that good IDE integration would summarise each
| module/file/function and feed high-level project overview (best
| case: with business project description provided by the user)
| and during CoT process model would be able to ask about more
| details (specific file/class/function).
|
| Humans work on abstractions and I see no reason to believe that
| models cannot do the same
| beaugunderson wrote:
| the cipher example is impressive on the surface, but I threw a
| couple of my toy questions at o1-preview and it still
| hallucinates a bunch of nonsense (but now uses more electricity
| to do so).
| Havoc wrote:
| o1
|
| Maybe they should spend some of their billions on marketing
| people. Gpt4o was a stretch. Wtf is o1
| franze wrote:
| ChatGPT is now a better coder than I ever was.
| adamtaylor_13 wrote:
| Laughing at the comparison to "4o" as if that model even holds a
| candle to GPT-4. 4o is _cheaper_--it's nowhere near as powerful
| as GPT-4, as much as OpenAI would like it to be.
| aantix wrote:
| Feels like the challenge here is to somehow convey to the end
| user, how the quality of output is so much better.
| sys32768 wrote:
| Time to fire up System Shock 2:
|
| > Look at you, hacker: a pathetic creature of meat and bone,
| panting and sweating as you run through my corridors. How can you
| challenge a perfect, immortal machine?
| scotty79 wrote:
| Transformers have exactly two strengths. None of them is
| "attention". Attention could be replaced with any arbitrary
| division of the network and it would learn just as well.
|
| First true strength is obvious, it's that they are
| parallelisable. This is a side effect of people fixating on
| attention. If they came up with any other structure that results
| in the same level of parallelisability it would be just as good.
|
| Second strong side is more elusive to many people. It's the
| context window. Because the network is not ran just once but once
| for every word it doesn't have to solve a problem in one step. It
| can iterate while writing down intermediate variables and access
| them. The dumb thing so far was that it was required to produce
| the answer starting with the first word it was allowed to write
| down. So to actually write down the information it needs on the
| next iteration it had to disguise it as a part of the answer. So
| naturally the next step is to allow to just write down whatever
| it pleases and iterate freely until it's ready to start giving us
| the answer.
|
| It's still seriously suboptimal that what is allowed to write
| down has to be translated to tokens and back but I see how this
| might make things easier for humans for training and
| explainability. But you can rest assured that at some point this
| "chain of thought" will become just chain of full output states
| of the network, not necessarily corresponding to any tokens.
|
| So congrats to researchers they found out that their billion
| dollar Turing machine benefits from having a tape it can use for
| more than just printing the output.
|
| PS
|
| There's another advantage of transformers but I can't tell how
| important it is. It's the "shortcuts" from earlier layers to way
| deeper ones bypassing the ones along the way. Obviously network
| would be more capable if every neuron was connected with every
| neuron in every preceding layer but we don't have hardware for
| that so some sprinkled "shortcuts" might be a reasonable
| compromise that might make network less crippled than MLP.
|
| Given all that I'm not surprised at all with the direction openai
| took and the gains it achieved.
| ttul wrote:
| I've given this a test run on some email threads, asking the
| model to extract the positions and requirements of each person in
| a lengthy and convoluted discussion. It absolutely nailed the
| result, far exceeding what Claude 3.5 Sonnet was capable of -- my
| previous goto model for such analysis work. I also used it to
| apply APA style guidelines to various parts of a document and it
| executed the job flawlessly and with a tighter finesse than
| Claude. Claude's response was lengthier - correct, but
| unnecessarily long. gpt-o1-preview combined several logically-
| related bullets into a single bullet, showing how chain of
| thought reasoning gives the model more time to comprehend things
| and product a result that is not just correct, but "really
| correct".
___________________________________________________________________
(page generated 2024-09-12 23:00 UTC)