[HN Gopher] RLHF is just barely RL
       ___________________________________________________________________
        
       RLHF is just barely RL
        
       Author : tosh
       Score  : 352 points
       Date   : 2024-08-08 06:22 UTC (16 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | normie3000 wrote:
       | > In machine learning, reinforcement learning from human feedback
       | (RLHF) is a technique to align an intelligent agent to human
       | preferences.
       | 
       | https://en.m.wikipedia.org/wiki/Reinforcement_learning_from_...
        
         | moffkalast wrote:
         | Note that human preference isn't universal. RLHF is mostly
         | frowned upon by the open source LLM community since it
         | typically involves aligning the model to the preference of
         | corporate manager humans, i.e. tuning for censorship and
         | political correctness to make the model as bland as possible so
         | the parent company doesn't get sued.
         | 
          | For actual reinforcement learning with a feedback loop that
          | aims to increase overall performance, the current techniques
          | are SPPO and Meta's variant of it [0], which slightly
          | outperforms it. It involves using a larger LLM as a judge
          | though, so the accuracy of the results is somewhat dubious.
         | 
         | [0] https://arxiv.org/pdf/2407.19594
        
       | codeflo wrote:
       | Reminder that AlphaGo and its successors have _not_ solved Go and
       | that reinforcement learning still sucks when encountering out-of-
       | distribution strategies:
       | 
       | https://arstechnica.com/information-technology/2023/02/man-b...
        
         | thanhdotr wrote:
         | well, as yann lecun said :
         | 
         | "Adversarial training, RLHF, and input-space contrastive
         | methods have limited performance. Why? Because input spaces are
         | _BIG_. There are just too many ways to be wrong " [1]
         | 
          | A way to solve the problem is to project onto latent space and
          | then try to discriminate/predict the best action down there.
         | There's much less feature correlation down in latent space than
         | in your observation space. [2]
         | 
         | [1]:https://x.com/ylecun/status/1803696298068971992 [2]:
         | https://openreview.net/pdf?id=BZ5a1r-kVsf
        
         | viraptor wrote:
         | I wouldn't say it sucks. You just need to keep training it for
         | as long as needed. You can do adversarial techniques to
         | generate new paths. You can also use the winning human
         | strategies to further improve. Hopefully we'll find better
         | approaches, but this is extremely successful and far from
         | sucking.
         | 
         | Sure, Go is not solved yet. But RL is just fine continuing to
         | that asymptote for as long as we want.
         | 
          | The funny part is that this applies to people too. Masters
          | don't like to play low-ranked people because they're
          | unpredictable and the Elo loss for them is not worth the risk.
          | (Which does raise questions about how we really rank people.)
        
           | shakna wrote:
           | > I wouldn't say it sucks. You just need to keep training it
           | for as long as needed.
           | 
           | As that timeline can approach infinity, just adding extra
           | training may not actually be a sufficient compromise.
        
       | rocqua wrote:
       | Alphago didn't have human feedback, but it did learn from humans
       | before surpassing them. Specifically, it had a network to
       | 'suggest good moves' that was trained on predicting moves from
       | pro level human games.
       | 
       | The entire point of alpha zero was to eliminate this human
       | influence, and go with pure reinforcement learning (i.e. zero
       | human influence).
        
         | cherryteastain wrote:
         | A game like Go has a clearly defined objective (win the game or
          | not). A network like the one you described can therefore be
          | trained to give a score to each move. The point here is that
          | assessing whether a given sentence sounds good to humans does
          | not have a clearly defined objective; the only way we have come
          | up with so far is to ask real humans.
        
         | esjeon wrote:
          | AlphaGo is an optimization over a closed problem.
          | Theoretically, computers could always have beaten humans at
          | such problems. It's just that, without proper optimization,
          | humans will die before the computer finishes its computation.
          | Here, AlphaGo cuts down the computation time by smartly
          | choosing the branches with the highest likelihood.
          | 
          | Unlike the above, open problems can't be solved by computation
          | (in the combinatorial sense). Even humans can only _try_, and
          | LLMs spew out something that would most likely work, not
          | something inherently correct.
        
       | __loam wrote:
       | Been shouting this for over a year now. We're training AI to be
       | convincing, not to be actually helpful. We're sampling the wrong
       | distributions.
        
         | danielbln wrote:
         | I find them very helpful, personally.
        
           | sussmannbaka wrote:
           | Understandable, they have been trained to convince you of
           | their helpfulness.
        
             | danielbln wrote:
             | If they convinced me of their helpfulness, and their output
             | is actually helpful in solving my problems.. well, if it
             | walks like a duck and quacks like a duck, and all that.
        
               | tpoacher wrote:
               | if it walks like a duck and it quacks like a duck, then
               | it lacks strong typing.
        
               | Nullabillity wrote:
               | "Appears helpful" and "is helpful" are two very different
               | properties, as it turns out.
        
               | snapcaster wrote:
               | Sometimes, but that's an edge case that doesn't seem to
               | impact the productivity boosts from LLMs
        
               | __loam wrote:
               | It doesn't until it does. Productivity isn't the only or
               | even the most important metric, at least in software dev.
        
               | snapcaster wrote:
               | Can you be more specific with like examples or something?
        
             | exe34 wrote:
             | https://xkcd.com/810/
        
             | djeastm wrote:
             | This is true, but part of that convincing is actually
             | providing at least some amount of response that is helpful
             | and moving you forward.
             | 
             | I have to use coding as an example, because that's 95% of
             | my use cases. I type in a general statement of the problem
             | I'm having and within seconds, I get back a response that
             | speaks my language and provides me with some information to
             | ingest.
             | 
              | Now, I don't know for sure if every sentence I read in
              | the response is correct, but let's say that 75% of what I
             | read aligns with what I currently know to be true. If I
             | were to ask a real expert, I'd possibly understand or
             | already know 75% of what they're telling me, as well, with
             | the other 25% still to be understood and thus trusting the
             | expert.
             | 
             | But either with AI or a real expert, for coding at least,
             | that 25% will be easily testable. I go and implement and
             | see if it passes my test. If it does, great. If not, at
             | least I have tried something and gotten farther down the
             | road in my problem solving.
             | 
             | Since AI generally does that for me, I am convinced of
             | their helpfulness because it moves me along.
        
         | dgb23 wrote:
         | Depends on who you ask.
         | 
          | Advertising and propaganda are not necessarily helpful for
          | consumers; they just need to be convincing in order to be
          | helpful for producers.
        
           | khafra wrote:
           | It would be interesting to see RL on a chatbot that's the
           | last stage of a sales funnel for some high-volume item--it'd
           | have fast, real-world feedback on how convincing it is, in
           | the form of a purchase decision.
        
         | iamatoool wrote:
         | Sideways eye look at leetcode culture
        
         | tpoacher wrote:
         | s / AI /
         | Marketing|Ads|Consultants|Experts|Media|Politicians|...
        
         | HarHarVeryFunny wrote:
         | If what you want is auto-complete (e.g. CoPilot, or natural
         | language search) then LLMs are built for that, and useful.
         | 
          | If what you want is AGI, then design an architecture with the
          | necessary moving parts! The current approach reminds me of the
          | joke about the drunk looking for his dropped car keys under the
          | street lamp because "it's bright here", rather than near where
          | he actually dropped them. It seems folk have spent years trying
          | to come up with alternate learning mechanisms to gradient descent
         | (or RL), and having failed are now trying to use SGD/pre-
         | training for AGI "because it's what we've got", as opposed to
         | doing the hard work of designing the type of always-on online
         | learning algorithm that AGI actually requires.
        
           | iamatoool wrote:
            | The SGD/pre-training/deep learning/transformer local maximum
            | is profitable. Trying new things is not, so you are relying
           | on researchers making a breakthrough, but then to make a blip
           | you need a few billion to move the promising model into
           | production.
           | 
           | The tide of money flow means we are probably locked into
           | transformers for some time. There will be transformer ASICs
           | built for example in droves. It will be hard to compete with
           | the status quo. Transformer architecture == x86 of AI.
        
       | timthelion wrote:
        | Karpathy writes that there is no cheaply computed objective check
        | for "Or re-writing some Java code to Python?", among other
        | things. But it seems to me that reinforcement learning should be
        | possible for code translation using automated integration
        | testing. Run it, see if it does the same thing!
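        | 
        | Something like this, roughly (a sketch; the file names and the
        | assumption that both programs read stdin and write stdout are
        | made up):
        | 
        |     import subprocess
        | 
        |     def run(cmd, stdin_text):
        |         # Run one program on one input; None means it failed.
        |         try:
        |             proc = subprocess.run(cmd, input=stdin_text, text=True,
        |                                   capture_output=True, timeout=5)
        |             return proc.stdout if proc.returncode == 0 else None
        |         except subprocess.TimeoutExpired:
        |             return None
        | 
        |     def translation_reward(test_inputs):
        |         # Reward = fraction of inputs where the Python translation
        |         # matches the output of the original Java program.
        |         matches = 0
        |         for stdin_text in test_inputs:
        |             expected = run(["java", "Original"], stdin_text)
        |             actual = run(["python3", "translated.py"], stdin_text)
        |             if expected is not None and expected == actual:
        |                 matches += 1
        |         return matches / len(test_inputs)
        | 
        |     # e.g. translation_reward(["1 2\n", "3 4\n", "-5 10\n"])
        | 
        | The tricky part, as pointed out below, is how to score the runs
        | that are only partially right.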
        
         | jeroenvlek wrote:
          | My takeaway is that it's difficult to make a "generic enough"
          | evaluation that encompasses all the things we use an LLM for,
          | e.g. code, summaries, jokes. Something about no free lunches.
        
         | IanCal wrote:
          | Even there, how you score it is hard.
          | 
          | "Is it the same for this set of inputs?" may be fine for a
         | subset of things, but then that's a binary thing. If it's
         | slightly wrong do you score by number of outputs that match? A
         | purely binary thing gives little useful help for nudging a
         | model in the right direction. How do you compare two that both
         | work, which is more "idiomatic"?
        
           | theqwxas wrote:
           | I agree that it's a very difficult problem. I'd like to
           | mention AlphaDev [0], an RL algorithm that builds other
           | algorithms, there they combined the measure of correctness
           | and a measure of algorithm speed (latency) to get the reward.
           | But the algorithms they built were super small (e.g., sorting
           | just three numbers), therefore they could measure correctness
           | using all input combinations. It is still unclear how to
           | scale this to larger problems.
           | 
           | [0] https://deepmind.google/discover/blog/alphadev-discovers-
           | fas...
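            | 
            | As a rough sketch (not AlphaDev's actual reward; the weight
            | and the callable interface are made up), the combination
            | could look like:
            | 
            |     import time
            | 
            |     def combined_reward(program, test_cases, latency_weight=0.01):
            |         # `program` is a candidate callable; `test_cases` is a
            |         # list of (args, expected) pairs covering every input
            |         # combination, which only works for tiny problems.
            |         correct = 0
            |         start = time.perf_counter()
            |         for args, expected in test_cases:
            |             try:
            |                 if program(*args) == expected:
            |                     correct += 1
            |             except Exception:
            |                 pass  # a crash counts as incorrect
            |         elapsed = time.perf_counter() - start
            |         correctness = correct / len(test_cases)
            |         return correctness - latency_weight * elapsed
            | 
            |     # e.g. combined_reward(sort3, [((3, 1, 2), (1, 2, 3)),
            |     #                              ((2, 2, 1), (1, 2, 2))])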
        
           | exe34 wrote:
           | for "does it run" cases, you can ask the model to try again,
           | give it higher temperature, show it the traceback errors,
           | (and maybe intermediate variables?) or even ask it to break
           | up the problem into smaller pieces and then try to translate
           | that.
           | 
           | for testing, if you use something like quickcheck, you might
           | find bugs that you wouldn't otherwise find.
           | 
           | when it comes to idiomatic, I'm not sure - but if we're at
           | the point that gpt is writing code that works, do we really
           | care? as long as this code is split into many small pieces,
           | we can just replace the piece instead of trying to
           | understand/fix it if we can't read it. in fact, maybe there's
           | a better language that is human readable but better for
           | transformers to write and maintain.
        
             | IanCal wrote:
             | For "does it run" I'm not talking about how do we test that
             | it does, but how do we either score or compare two+
             | options?
             | 
             | > when it comes to idiomatic, I'm not sure - but if we're
             | at the point that gpt is writing code that works, do we
             | really care?
             | 
             | Yes - it's certainly preferable. You may prefer working
             | over neat, but working and neat over working but insane
             | spaghetti code.
             | 
             | Remember this is about training the models, not about using
             | them later. How do we tell, while training, which option
             | was better to push it towards good results?
        
         | fulafel wrote:
         | "Programs must be written for people to read, and only
         | incidentally for machines to execute." -- Harold Abelson
        
           | exe34 wrote:
           | "programs written by LLMs must run correctly and only
           | incidentally be human readable." - Me.
        
             | alex_suzuki wrote:
             | "WTF?!" - engineer who has to troubleshoot said programs.
        
               | exe34 wrote:
               | "given the updated input and output pairs below, generate
               | code that would solve the problem."
        
         | msoad wrote:
          | A program like
          | 
          |     function add(a, b) {
          |         return 4
          |     }
          | 
          | Passes the test
        
           | falcor84 wrote:
            | I suppose you're alluding to xkcd's joke about this [0],
            | which is indeed a good one, but what test does this actually
            | pass?
            | 
            | The approach I was thinking of is that assuming we start with
            | the Java program:
            | 
            |     public class Addition {
            |         public static int add(int a, int b) {
            |             return a + b;
            |         }
            |     }
            | 
            | We can semi-automatically generate a basic test runner with
            | something like this, generating some example inputs
            | automatically:
            | 
            |     public class Addition {
            |         public static int add(int a, int b) {
            |             return a + b;
            |         }
            | 
            |         public static class AdditionAssert {
            |             private int a;
            |             private int b;
            | 
            |             public AdditionAssert a(int a) {
            |                 this.a = a;
            |                 return this;
            |             }
            | 
            |             public AdditionAssert b(int b) {
            |                 this.b = b;
            |                 return this;
            |             }
            | 
            |             public void assertExpected(int expected) {
            |                 int result = add(a, b);
            |                 assert result == expected :
            |                     "Expected " + expected + " but got " + result;
            |                 System.out.println("Assertion passed for " + a
            |                     + " + " + b + " = " + result);
            |             }
            |         }
            | 
            |         public static void main(String[] args) {
            |             new AdditionAssert().a(5).b(3).assertExpected(8);
            |             new AdditionAssert().a(-1).b(4).assertExpected(3);
            |             new AdditionAssert().a(0).b(0).assertExpected(0);
            |             System.out.println("All test cases passed.");
            |         }
            |     }
            | 
            | Another bit of automated preparation would then automatically
            | translate the test cases to Python, and then the actual LLM
            | would need to generate a python function until it passes all
            | the translated test cases:
            | 
            |     def add(a, b):
            |         return 4
            | 
            |     def addition_assert(a, b, expected):
            |         result = add(a, b)
            |         assert result == expected, \
            |             f"Expected {expected} but got {result}"
            | 
            |     addition_assert(a=5, b=3, expected=8)
            |     addition_assert(a=-1, b=4, expected=3)
            |     addition_assert(a=0, b=0, expected=0)
            | 
            | It might not be perfect, but I think it's very feasible and
            | can get us close to there.
            | 
            | [0] https://xkcd.com/221/
        
         | WithinReason wrote:
         | yes but that's not cheaply computed. You need good test
         | coverage, etc.
        
       | leobg wrote:
       | A cheap DIY way of achieving the same thing as RLHF is to fine
       | tune the model to append a score to its output every time.
       | 
        | Remember: the reason we need RLHF at all is that we cannot write
        | a loss function for what makes a good answer. There are just too
        | many ways a good answer could look, which cannot be calculated on
        | the basis of next-token probability.
        | 
        | So you start by having your vanilla model generate n completions
        | for your prompt. You then manually score them. And those
        | prompt => (completion, score) pairs become your training set.
       | 
       | Once the model is trained, you may find that you can cheat:
       | 
       | Because if you include the desired score in your prompt, the
       | model will now strive to produce an answer that is consistent
       | with that score.
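        | 
        | As a minimal sketch of the "score in the prompt" variant (the
        | exact score token and formatting below are made up, any
        | consistent convention works):
        | 
        |     # Manually scored completions: (prompt, completion, score)
        |     scored = [
        |         ("Explain RLHF in one sentence.",
        |          "RLHF fine-tunes a model against a reward model built "
        |          "from human preference rankings.", 9),
        |         ("Explain RLHF in one sentence.",
        |          "RLHF is when the AI learns stuff from people, somehow.", 3),
        |     ]
        | 
        |     def to_example(prompt, completion, score):
        |         # Put the score up front so the model learns to associate
        |         # "score: 9" with the kind of answer humans rated highly.
        |         return {"input": f"score: {score}\n{prompt}",
        |                 "target": completion}
        | 
        |     train_set = [to_example(p, c, s) for p, c, s in scored]
        | 
        |     # Inference-time cheat: ask for a high score up front, e.g.
        |     # model.generate("score: 9\nExplain RLHF in one sentence.")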
        
         | viraptor wrote:
         | That works in the same way as actor-critic pair, right? Just
         | all wrapped in the same network/output?
        
         | bick_nyers wrote:
          | I had an idea similar to this for a model that allows you to
          | parameterize a performance vs. accuracy trade-off: essentially
          | an imbalanced MoE-like approach where, instead of the "quality
          | score" in your example, you assign a score based on how much
          | computation was used to achieve that answer. Then you can
          | dynamically request different code paths be taken at inference
          | time.
        
         | lossolo wrote:
         | Not the same, it will get you worse output and is harder to do
         | right in practice.
        
         | visarga wrote:
         | > if you include the desired score in your prompt, the model
         | will now strive to produce an answer that is consistent with
         | that score
         | 
         | But you need a model to generate score from answer, and then
         | fine-tune another model to generate answer conditioned on
         | score. The first time the score is at the end and the second
         | time at the beginning. It's how DecisionTransformer works too,
         | it constructs a sequence of (reward, state, action) where
         | reward conditions on the next action.
         | 
         | https://arxiv.org/pdf/2106.01345
         | 
         | By the same logic you could generate tags, including style,
         | author, venue and date. Some will be extracted from the source
         | document, the others produced with classifiers. Then you can
         | flip the order and finetune a model that takes the tags before
          | the answer. Then you've got an LLM you can condition on author
          | and style.
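          | 
          | A sketch of the order flip (the tag format and the toy
          | classifiers below are made up):
          | 
          |     # Tag each document, then put the tags first so they become
          |     # a conditioning prefix, DecisionTransformer-style.
          |     def tag_document(document, classifiers):
          |         tags = {name: clf(document)
          |                 for name, clf in classifiers.items()}
          |         header = " ".join(f"<{k}={v}>"
          |                           for k, v in sorted(tags.items()))
          |         return {"input": header, "target": document}
          | 
          |     # Toy stand-ins for real classifiers:
          |     classifiers = {
          |         "style": lambda d: "formal" if "therefore" in d else "casual",
          |         "length": lambda d: "long" if len(d.split()) > 200 else "short",
          |     }
          | 
          |     example = tag_document("Hey, quick question about RLHF...",
          |                            classifiers)
          |     # Fine-tune on such pairs, then prompt with e.g.
          |     # "<length=long> <style=formal>" to condition generation.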
        
       | jrflowers wrote:
       | I enjoyed this Karpathy post about how there is absolutely no
       | extant solution to training language models to reliably solve
       | open ended problems.
       | 
       | I preferred Zitron's point* that we would need to invent several
       | branches of science to solve this problem, but it's good to see
       | the point made tweet-sized.
       | 
       | *https://www.wheresyoured.at/to-serve-altman/
        
         | moffkalast wrote:
         | That's a great writeup, and great references too.
         | 
         | > OpenAI needs at least $5 billion in new capital a year to
         | survive. This would require it to raise more money than has
         | ever been raised by any startup in history
         | 
         | They were probably toast before, but after Zuck decided to take
         | it personally and made free alternatives for most use cases
         | they definitely are, since if they had any notable revenue from
         | selling API access it will just keep dropping.
        
         | klibertp wrote:
         | I read the article you linked. I feel like I wasted my time.
         | 
         | The article has a single point it repeats over and over again:
         | OpenAI (and "generative AI as a whole"/"transformer-based
         | models") are too expensive to run, and it's "close to
         | impossible" for them to either limit costs or increase revenue.
         | This is because "only 5% of businesses report using the
         | technology in production", and that the technology had no
         | impact on "productivity growth". It's also because "there's no
         | intelligence in it", and the "models can't reason". Oh, also,
         | ChatGPT is "hard to explain to a layman".
         | 
         | All that is liberally sprinkled with "I don't know, but"s and
         | absolutely devoid of any historical context other than in
         | financial terms. No technical details. Just some guesses and an
         | ironclad belief that it's impossible to improve GPTs without
         | accessing more data than there is in existence. Agree or
         | disagree; the article is not worth wading through so many
         | words: others made arguments on both sides much better and,
         | crucially, shorter.
        
           | jrflowers wrote:
           | > The article has a single point it repeats over and over
           | again: [7 distinct points]
           | 
            | I don't think having a single overall thesis is the same thing
           | as repeating oneself. For example "models can't reason" has
           | nothing at all to do with cost.
        
             | klibertp wrote:
             | 7 distinct points in the number of words that would suffice
             | for 70 points...
             | 
             | Anyway, it's just my opinion: to me, the length of the
             | article was artificially increased to the point where it
             | wasn't worth my time to read it. As such, unfortunately,
             | I'm not inclined to spend any more time discussing it - I
             | just posted my takeaways and a warning for people like me.
             | If you liked the article, good for you.
             | 
             | > "models can't reason" has nothing at all to do with cost.
             | 
             | Yeah, that one falls under "no technical details".
        
       | gizmo wrote:
       | This is why AI coding assistance will leap ahead in the coming
       | years. Chat AI has no clear reward function (basically impossible
       | to judge the quality of responses to open-ended questions like
       | historical causes for a war). Coding AI can write tests, write
       | code, compile, examine failed test cases, search for different
       | coding solutions that satisfy more test cases or rewrite the
        | tests, all in an unsupervised loop. And then the whole process can
       | turn into training data for future AI coding models.
       | 
       | I expect language models to also get crazy good at mathematical
       | theorem proving. The search space is huge but theorem
       | verification software will provide 100% accurate feedback that
       | makes real reinforcement learning possible. It's the combination
       | of vibes (how to approach the proof) and formal verification that
       | works.
       | 
       | Formal verification of program correctness never got traction
       | because it's so tedious and most of the time approximately
       | correct is good enough. But with LLMs in the mix the equation
       | changes. Having LLMs generate annotations that an engine can use
       | to prove correctness might be the missing puzzle piece.
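        | 
        | In pseudocode the loop is something like this (a sketch; the
        | llm.generate_tests / llm.generate_code wrappers are hypothetical
        | stand-ins for whatever model is used):
        | 
        |     import pathlib, subprocess, tempfile
        | 
        |     def run_tests(code, tests):
        |         # Drop the candidate and its tests into a temp dir, run pytest.
        |         with tempfile.TemporaryDirectory() as d:
        |             pathlib.Path(d, "solution.py").write_text(code)
        |             pathlib.Path(d, "test_solution.py").write_text(tests)
        |             proc = subprocess.run(
        |                 ["python3", "-m", "pytest", "-q", d],
        |                 capture_output=True, text=True)
        |             return proc.returncode == 0, proc.stdout + proc.stderr
        | 
        |     def solve(spec, llm, max_attempts=5):
        |         tests = llm.generate_tests(spec)
        |         code = llm.generate_code(spec)
        |         for _ in range(max_attempts):
        |             ok, log = run_tests(code, tests)
        |             if ok:
        |                 # (spec, tests, code) can become training data.
        |                 return code, tests
        |             # Retry, feeding the failure log back to the model.
        |             code = llm.generate_code(spec, feedback=log)
        |         return None, tests
        | 
        | The catch is keeping the test-writing and code-writing steps
        | honest with each other, otherwise the loop just converges on
        | tests that are easy to pass.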
        
         | discreteevent wrote:
         | Does programming have a clear reward function? A vague
         | description from a business person is not it. By the time
         | someone (a programmer?) has written a reward function that is
         | clear enough, how would that function look compared to a
         | program?
        
           | rossamurphy wrote:
           | +1
        
           | eru wrote:
           | > Does programming have a clear reward function? A vague
           | description from a business person isn't it. By the time
           | someone (a programmer?) has written a reward function that is
           | clear enough, how would that function look compared to a
           | program?
           | 
           | Well, to give an example: the complexity class NP is all
           | about problems that have quick and simple verification, but
           | finding solutions for many problems is still famously hard.
           | 
           | So there are at least some domains where this model would be
           | a step forward.
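              | 
              | A tiny illustration with subset sum (checking a proposed
              | certificate is trivial, finding one in general is the hard
              | part):
              | 
              |     from collections import Counter
              |     from itertools import combinations
              | 
              |     def verify(numbers, target, subset):
              |         # Fast: multiset containment plus one sum.
              |         return (not (Counter(subset) - Counter(numbers))
              |                 and sum(subset) == target)
              | 
              |     def solve(numbers, target):
              |         # Brute force: worst case explores 2^n subsets.
              |         for r in range(len(numbers) + 1):
              |             for combo in combinations(numbers, r):
              |                 if sum(combo) == target:
              |                     return list(combo)
              |         return None
              | 
              |     nums = [3, 34, 4, 12, 5, 2]
              |     cert = solve(nums, 9)         # slow in general
              |     print(verify(nums, 9, cert))  # quick to check -> True
              | 
              | So a setup where the model proposes candidates and a cheap
              | verifier scores them gets its training signal almost for
              | free.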
        
             | thaumasiotes wrote:
             | But in that case, finding the solution is hard and you
             | generally don't try. Instead, you try to get fairly close,
             | and it's more difficult to verify that you've done so.
        
               | eru wrote:
               | No. Most instances of most NP hard problems are easy to
               | find solutions for. (It's actually really hard to eg
               | construct a hard instance for the knapsack problem. And
               | SAT solvers also tend to be really fast in practice.)
               | 
               | And in any case, there are plenty of problems in NP that
               | are not NP hard, too.
               | 
               | Yes, approximation is also an important aspect of many
               | practical problems.
               | 
               | There's also lots of problems where you can easily
               | specify one direction of processing, but it's hard to
               | figure out how to undo that transformation. So you can
               | get plenty of training data.
        
               | imtringued wrote:
                | I have a very simple integer linear program, and solving
                | it is really like waiting for the heat death of the
                | universe.
               | 
               | No, running it as a linear program is still slow.
               | 
               | I'm talking about small n=50 taking tens of minutes for a
               | trivial linear program. Obviously the actual linear
               | program is much bigger and scales quadratically in size,
               | but still. N=50 is nothing.
        
           | tablatom wrote:
           | Very good point. For some types of problems maybe the answer
           | is yes. For example porting. The reward function is testing
           | it behaves the same in the new language as the old one.
           | Tricky for apps with a gui but doesn't seem impossible.
           | 
           | The interesting kind of programming is the kind where I'm
           | figuring out what I'm building as part of the process.
           | 
           | Maybe AI will soon be superhuman in all the situations where
           | we know _exactly_ what we want (win the game), but not in the
            | areas we don't. I find that kind of cool.
        
             | martinflack wrote:
             | Even for porting there's a bit of ambiguity... Do you port
             | line-for-line or do you adopt idioms of the target
             | language? Do you port bug-for-bug as well as feature-for-
             | feature? Do you leave yet-unused abstractions and
             | opportunities for expansion that the original had coded in,
             | if they're not yet used, and the target language code is
             | much simpler without?
             | 
             | I've found when porting that the answers to these are
             | sometimes not universal for a codebase, but rather you are
             | best served considering case-by-case inside the code.
             | 
             | Although I suppose an AI agent could be created that holds
             | a conversation with you and presents the options and acts
             | accordingly.
        
           | littlestymaar wrote:
           | "A precise enough specification is already code", which means
           | we'll not run out of developers in the short term. But the
           | day to day job is going to be very different, maybe as
           | different as what we're doing now compared to writing machine
           | code on punchcards.
        
             | mattmanser wrote:
             | Doubtful. This is the same mess we've been in repeatedly
             | with 'low code'/'no code' solutions.
             | 
             | Every decade it's 'we don't need programmers anymore'. Then
             | it turns out specifying the problem needs programmers. Then
             | it turns out the auto-coder can only reach a certain level
             | of complexity. Then you've got real programmers modifying
              | over-complicated code. Then everyone realizes they've
             | wasted millions and it would have been quicker and cheaper
             | to get the programmers to write the code in the first
             | place.
             | 
             | The same will almost certainly happen with AI generated
             | code for the next decade or two, just at a slightly higher
             | level of program complexity.
        
           | cs702 wrote:
           | Programming has a clear reward function when the problem
            | being solved is well-specified, e.g., "we need a program
           | that grabs data from these three endpoints, combines their
           | data in this manner, and returns it in this JSON format."
           | 
           | The same is true for math. There is a clear reward function
           | when the goal is well-specified, e.g., "we need a sequence of
           | mathematical statements that prove this other important
           | mathematical statement is true."
        
             | seanthemon wrote:
              | > when the problem being solved is well-specified
             | 
             | Phew! Sounds like i'll be fine, thank god for product
             | owners.
        
               | steveBK123 wrote:
               | 20 years, number of "well specified" requirements
               | documents I've received: 0.
        
             | danpalmer wrote:
             | I'm not sure I would agree. By the time you've written a
             | full spec for it, you may as well have just written a high
             | level programming language anyway. You can make assumptions
             | that minimise the spec needed... but also programming APIs
             | can have defaults so that's no advantage.
             | 
             | I'd suggest that the Python code for your example prompt
             | with reasonable defaults is not actually that far from the
             | prompt itself in terms of the time necessary to write it.
             | 
             | However, add tricky details like how you want to handle
             | connection pooling, differing retry strategies, short
             | circuiting based on one of the results, business logic in
             | the data combination step, and suddenly you've got a whole
             | design doc in your prompt and you need a senior engineer
             | with good written comms skills to get it to work.
        
               | cs702 wrote:
               | Thanks. I view your comment as orthogonal to mine,
               | because I didn't say anything about how easy or hard it
               | would be for human beings to specify the problems that
               | must be solved. Some problems may be easy to specify,
               | others may be hard.
               | 
               | I feel we're looking at the need for a measure of the
               | computational complexity of _problem specifications_ --
               | something like Kolmogorov complexity, i.e., minimum
               | number of bits required, but for specifying instead of
               | solving problems.
        
               | danpalmer wrote:
               | Apologies, I guess I agree with your sentiment but
               | disagree with the example you gave as I don't think it's
               | well specified, and my more general point is that there
               | isn't an effective specification, which means that in
               | practice there isn't a clear reward function. If we can
               | get the clear specification, which we probably can do
               | proportionally to the complexity of the problem, and not
               | getting very far up the complexity curve, then I would
               | agree we can get the good reward function.
        
               | cs702 wrote:
               | > the example you gave
               | 
               | Ah, got it. I was just trying to keep my comment short!
        
               | chasd00 wrote:
               | > I'm not sure I would agree. By the time you've written
               | a full spec for it, you may as well have just written a
               | high level programming language anyway.
               | 
               | Remember all those attempts to transform UML into code
               | back in the day? This sounds sorta like that. I'm not a
               | total genai naysayer but definitely in the "cautiously
               | curious" camp.
        
               | danpalmer wrote:
               | Absolutely, we've tried lots of ways to formalise
               | software specification and remove or minimise the amount
               | of coding, and almost none of it has stuck other than
               | creating high level languages and better code-level
               | abstractions.
               | 
               | I think generative AI is already a "really good
               | autocomplete" and will get better in that respect, I can
               | even see it generating good starting points, but I don't
               | think in its current form it will replace the act of
               | programming.
        
               | bee_rider wrote:
               | Yeah, an LLM applied to converting design docs to
               | programs seems like, essentially, the invention of an
               | extremely high level programming language. Specifying the
               | behavior of the program in sufficient detail is... why we
               | have programming languages.
               | 
               | There's the task of writing syntax, which is the
               | mechanical overhead of the task of telling the computer
               | what to do. People should focus on the latter (too much
               | code is a symptom of insufficient automation or
               | abstraction). Thankfully lots of people have CS degrees,
               | not "syntax studies" degrees, right?
        
             | ekianjo wrote:
             | > Programming has a clear reward function when the problem
              | being solved is well-specified
             | 
             | the reason why we spend time programming is because the
             | problems in question are not easily defined, let alone the
             | solutions.
        
             | FooBarBizBazz wrote:
             | This could make programming more declarative or constraint-
             | based, but you'd still have to specify the properties you
             | want. Ultimately, if you are defining some function in the
             | mathematical sense, you need to say somehow what inputs go
             | to what outputs. You need to _communicate_ that to the
             | computer, and a certain number of bits will be needed to do
             | that. Of course, if you have a good statistical model of
             | how-probably a human wants a given function f, then you can
              | perform that communication to the machine in about
              | log2(1/P(f)) bits, so the model isn't worthless.
             | 
             | Here I have assumed something about the set that f lives
             | in. I am taking for granted that a probability measure can
             | be defined. In theory, perhaps there are difficulties
             | involving the various weird infinities that show up in
             | computing, related to undecideability and incompleteness
             | and such. But at a practical level, if we assume some
             | concrete representation of the program then we can just
             | define that it is smaller than some given bound, and ditto
             | for a number of computational steps with a particular model
             | of machine (even if fairly abstract, like some lambda
             | calculus thing), so realistically we might be able to not
             | worry about it.
             | 
             | Also, since our input and output sets are bounded (say, so
             | many 64-bit doubles in, so many out), that also gives you a
             | finite set of functions in principle; just think of the
             | size of the (impossibly large) lookup table you'd need to
             | represent it.
        
             | agos wrote:
             | can you give an example of what "in this manner" might be?
        
             | dartos wrote:
             | > programming has a clear reward function.
             | 
             | If you're the most junior level, sure.
             | 
             | Anything above that, things get fuzzy, requirements change,
             | biz goals shift.
             | 
             | I don't really see this current wave of AI giving us
             | anything much better than incremental improvement over
             | copilot.
             | 
             | A small example of what I mean:
             | 
              | These systems are statistically based, so there are no
              | guarantees. Because of that, I wouldn't even gain anything
             | from having it write my tests since tests are easily built
             | wrong in subtle ways.
             | 
             | I'd need to verify the test by reviewing it and, imo,
             | writing the test would be less time than coaxing a correct
             | one, reviewing, re-coaxing, repeat.
        
             | nyrikki wrote:
              | A couple of problems that are impossible to prove from the
              | constructivist angle:
              | 
              | 1) Addition of the natural numbers
              | 2) Equality of two real numbers
              | 
              | When you restrict your tools to perceptron-based feed-
              | forward networks with high parallelism and no real access
              | to 'common knowledge', the solution set is very restricted.
              | 
              | Basically, what Godel proved that destroyed Russell's plans
              | for the Principia Mathematica applies here.
             | 
             | Programmers can decide what is sufficient if not perfect in
             | models.
        
           | gizmo wrote:
           | Much business logic is really just a state machine where all
           | the states and all the transitions need to be handled. When a
           | state or transition is under-specified an LLM can pass the
           | ball back and just ask what should happen when A and B but
           | not C. Or follow more vague guidance on what should happen in
           | edge cases. A typical business person is perfectly capable of
           | describing how invoicing should work and when refunds should
           | be issued, but very few business people can write a few
           | thousand lines of code that covers all the cases.
        
             | discreteevent wrote:
             | > an LLM can pass the ball back and just ask what should
             | happen when A and B but not C
             | 
             | What should the colleagues of the business person review
             | before deciding that the system is fit for purpose? Or what
             | should they review when the system fails? Should they go
             | back over the transcript of the conversation with the LLM?
        
               | ben_w wrote:
               | As an LLM can output source code, that's all answerable
               | with "exactly what they already do when talking to
               | developers".
        
               | discreteevent wrote:
               | There are two reasons the system might fail:
               | 
               | 1) The business person made a mistake in their
               | conversation/specification.
               | 
               | In this case the LLM will have generated code and tests
               | that match the mistake. So all the tests will pass. The
               | best way to catch this _before it gets to production_ is
               | to have someone else review the specification. But the
               | problem is that the specification is a long trial-and-
               | error conversation in which later parts may contradict
               | earlier parts. Good luck reviewing that.
               | 
               | 2) The LLM made a mistake.
               | 
               | The LLM may have made the mistake because of a
               | hallucination which it cannot correct because in trying
               | to correct it the same hallucination invalidates the
               | correction. At this point someone has to debug the
               | system. But we got rid of all the programmers.
        
               | ben_w wrote:
               | This still resolves as "business person asks for code,
               | business person gets code, business person says if code
               | is good or not, business person deploys code".
               | 
               | That an LLM or a human is where the code comes from,
               | doesn't make much difference.
               | 
                | Though it does _kinda_ sound like you're assuming all
               | LLMs must develop with Waterfall? That they can't e.g.
               | use Agile? (Or am I reading too much into that?)
        
               | discreteevent wrote:
               | > business person says if code is good or not
               | 
               | How do they do this? They can't trust the tests because
               | the tests were also developed by the LLM which is working
               | from incorrect information it received in a chat with the
               | business person.
        
               | ben_w wrote:
                | The same way they already do with human coders whose
                | unit tests were developed by exactly the same flawed
               | processes:
               | 
               |  _Mediocrely_.
               | 
               | Sometimes the current process works, other times the
               | planes fall out of the sky, or updates causes millions of
               | computers to blue screen on startup at the same time.
               | 
               | LLMs in particular, and AI in general, doesn't need to
               | _beat_ humans at the same tasks.
        
               | gizmo wrote:
               | How does a business person today decide if a system is
               | fit for purpose when they can't read code? How is this
               | different?
        
               | Jensson wrote:
               | They don't, the software engineer does that. It is
               | different since LLMs can't test the system like a human
               | can.
               | 
               | Once the system can both test and update the spec etc to
               | fix errors in the spec and build the program and ensure
               | the result is satisfactory, we have AGI. If you argue an
               | AGI could do it, then yeah it could as it can replace
               | humans at everything, the argument was for an AI that
               | isn't yet AGI.
        
               | gizmo wrote:
               | The world runs on fuzzy underspecified processes. On
               | excel sheets and post-it notes. Much of the world's
               | software needs are not sophisticated and don't require
               | extensive testing. It's OK if a human employee is in the
               | loop and has to intervenes sometimes when an AI-built
               | system malfunctions. Businesses of all sizes have
               | procedures where problems get escalated to more senior
               | people with more decision-making power. The world is
               | already resilient against mistakes made by
               | tired/inattentive/unintelligent people, and mistakes made
               | by dumb AI systems will blend right in.
        
               | discreteevent wrote:
               | > The world runs on fuzzy underspecified processes. On
               | excel sheets and post-it notes.
               | 
               | Excel sheets are not fuzzy and underspecified.
               | 
               | > It's OK if a human employee is in the loop and has to
                | intervene sometimes
               | 
               | I've never worked on software where this was OK. In many
               | cases it would have been disastrous. Most of the time a
               | human employee could not fix the problem without
               | understanding the software.
        
               | gizmo wrote:
               | All software that interops with people, other businesses,
               | APIs, deals with the physical world in any way, or
               | handles money has cases that require human intervention.
               | It's 99.9% of software if not more. Security updates.
               | Hardware failures. Unusual sensor inputs. A sudden influx
               | of malformed data. There is no such thing as an entirely
               | autonomous system.
               | 
               | But we're not anywhere close to maximally automated.
               | Today (many? most?) office workers do manual data entry
               | and processing work that requires very little thinking.
               | Even automating just 30% of their daily work is a huge
               | win.
        
           | jimbokun wrote:
           | The reward function could be "pass all of these tests I just
           | wrote".
        
             | marcosdumay wrote:
             | Lol. Literally.
             | 
              | If you have that many well-written tests, you can pass
             | them to a constraint solver today and get your program. No
             | LLM needed.
             | 
             | Or even run your tests instead of the program.
        
               | emporas wrote:
               | Probably the parent assumes that he does have the tests,
               | billions of them.
               | 
               | One very strong LLM could generate billions of tests
               | alongside the working code and then train another smaller
              | model, or feed it into the next iteration of training the
              | same strong model. Strong LLMs do exist for that purpose,
              | e.g. Nemotron-4 340B and Llama 3.1 405B.
              | 
              | It would be interesting if a dataset like that were created
              | and then released as open source. Many LLMs, proprietary or
              | not, could incorporate the dataset in their training, and
              | hundreds of LLMs on the internet would suddenly become much
              | better at coding, all of them at once.
        
             | acchow wrote:
             | After much RL, the model will just learn to mock everything
             | to get the test to pass.
        
           | ryukoposting wrote:
           | If we will struggle to create reward functions for AI, then
           | how different is that from the struggles we _already face_
           | when divvying up product goals into small tasks to fit our
           | development cycles?
           | 
           | In other words, to what extent does Agile's ubiquity prove
           | our competence in turning product goals into _de facto_
           | reward functions?
        
           | paxys wrote:
           | Exactly, and people have been saying this for a while now. If
           | an "AI software engineer" needs a perfect spec with zero
           | ambiguity, all edge cases defined, full test coverage with
           | desired outcomes etc., then the person writing the spec is
           | the actual software engineer, and the AI is just a compiler.
        
             | satvikpendem wrote:
             | Reminds me of when computers were literally humans
             | computing things (often women). How time weaves its
             | circular web.
        
             | dartos wrote:
              | We've also learned that starting off with a rigidly defined
              | spec is actually harmful to most user-facing software,
             | since customers change their minds so often and have a hard
             | time knowing what they want right from the start.
        
               | diffxx wrote:
               | This is why most of the best software is written by
               | people writing things for themselves and most of the
               | worst is made by people making software they don't use
               | themselves.
        
             | _the_inflator wrote:
             | Exactly. This is what I tell everyone. The harder you work
             | on specs the easier it gets in the aftermath. And this is
             | exactly what business with lofty goals doesn't get or
             | ignores. Put another way: a fool with a tool...
             | 
             | Also look out for optimization the clever way.
        
             | sgu999 wrote:
             | > then the person writing the spec is the actual software
             | engineer
             | 
             | Sounds like this work would involve asking questions to
             | collaborators, guess some missing answers, write specs and
             | repeat. Not that far ahead of the current sota of AI...
        
               | nyrikki wrote:
                | Same reason the visual programming paradigm failed: the
                | main problem is not the code.
               | 
               | While writing simple functions may be mechanistic, being
               | a developer is not.
               | 
               | 'guess some missing answers' is why Waterfall, or any big
               | upfront design has failed.
               | 
               | People aren't simply loading pig iron into rail cars like
               | Taylor assumed.
               | 
               | The assumption of perfect central design with perfect
               | knowledge and perfect execution simply doesn't work for
                | systems which are far more like an organism than a
               | machine.
        
               | gizmo wrote:
               | Waterfall fails when domain knowledge is missing.
               | Engineers won't take "obvious" problems into
               | consideration when they don't even know what the right
                | questions to ask are. When a system gets rebuilt for the
               | 3rd time the engineers do know what to build and those
               | basic mistakes don't get made.
               | 
               | Next gen LLMs, with their encyclopedic knowledge about
               | the world, won't have that problem. They'll get the
               | design correct on their first attempt because they're
               | already familiar with the common pitfalls.
               | 
               | Of course we shouldn't expect LLMs to be a magic bullet
               | that can program anything. But if your frame of reference
               | is "visual programming" where the goal is to turn poorly
               | thought out requirements into a reasonably sensible state
               | machine then we should expect LLMs to get very good at
               | that compared to regular people.
        
               | nyrikki wrote:
               | LLMs are NLP, what you are talking about is NLU, which
               | has been considered an AI-hard problem for a long time.
               | 
               | I keep looking for discoveries that show any movement
               | there. But LLMs are still basically pattern matching and
               | finding.
               | 
               | They can do impressive things, but they actually have no
                | concept of what the 'right thing' even is; it is
                | statistics, not philosophy.
        
             | qup wrote:
             | What makes you think they'll need a perfect spec?
             | 
             | Why do you think they would need a more defined spec than a
             | human?
        
               | digging wrote:
               | A human has the ability to contact the PM and say, "This
               | won't work, for $reason," or, "This is going to look
               | really bad in $edgeCase, here are a couple options I've
               | thought of."
               | 
               | There's nothing about AI that makes such operations
               | intrinsically impossible, but they require much more than
               | just the ability to generate working code.
        
             | mattnewton wrote:
             | I mean, that's already the case in many places, the senior
             | engineer / team lead gathering requirements and making
             | architecture decisions is removing enough ambiguity to hand
              | it off to juniors churning out the code. This just gives you
              | very cheap, very fast-typing but uncreative and slightly
              | dull junior developers.
        
             | mlavrent wrote:
             | This is not quite right - a specification is not equivalent
             | to writing software, and the code generator is not just a
             | compiler - in fact, generating implementations from
             | specifications is a pretty active area of research (a
             | simpler problem is the problem of generating a
             | configuration that satisfies some specification,
             | "configuration synthesis").
             | 
             | In general, implementations can be vastly more complicated
             | than even a complicated spec (e.g. by having to deal with
             | real-world network failures, etc.), whereas a spec needs
             | only to describe the expected behavior.
             | 
             | In this context, this is actually super useful, since
             | defining the problem (writing a spec) is usually easier
             | than solving the problem (writing an implementation); it's
             | not just translating (compiling), and the engineer is now
             | thinking at a higher level of abstraction (what do I want
             | it to do vs. how do I do it).
        
           | airstrike wrote:
           | My reward in Rust is often when the code actually compiles...
        
           | _the_inflator wrote:
            | Full circle, but instead of determinism you introduce some
            | randomness. Not good.
            | 
            | Reasoning is also something business is ambivalent about.
            | The majority of planning and execution teams stick to
            | processes; I see far more potential in automating those
            | than in any part of app production.
            | 
            | Business is going to have a hard time if it believes it
            | alone can orchestrate some AI consoles.
        
           | tomrod wrote:
           | You can define one based on passed tests, code coverage,
           | other objectives, or weighted combinations without too much
           | loss of generality.
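            | 
            | A minimal sketch of such a weighted reward, assuming pytest
            | and a recent coverage.py (with --format=total) are
            | installed; the weights and repo layout are placeholders:
            | 
            |   import subprocess
            | 
            |   def reward(repo_dir, w_tests=0.7, w_cov=0.3):
            |       # 1.0 if the whole suite passes (pytest exit code 0)
            |       suite = subprocess.run(
            |           ["coverage", "run", "-m", "pytest", "-q"],
            |           cwd=repo_dir)
            |       passed = 1.0 if suite.returncode == 0 else 0.0
            |       # total line coverage as a bare percentage
            |       cov = subprocess.run(
            |           ["coverage", "report", "--format=total"],
            |           cwd=repo_dir, capture_output=True, text=True)
            |       cov_frac = float(cov.stdout.strip() or 0) / 100.0
            |       return w_tests * passed + w_cov * cov_frac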
        
           | LeifCarrotson wrote:
           | I think you could set up a good reward function for a
           | programming assistance AI by checking that the resulting code
           | is actually used. Flag or just 'git blame' the code produced
           | by the AI with the prompts used to produce it, and when you
           | push a release, it can check which outputs were retained in
           | production code from which prompts. Hard to say whether code
           | that needed edits was because the prompt was bad or because
           | the code was bad, but at least you can get positive feedback
           | when a good prompt resulted in good code.
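            | 
            | One way to approximate that signal, as a sketch: assume
            | AI-authored commits carry a made-up tag in their commit
            | message, and that the number of lines originally emitted
            | for each prompt was logged when the suggestion was first
            | accepted.
            | 
            |   import subprocess
            | 
            |   def retention_reward(path, tag, lines_emitted,
            |                        ref="HEAD"):
            |       # --line-porcelain prints full commit metadata,
            |       # including the "summary" line (first line of the
            |       # commit message), for every surviving line at ref
            |       cmd = ["git", "blame", "--line-porcelain",
            |              ref, "--", path]
            |       blame = subprocess.run(cmd, capture_output=True,
            |                              text=True).stdout
            |       kept = sum(1 for l in blame.splitlines()
            |                  if l.startswith("summary ") and tag in l)
            |       # fraction of the model's original lines still here
            |       return min(kept / max(lines_emitted, 1), 1.0)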
        
             | rfw300 wrote:
             | GitHub Copilot's telemetry does collect data on whether
             | generated code snippets end up staying in the code, so
             | presumably models are tuned on this feedback. But you
             | haven't solved any of the problems set out by Karpathy here
              | -- this is just bankshot RLHF.
        
             | bee_rider wrote:
             | That could be interesting but it does seem like a much
             | fuzzier and slower feedback loop than the original idea.
             | 
             | It also seems less unique to code. You could also have a
             | chat bot write an encyclopedia and see if the encyclopedias
             | sold well. Chat bots could edit Wikipedia and see if their
             | edits stuck as a reward function (seems ethically pretty
             | questionable or at least in need of ethical analysis, but
             | it is possible).
             | 
             | The maybe-easy to evaluate reward function is an
             | interesting aspect of code (which isn't to say it is the
             | only interesting aspect, for sure!)
        
           | axus wrote:
           | If they get permission and don't mind waiting, they could
           | check if people throw away the generated code or keep it as-
           | is.
        
           | consteval wrote:
           | There's levels to this.
           | 
           | Certainly "compiled" is one reward (although a blank file
           | fits that...) Another is test cases, input and output. This
           | doesn't work on a software-wide scale but function-wide it
           | can work.
           | 
           | In the future I think we'll see more of this test-driven
            | development, where developers formally define the
           | requirements and expectations of a system and then an LLM
           | (combined with other tools) generates the implementation. So
           | instead of making the implementation, you just declaratively
           | say what the implementation should do (and shouldn't).
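            | 
            | A rough sketch of that loop, where generate_impl stands in
            | for whatever model call you use (hypothetical) and the
            | developer-written tests, which are assumed to import the
            | module as impl, are the only spec:
            | 
            |   import pathlib, subprocess, tempfile
            | 
            |   def synthesize(spec, tests_path, generate_impl, n=10):
            |       feedback = ""
            |       for _ in range(n):
            |           # hypothetical model call: spec + last failures
            |           code = generate_impl(spec, feedback)
            |           work = pathlib.Path(tempfile.mkdtemp())
            |           (work / "impl.py").write_text(code)
            |           (work / "test_impl.py").write_text(
            |               pathlib.Path(tests_path).read_text())
            |           res = subprocess.run(["pytest", "-q"], cwd=work,
            |                                capture_output=True,
            |                                text=True)
            |           if res.returncode == 0:
            |               return code  # declared expectations all met
            |           feedback = res.stdout  # failures steer next try
            |       return None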
        
           | waldrews wrote:
           | There's no reward function in the sense that optimizing the
           | reward function means the solution is ideal.
           | 
           | There are objective criteria like 'compiles correctly' and
           | 'passes self-designed tests' and 'is interpreted as correct
           | by another LLM instance' which go a lot further than criteria
           | that could be defined for most kinds of verbal questions.
        
         | davedx wrote:
         | I'm pretty interested in the theorem proving/scientific
         | research aspect of this.
         | 
         | Do you think it's possible that some version of LLM technology
         | could discover new physical theories (that are experimentally
         | verifiable), like for example a new theory of quantum gravity,
         | by exploring the mathematical space?
         | 
         | Edit: this is just incredibly exciting to think about. I'm not
         | an "accelerationist" but the "singularity" has never felt
         | closer...
        
           | esjeon wrote:
           | IIRC, there have been people doing similar things using
           | something close to brute-force. Nothing of real significance
           | has been found. A problem is that there are infinitely many
           | physically and mathematically correct theorems that would add
           | no practical value.
        
           | gizmo wrote:
           | My hunch is that LLMs are nowhere near intelligent enough to
           | make brilliant conceptual leaps. At least not anytime soon.
           | 
           | Where I think AI models might prove useful is in those cases
           | where the problem is well defined, where formal methods can
           | be used to validate the correctness of (partial) solutions,
           | and where the search space is so large that work towards a
           | proof is based on "vibes" or intuition. Vibes can be trained
           | through reinforcement learning.
           | 
           | Some computer assisted proofs are already hundreds of pages
           | or gigabytes long. I think it's a pretty safe bet that really
           | long and convoluted proofs that can only be verified by
           | computers will become more common.
           | 
           | https://en.wikipedia.org/wiki/Computer-assisted_proof
        
             | CuriouslyC wrote:
             | They don't need to be intelligent to make conceptual leaps.
             | DeepMind stuff just does a bunch of random RL experiments
             | until it finds something that works.
        
           | tsimionescu wrote:
           | I think the answer is almost certainly no, and is mostly
           | unrelated to how smart LLMs can get. The issue is that any
           | theory of quantum gravity would only be testable with
           | equipment that is much, much more complex than what we have
           | today. So even if the AI came up with some beautifully simple
           | theory, testing that its predictions are correct is still not
           | going to be feasible for a very long time.
           | 
           | Now, it is possible that it could come up with some theory
           | that is radically different from current theories, where
           | quantum gravity arises very naturally, and that fits all of
            | the other predictions of the current theories that we can
           | measure - so we would have good reasons to believe the new
            | theory and consider quantum gravity _probably_ solved. But
            | it's literally impossible to predict whether such a theory even
           | exists, that is not mathematically equivalent to QM/QFT but
           | still matches all confirmed predictions.
           | 
           | Additionally, nothing in AI tech so far predicts that current
           | approaches should be any good at this type of task. The only
            | tasks where AI has truly excelled are extremely well
           | defined problems where there is a huge but finite search
           | space; and where partial solutions are easy to grade. Image
           | recognition, game playing, text translation are the great
           | successes of AI. And performance drops sharply with the
           | uncertainty in the space, and with the difficulty of judging
           | a partial solution.
           | 
           | Finding physical theories is nothing like any of these
           | problems. The search space is literally infinite, partial
           | solutions are almost impossible to judge, and even judging
           | whether a complete solution is good or not is extremely
           | difficult. Sure, you can check if it's mathematically
           | coherent, but that tells you nothing about whether it
           | describes the physical world correctly. And there are plenty
           | of good physical theories that aren't fully formally proven,
           | or weren't at the time they were invented - so mathematical
           | rigour isn't even a very strong signal (e.g. Newton's
            | infinitesimal calculus wasn't considered sound until the
            | 1900s or something, by which time his theories had long since
            | been rewritten in other terms; the Dirac delta wasn't given a
            | precise mathematical definition until much later than its
            | uses; and I think QFT still uses some iffy math even today).
        
           | jimbokun wrote:
           | Current LLMs are optimized to produce output most resembling
           | what a human would generate. Not surpass it.
        
             | ben_w wrote:
             | The output most _pleasing_ to a human, which is both better
             | and worse.
             | 
             | Better, when we spot mistakes even if we couldn't create
             | the work with the error. Think art: most of us can't draw
             | hands, but we can spot when Stable Diffusion gets them
             | wrong.
             | 
             | Worse also, because there are many things which are "common
             | sense" and wrong, e.g. https://en.wikipedia.org/wiki/Catego
             | ry:Paradoxes_in_economic..., and we would collectively
             | down-vote a perfectly accurate model of reality for
             | violating our beliefs.
        
         | incorrecthorse wrote:
         | Unless you want an empty test suite or a test suite full of
         | `assert True`, the reward function is more complicated than you
         | think.
        
           | rafaelmn wrote:
           | Code coverage exists. Shouldn't be hard at all to tune the
           | parameters to get what you want. We have really good tools to
           | reason about code programmatically - linters, analyzers,
           | coverage, etc.
        
             | SkiFire13 wrote:
             | In my experience they are ok (not excellent) for checking
             | whether some code will crash or not. But checking whether
             | the code logic is correct with respect to the requirements
              | is far from being automated.
        
               | rafaelmn wrote:
               | But for writing tests that's less of an issue. You start
               | with known good/bad code and ask it to write tests
               | against a spec for some code X - then the evaluation
               | criteria is something like did the test cover the
               | expected lines and produce the expected outcome
               | (success/fail). Pepper in lint rules for preferred style
               | etc.
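                | 
                | A sketch of that scoring, assuming you keep paired
                | known-good and known-buggy versions of the module and
                | the generated test imports it as "target":
                | 
                |   import pathlib, shutil, subprocess, tempfile
                | 
                |   def run_suite(test_code, impl_path):
                |       work = pathlib.Path(tempfile.mkdtemp())
                |       shutil.copy(impl_path, work / "target.py")
                |       (work / "test_gen.py").write_text(test_code)
                |       res = subprocess.run(["pytest", "-q"], cwd=work)
                |       return res.returncode
                | 
                |   def test_reward(test_code, good_impl, buggy_impl):
                |       ok_good = run_suite(test_code, good_impl) == 0
                |       catches = run_suite(test_code, buggy_impl) != 0
                |       # a trivial or empty test passes on both versions
                |       # and so only earns half credit
                |       return 0.5 * ok_good + 0.5 * catches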
        
               | SkiFire13 wrote:
               | But this will lead you to the same problem the tweet is
                | talking about! You are training a reward model based on human
               | feedback (whether the code satisfies the specification or
               | not). This time the human feedback may seem more
               | objective, but in the end it's still non-exhaustive human
               | feedback which will lead to the reward model being
               | vulnerable to some adversarial inputs which the other
               | model will likely pick up pretty quickly.
        
               | rafaelmn wrote:
               | It's based on automated tools and evaluation (test
               | runner, coverage, lint) ?
        
               | SkiFire13 wrote:
               | The input data is still human produced. Who decides what
               | is code that follows the specification and what is code
               | that doesn't? And who produces that code? Are you sure
               | that the code that another model produces will look like
               | that? If not then nothing will prevent you from running
               | into adversarial inputs.
               | 
               | And sure, coverage and lints are objective metrics, but
               | they don't directly imply the correctness of a test. Some
               | tests can reach a high coverage and pass all the lint
               | checks but still be incorrect or test the wrong thing!
               | 
                | Whether the test passes or not is mostly correlated
                | with whether it's correct or not. But similarly, for an
                | image recognizer, the question of whether an image is a
                | flower or not is also objective and correlated, and yet
                | researchers continue to find adversarial inputs for
                | image recognizers due to the bias in their training
                | data. What makes you think this won't happen here too?
        
               | rafaelmn wrote:
               | > The input data is still human produced
               | 
               | So are rules for the game of go or chess ? Specifying
               | code that satisfies (or doesn't satisfy) is a problem
               | statement - evaluation is automatic.
               | 
               | > but they don't directly imply the correctness of a
               | test.
               | 
               | I'd be willing to bet that if you start with an existing
               | coding model and continue training it with coverage/lint
               | metrics and evaluation as feedback you'd get better at
               | generating tests. Would be slow and figuring out how to
               | build a problem dataset from existing codebases would be
               | the hard part.
        
               | SkiFire13 wrote:
               | > So are rules for the game of go or chess ?
               | 
               | The rules are well defined and you can easily write a
               | program that will tell whether a move is valid or not, or
                | whether a game has been won or not. This allows you to
                | generate a virtually infinite amount of data to train
                | the model on without human intervention.
               | 
               | > Specifying code that satisfies (or doesn't satisfy) is
               | a problem statement
               | 
               | This would be true if you fix one specific program (just
               | like in Go or Chess you fix the specific rules of the
               | game and then train a model on those) and want to know
               | whether that specific program satisfies some given
               | specification (which will be the input of your model).
               | But if instead you want the model to work with any
               | program then that will have to become part of the input
                | too and you'll have to train it on a number of programs
               | which will have to be provided somehow.
               | 
               | > and figuring out how to build a problem dataset from
               | existing codebases would be the hard part
               | 
               | This is the "Human Feedback" part that the tweet author
               | talks about and the one that will always be flawed.
        
               | layer8 wrote:
               | Who writes the spec to write tests against?
               | 
                | In the end, you are replacing the application code with
                | a spec, which needs to have a comparable level of detail
                | in order for the AI not to invent its own criteria.
        
             | incorrecthorse wrote:
             | Code coverage proves that the code runs, not that it does
             | what it should do.
        
               | rafaelmn wrote:
               | If you have a test that completes with the expected
               | outcome and hits the expected code paths you have a
               | working test - I'd say that heuristic will get you really
               | close with some tweaks.
        
           | littlestymaar wrote:
           | It's not trivial to get right but it sounds within reach,
           | unlike "hallucinations" with general purpose LLM usage.
        
           | gizmo wrote:
           | It's easy to imagine why something could never work.
           | 
           | It's more interesting to imagine what just might work. One
           | thing that has plagued programmers for the past decades is
           | the difficulty of writing correct multi-threaded software.
           | You need fine-grained locking otherwise your threads will
           | waste time waiting for mutexes. But color-coding your program
           | to constrain which parts of your code can touch which data
           | and when is tedious and error-prone. If LLMs can annotate
           | code sufficiently for a SAT solver to prove thread safety
           | that's a huge win. And that's just one example.
        
             | imtringued wrote:
             | Rust is that way.
        
           | WithinReason wrote:
           | Adversarial networks are a straightforward solution to this.
           | The reward for generating and solving tests is different.
        
             | imtringued wrote:
             | That's a good point. A model that is capable of
             | implementing a nonsense test is still better than a model
             | that can't. The implementer model only needs a good variety
             | of tests. They don't actually have to translate a prompt
             | into a test.
        
         | yard2010 wrote:
          | Once you have enough data points from current usage (and
          | these days every company is tracking EVERYTHING, even eye
          | movement if they could), it's just a matter of time. I do
          | agree, though, that before we reach AGI we'll have these
          | agents that are really good at a defined mission (like code
          | completion).
          | 
          | It's not even about LLMs IMHO. It's about letting a computer
          | crunch many numbers and find a pattern in the results, in a
          | quasi-religious manner.
        
         | anshumankmr wrote:
          | Unless it takes maximizing code coverage as the objective
          | and starts deleting failing test cases.
        
         | FooBarBizBazz wrote:
         | > Coding AI can write tests, write code, compile, examine
         | failed test cases, search for different coding solutions that
         | satisfy more test cases or rewrite the tests, all in an
         | unsupervised loop. And then whole process can turn into
         | training data for future AI coding models.
         | 
         | This is interesting, but doesn't it still need supervision? Why
         | wouldn't it generate tests for properties you don't want? It
         | seems to me that it might be able to "fill in gaps" by
         | generalizing from "typical software", like, if you wrote a
         | container class, it might guess that "empty" and "size" and
         | "insert" are supposed to be related in a certain way, based on
         | the fact that other peoples' container classes satisfy those
         | properties. And if you look at the tests it makes up and go,
         | "yeah, I want that property" or not, then you can steer what
         | it's doing, or it can at least force you to think about more
         | cases. But there would still be supervision.
         | 
         | Ah -- here's an unsupervised thing: Performance. Maybe it can
         | guide a sequence of program transformations in a profile-guided
         | feedback loop. Then you could really train the thing to make
         | fast code. You'd pass "-O99" to gcc, and it'd spin up a GPU
         | cluster on AWS.
        
         | pilooch wrote:
          | Yes, same for maths, as long as a true reward 'surface' can
          | be optimized. Approximate rewards are like approximate,
          | non-admissible heuristics: the search eventually misses the
          | true optimal states and favors wrong ones, with side effects
          | in very large state spaces.
        
         | xxs wrote:
         | This reads as a proper marketing ploy. If the current
         | incarnation of AI + coding is anything to go by - it'll take
         | leaps just to make it barely usable (or correct)
        
           | EugeneOZ wrote:
            | A TDD approach could play the RL role.
        
             | jgalt212 wrote:
              | But what makes you think the AI-generated tests will
             | correctly represent the problem at hand?
        
           | Kiro wrote:
            | My take is the opposite: considering how _good_ AI is at
            | coding right now, I'm eager to see what comes next. I don't
           | know what kind of tasks you've tried using it for but I'm
           | surprised to hear someone think that it's not even "barely
           | usable". Personally, I can't imagine going back to
           | programming without a coding assistant.
        
             | Barrin92 wrote:
             | > but I'm surprised to hear someone think that it's not
             | even "barely usable".
             | 
              | Try writing performance-oriented and memory-safe C++
              | code. Current coding assistants are glorified
              | autocomplete for unit tests or short API endpoints or
              | what have you, but if you have to write any
              | safety-oriented code or you have to think about what the
              | hardware does, it's unusable.
             | 
             | I tried using several of the assistants and they write
             | broken or non-performant code so regularly it's
             | irresponsible to use them.
        
               | agos wrote:
               | I've also had trouble having assistants help with CSS,
               | which is ostensibly easier than performance oriented and
               | memory safe C++
        
               | imtringued wrote:
               | Isn't this a good reward function for RL? Take a
               | codebase's test suite. Rip out a function, let the LLM
               | rewrite the function, benchmark it and then RL it using
               | the benchmark results.
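                | 
                | A minimal harness along those lines, as a sketch:
                | rewrite_fn stands in for the model call, the repo's own
                | pytest suite is the oracle, and base_s is the suite's
                | baseline runtime in seconds:
                | 
                |   import pathlib, subprocess, time
                | 
                |   def episode(repo, src, old_body, base_s, rewrite_fn):
                |       path = pathlib.Path(repo) / src
                |       original = path.read_text()
                |       new_body = rewrite_fn(old_body)  # model call
                |       patched = original.replace(old_body, new_body)
                |       path.write_text(patched)
                |       try:
                |           t0 = time.perf_counter()
                |           ok = subprocess.run(["pytest", "-q"],
                |                               cwd=repo).returncode == 0
                |           elapsed = time.perf_counter() - t0
                |       finally:
                |           path.write_text(original)  # restore the repo
                |       # penalise breaking the suite, otherwise reward
                |       # the speedup of the run relative to the baseline
                |       return base_s / elapsed if ok else -1.0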
        
             | commodoreboxer wrote:
             | I've been playing with it recently, and I find unless there
             | are very clear patterns in surrounding code or on the
             | Internet, it does quite terribly. Even for well-seasoned
             | libraries like V8 and libuv, it can't reliably not make up
             | APIs that don't exist and it very regularly spits out
             | nonsense code. Sometimes it writes code that works and does
             | the wrong thing, it can't reliably make good decisions
             | around undefined behavior. The worst is when I've asked for
             | it to refactor code, and it actually subtly changes the
             | behavior in the process.
             | 
             | I imagine it's great for CRUD apps and generating unit
             | tests, but for anything reliable where I work, it's not
             | even close to being useful at all, let alone a game
             | changer. It's a shame, because it's not like I really enjoy
             | fiddling with memory buffers and painstakingly avoiding UB,
             | but I still have to do it (I love Rust, but it's not an
             | option for me because I have to support AIX. V8 in Rust
             | also sounds like a nightmare, to be honest. It's a very C++
             | API).
        
             | ben_w wrote:
             | I've seen them all over the place.
             | 
             | The best are shockingly good... so long as their context
             | doesn't expire and they forget e.g. the Vector class they
             | just created has methods `.mul(...)` rather than
             | `.multiply(...)` or similar. Even the longer context
             | windows are still too short to really take over our jobs
             | (for now), the haystack tests seem to be over-estimating
             | their quality in this regard.
             | 
              | The worst LLMs that I've seen (one of the downloadable
             | run-locally models but I forget which) -- one of my
             | standard tests is that I ask them to "write Tetris as a web
             | app", and it started off doing something a little bit wrong
             | (square grid), before _giving up on that task entirely and
             | switching from JavaScript to python and continuing by
             | writing a script to train a new machine learning model_
             | (and people still ask how these things will  "get out of
             | the box" :P)
             | 
             | People who see more of the latter? I can empathise with
             | them dismissing the whole thing as "just autocomplete on
             | steroids".
        
         | CuriouslyC wrote:
         | Models aren't going to get really good at theorem proving until
         | we build models that are transitive and handle isomorphisms
         | more elegantly. Right now models can't recall factual
         | relationships well in reverse order in many cases, and often
         | fail to answer questions that they can answer easily in English
         | when prompted to respond with the fact in another language.
        
         | djeastm wrote:
         | >Coding AI can write tests, write code, compile, examine failed
         | test cases, search for different coding solutions that satisfy
         | more test cases or rewrite the tests, all in an unsupervised
         | loop.
         | 
         | Will this be able to be done without spending absurd amounts of
         | energy?
        
           | jimbokun wrote:
           | Energy efficiency might end up being the final remaining axis
           | on which biological brains surpass manufactured ones before
           | the singularity.
        
           | commodoreboxer wrote:
            | The amount of energy is truly absurd. I don't chug a 16 oz
           | bottle of water every time I answer a question.
        
           | ben_w wrote:
            | Computer energy efficiency is not as constrained as minimum
            | feature size; it's still doubling every 2.6 years or so.
           | 
           | Even if they were, a human-quality AI that runs at human-
           | speed for x10 our body's calorie requirements in electricity,
           | would still (at electricity prices of USD 0.1/kWh) undercut
           | workers earning the UN abject poverty threshold.
        
         | jimbokun wrote:
         | Future coding where developers only ever write the tests is an
         | intriguing idea.
         | 
         | Then the LLM generates and iterates on the code until it passes
         | all of the tests. New requirements? Add more tests and repeat.
         | 
         | This would be legitimately paradigm shifting, vs. the super
         | charged auto complete driven by LLMs we have today.
        
           | layer8 wrote:
           | Tests don't prove correctness of the code. What you'd really
           | want instead is to specify invariants the code has to
           | fulfill, and for the AI to come up with a machine-checkable
           | proof that the code indeed guarantees those invariants.
        
         | lewtun wrote:
         | > I expect language models to also get crazy good at
         | mathematical theorem proving
         | 
         | Indeed, systems like AlphaProof / AlphaGeometry are already
         | able to win a silver medal at the IMO, and the former relies on
         | Lean for theorem verification [1]. On the open source side, I
         | really like the ideas in LeanDojo [2], which use a form of RAG
         | to assist the LLM with premise selection.
         | 
         | [1] https://deepmind.google/discover/blog/ai-solves-imo-
         | problems...
         | 
         | [2] https://leandojo.org/
        
         | lossolo wrote:
          | Writing tests won't help you here; this problem is the same as
         | other generation tasks. If the test passes, everything seems
         | okay, right? Consider this: you now have a 50-line function
         | just to display 'hello world'. It outputs 'hello world', so it
         | scores well, but it's hardly efficient. Then, there's a
         | function that runs in exponential time instead of the standard
         | polynomial time that any sensible programmer would use in
         | specific cases. It passes the tests, so it gets a high score.
         | You also have assembly code embedded in C code, executed with
         | 'asm'. It works for that particular case and passes the test,
         | but the average C programmer won't understand what's happening
         | in this code, whether it's secure, etc. Lastly, tests written
          | by AI might not cover all cases; they could even fail to test
         | what you intended because they might hallucinate scenarios
         | (I've experienced this many times). Programming faces similar
         | issues to those seen in other generation tasks in the current
         | generation of large language models, though to a slightly
         | lesser extent.
        
           | jononor wrote:
            | One can imagine critics and code rewriters that optimize
            | for computational cost, code style, and other requirements
            | in addition to tests.
        
       | mjburgess wrote:
       | It always annoys and amazes me that people in this field have no
       | basic understanding that closed-world finite-information abstract
       | games are a unique and trivial problem. So much of the so-called
       | "world model" ideological mumbojumbo comes from these setups.
       | 
       | Sampling board state from an abstract board space isn't a
       | statistical inference problem. There's no missing information.
       | 
       | The whole edifice of science is a set of experimental and
       | inferential practices to overcome the massive information gap
       | between the state of a measuring device and the state of what, we
       | believe, it measures.
       | 
       | In the case of natural language the gap between a sequence of
       | symbols, "the war in ukraine" and those aspects of the world
       | these symbols refer to is _enormous_.
       | 
       | The idea that there is even a RL-style "reward" function to
       | describe this gap is pseudoscience. As is the false equivocation
       | between sampling of abstracta such as games, and _measuring the
       | world_.
        
         | harshitaneja wrote:
          | Forgive my naivete here, but even though solutions to those
          | finite-information abstract games are trivial, they are not
          | necessarily tractable (for a looser definition of tractable
          | here), and we still need to build heuristics for the subclass
          | of such problems where we need solutions in a given finite
          | time frame. Those heuristics might not be easy to deduce, and
          | hence such models help in ascertaining them.
        
           | mjburgess wrote:
           | Yes, and this is how computer "scientists" think of problems
           | -- but this isnt science, it's mathematics.
           | 
            | If you have a process, eg., points = sample(circle), which
            | fully describes its target as n->inf (ie., points = circle
            | as n->inf), you aren't engaged in statistical inference. You
            | might be using some of the same formulae, but the whole
            | system of
           | science and statistics has been created for a radically
           | different problem with radically different semantics to
           | everything you're doing.
           | 
            | eg., the height of mercury in a thermometer _never_ becomes
            | the liquid being measured. It might seem insane/weird/
            | obvious to mention this... but we literally have
            | berkelian-style neoidealists in AI research who don't
            | realise this...
           | 
           | Who think that because you can find representations of
           | abstracta in other spaces they can be projected in.. that
           | this therefore tells you anything at all about inference
           | problems. As if it was the neural network algorithm itself (a
           | series of multiplications and additions) that "revealed the
           | truth" in all data given to it. This, of course, is
           | pseudoscience.
           | 
           | It only applies on mathematical problems, for obvious
           | reasons. If you use a function approximation alg to
           | approximate a function, do not be suprised you can succeed.
           | The issue is that the relationship between, say, the state of
            | a thermometer and the state of the temperature of its
           | target system is not an abstract function which lives in the
           | space of temperature readings.
           | 
           | More precisely, in the space of temperature readings the
            | actual causal relationship between the height of the mercury
           | and the temperature of the target shows up as an _infinite_
           | number of temperature distributions (with any given trained
           | NN learning only one of these). None of which are a law of
           | nature -- laws of nature are not given by distributions in
           | measuring devices.
        
         | pyrale wrote:
         | > [...] and trivial problem.
         | 
         | It just took decades and impressive breakthroughs to solve, I
         | wouldn't really call it "trivial". However, I do agree with you
         | that they're a class of problem different from problems with no
         | clear objective function, and probably much easier to reason
         | about that.
        
           | mjburgess wrote:
           | They're a trivial inference problem, not a trivial problem to
           | solve as such.
           | 
           | As in, if i need to infer the radius of a circle from N
            | points sampled from that circle... yes, I'm sure there's a
           | textbook of algorithms/etc. with a lot of work spent on them.
           | 
           | But in the sense of statistical inference, you're only
           | learning a property of a distribution given that
           | distribution.. there isn't any inferential gap. As N->inf,
           | you recover the entire circle itself.
           | 
           | compare with say, learning the 3d structure of an object from
           | 2d photographs. At any rotation of that object, you have a
           | new pixel distribution. So in pixel-space a 3d object is an
           | infinite number of distributions; and the inference goal in
           | pixel-space is to choose between sets of these infinities.
           | 
           | That's actually impossible without bridging information (ie.,
           | some theory). And in practice, it isn't solved in pixel
           | space... you suppose some 3d geometry and use data to refine
           | it. So you solve it in 3d-object-property-space.
           | 
           | With AI techniques, you have ones which work on abstracta
           | (eg., circles) being used on measurement data. So you're
           | solving the 3d/2d problem in pixel space, expecting this to
           | work because "objects are made out of pixels, arent they?"
           | NO.
           | 
           | So there's a huge inferential gap that you cannot bridge
            | here. And the young AI fanatics in research keep milling out
            | papers showing that it does work, so long as it's a circle,
            | chess, or some abstract game.
        
         | meroes wrote:
         | Yes. Quantum mechanics for example is not something that could
         | have been thought of even conceptually by anything "locked in a
         | room". Logically coherent structure space is so mind bogglingly
         | big we will never come close to even the smallest fraction of
         | it. Science recognizes that only experiments will bring
         | structures like QM out of the infinite sea into our conceptual
         | space. And as a byproduct of how experiments work, the concepts
         | will match (model) the actual world fairly well. The armchair
         | is quite limiting, and I don't see how LLMs aren't locked to
         | it.
         | 
         | AGI won't come from this set of tools. Sam Altman just wants to
         | buy himself a few years of time to find their next product.
        
         | DAGdug wrote:
          | Who doesn't? Karpathy, and pretty much every researcher at
         | OpenAI/Deepmind/FAIR absolutely knows the trivial concept of
         | fully observable versus partially observable environments,
         | which is 101 reinforcement learning.
        
           | mjburgess wrote:
           | Many don't understand it as a semantic difference
           | 
            | ie., that when you're taking data from a thermometer in
            | order to estimate the temperature of coffee, the issue isn't
            | simply partial information.
            | 
            | It's that the information is _about_ the mercury, not the
            | coffee. In order to bridge the two you need a theory (eg.,
            | about the causal reliability of heating / room temp / etc.)
            | 
            | So this isn't just a partial/full information problem --
            | these are still mathematical toys. This is a _reality_
            | problem. This is a "you're dealing with a causal
            | relationship between physical systems" problem. This is
            | _not_ a mathematical relationship. It isn't merely partial;
            | it is not a matter of "information" at all. No amount could
            | ever make the mercury, coffee.
           | 
           | Computer scientists have been trained on mathematics and
           | deployed as social scientists, and the naivete is incredible
        
       | toxik wrote:
       | Another little tidbit about RLHF and InstructGPT is that the
       | training scheme is by far dominated by supervised learning. There
       | is a bit of RL sprinkled on top, but the term is scaled down by a
       | lot and 8x more compute time is spent on the supervised loss
       | terms.
        
       | daly wrote:
        | I think that the field of proofs, such as LEAN, is a natural
        | fit for RL: it has states (the current subgoal), actions (the
        | applicable theorems, especially effective in LEAN due to strong
        | typing of arguments), a progress measure (simplified subgoals),
        | a final goal state (the proof completes), and a hierarchy in
        | the theorems, so there is a "path metric" from simple theorems
        | to complex theorems.
        | 
        | If Karpathy were to focus on automating LEAN proofs it could
        | change mathematics forever.
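        | 
        | To make the state/action framing concrete, here is a toy Lean
        | 4 proof; each tactic is an "action" and the goal before and
        | after it is the "state" an RL policy would search over:
        | 
        |   theorem and_intro_ex (p q : Prop) (hp : p) (hq : q) :
        |       p ∧ q := by
        |     constructor  -- the goal p ∧ q splits into two subgoals
        |     · exact hp   -- action closing the first subgoal (p)
        |     · exact hq   -- action closing the second (q); proof done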
        
         | jomohke wrote:
          | DeepMind's recent model is trained with Lean. It scored a
          | silver medal at the olympiad (and was only one point away
          | from gold).
         | 
         | > AlphaProof is a system that trains itself to prove
         | mathematical statements in the formal language Lean. It couples
         | a pre-trained language model with the AlphaZero reinforcement
         | learning algorithm, which previously taught itself how to
         | master the games of chess, shogi and Go
         | 
         | https://deepmind.google/discover/blog/ai-solves-imo-problems...
        
       | cesaref wrote:
        | The final conclusion, though, stands without any justification:
        | that LLM + RL will somehow out-perform people at open-domain
        | problem solving seems quite a jump to me.
        
         | dosinga wrote:
         | To be fair, it says "has a real shot at" and AlphaGo level.
         | AlphaGo clearly beat humans on Go, so thinking that if you
         | could replicate that, it would have a shot doesn't seem crazy
         | to me
        
           | SiempreViernes wrote:
           | That only makes sense if you think Go is as expressive as
           | written language.
           | 
            | And here I mean that it is the act of making a single
            | (plausible) move that must match the expressiveness of
            | language, because otherwise you're not in the domain of Go
            | but in the far less interesting "I have a 19x19 pixel grid
            | and two colours".
        
           | HarHarVeryFunny wrote:
           | AlphaGo has got nothing to do with LLMs though. It's a
           | combination of RL + MCTS. I'm not sure where you are seeing
           | any relevance! DeepMind also used RL for playing video games
           | - so what?!
        
         | esjeon wrote:
         | I think the point is that it's practically impossible to
         | correctly perform RLHF in open domains, so comparisons simply
         | can't happen.
        
       | bjornsing wrote:
       | Perhaps not entirely open domain, but I have high hopes for "real
       | RL" in coding, where you can get a reward signal from
       | compile/runtime errors and tests.
        
         | falcor84 wrote:
         | Interesting, has anyone been doing this? I.e. training/fine-
          | tuning an LLM against an actual coding environment, as opposed
          | to just tacking that on later as a separate "agentic"
          | construct?
        
           | bjornsing wrote:
           | I suspect that the big vendors are already doing it, but I
           | haven't seen a paper on it.
        
       | pyrale wrote:
       | It's a bit disingenuous to pick go as a case to make the point
       | against RLHF.
       | 
       | Sure, a board game with an objective winning function at which
       | computers are already better than humans won't get much from
       | RLHF. That doesn't look like a big surprise.
       | 
       | On the other hand, a LLM trained on lots of not-so-much curated
       | data will naturally pick up mistakes from that dataset. It is not
       | really feasible or beneficial to modify the dataset exhaustively,
       | so you reinforce the behaviour that is expected at the end. An
       | example would be training an AI in a specific field of work: it
        | could repeat advice from amateurs on forums, when less-known
       | professional techniques would be more advisable.
       | 
       | Think about it like kids naturally learning swear words at
       | school, and RLHF like parents that tell their kids that these
       | words are inappropriate.
       | 
       | The tweet conclusion seems to acknowledge that, but in a wishful
       | way that doesn't want to concede the point.
        
       | epups wrote:
        | This is partially the reason why we see LLMs "plateauing" in
        | the benchmarks. In the lmsys Arena, for example, LLMs are
        | simply judged on whether the user liked the answer or not.
        | Truth is a secondary part of that process, as are many other
        | things that perhaps humans are not very good at evaluating.
        | There is a limit to the capacity and value of having LLMs chase
        | RLHF as a reward function. As Karpathy says here, we could even
        | argue that it is counterproductive to build a system based on
        | human opinion, especially if we want the system to surpass us.
        
         | HarHarVeryFunny wrote:
         | RLHF really isn't the problem as far as surpassing human
         | capability - language models trained to mimic human responses
         | are fundamentally not going to do anything other than mimic
         | human responses, regardless of how you fine-tune them for the
         | specific type of human responses you do or don't like.
         | 
         | If you want to exceed human intelligence, then design
         | architectures for intelligence, not for copying humans!
        
       | tpoacher wrote:
       | I get the point of the article, but I think it makes a bit of a
       | strawman to drive the point across.
       | 
       | Yes, RLHF is barely RL, but you wouldn't use human feedback to
       | drive a Go game unless there was no better alternative; and in
       | RL, finding a good reward function is the name of the game; once
       | you have that, you have no reason to prefer human feedback,
       | especially if it is demonstrably worse. So, no, nobody would
       | actually "prefer RLHF over RL" given the choice.
       | 
       | But for language models, human feedback IS the ground truth (at
       | least until we find a better, more mathematical alternative). If
       | it weren't and we had something better, then we'd use that. But
       | we don't. So no, RLHF is not "worse than RL" in this case,
       | because there 'is' no 'other' RL in this case; so, here, RLHF
       | actually _is_ RL.
        
         | SiempreViernes wrote:
         | If you cut out humans, what's the point of the language? Just
         | use a proper binary format, I hear protobuf is popular.
        
       | Xcelerate wrote:
       | One thing I've wondered about is what the "gap" between current
       | transformer-based LLMs and optimal sequence prediction looks
       | like.
       | 
       | To clarify, current LLMs (without RLHF, etc.) have a very
       | straightforward objective function during training, which is to
       | minimize the cross-entropy of token prediction on the training
       | data. If we assume that our training data is sampled from a
       | population generated via a finite computable model, then
       | Solomonoff induction achieves optimal sequence prediction.
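        | 
        | For concreteness, that objective is just the average negative
        | log-probability the model assigns to the observed next token;
        | a toy version (ignoring batching and softmax details):
        | 
        |   import math
        | 
        |   def next_token_cross_entropy(pred_dists, target_ids):
        |       # pred_dists[i]: dict of token id -> probability the
        |       # model predicts at position i; target_ids[i]: the
        |       # token that actually followed in the training data
        |       losses = [-math.log(dist[t])
        |                 for dist, t in zip(pred_dists, target_ids)]
        |       return sum(losses) / len(losses)
        | 
        | A perfect predictor drives this to 0; a uniform predictor
        | sits at the log of the vocabulary size.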
       | 
       | Assuming we had an oracle that could perform SI (since it's
       | uncomputable), how different would conversations between GPT4 and
       | SI be, given the same training data?
       | 
       | We know there would be at least a few notable differences. For
       | example, we could give SI the first 100 digits of pi, and it
       | would give us as many more digits as we wanted. Current
       | transformer models cannot (directly) do this. We could also give
       | SI a hash and ask for a string that hashes to that value. Clearly
       | a lot of hard, formally-specified problems could be solved this
       | way.
       | 
       | But how different would SI and GPT4 appear in response to
       | everyday chit-chat? What if we ask the SI-based sequence
       | predictor how to cure cancer? Is the "most probable" answer to
       | that question, given its internet-scraped training data, an
       | answer that humans find satisfying? Probably not, which is why
       | AGI requires something beyond just optimal sequence prediction.
       | It requires a really good objective function.
       | 
       | My first inclination for this human-oriented objective function
       | is something like "maximize the probability of providing an
       | answer that the user of the model finds satisfying". But there is
        | more than one user, so which set of humans do we consider, and
        | with which aggregation (avg satisfaction, p99 satisfaction,
        | etc.)?
       | 
       | So then I'm inclined to frame the problem in terms of well-being:
       | "maximize aggregate human happiness over all time" or "minimize
       | the maximum of human suffering over all time". But each of these
       | objective functions has notable flaws.
       | 
       | Karpathy seems to be hinting toward this in his post, but the
       | selection of an overall optimal objective function _for human
       | purposes_ seems to be an incredibly difficult philosophical
       | problem. There is no objective function I can think of for which
       | I cannot also immediately think of flaws with it.
        
         | bick_nyers wrote:
         | Alternatively, you include information about the user of the
         | model as part of the context to the inference query, so that
         | the model can uniquely optimize its answer for that user.
         | 
         | Imagine if you could give a model "how you think" and your
         | knowledge, experiences, and values as context, then it's
         | "Explain Like I'm 5" on steroids. Both exciting and terrifying
         | at the same time.
        
           | Xcelerate wrote:
           | > Alternatively, you include information about the user of
           | the model as part of the context to the inference query
           | 
           | That was sort of implicit in my first suggestion for an
           | objective function, but do you _really_ want the model to be
           | optimal on a per-user basis? There's a lot of bad people out
           | there. That's why I switched to an objective function that
           | considers all of humanity's needs together as a whole.
        
             | bick_nyers wrote:
             | Objective Function: Optimize on a per-user basis.
             | Constraints: Output generated must be considered legal in
             | user's country.
             | 
              | Both things can co-exist without being in conflict with each
             | other.
             | 
             | My (hot) take is I personally don't believe that any LLM
             | that can fit on a single GPU is capable of significant
             | harm. An LLM that fits on an 8xH100 system perhaps, but I
             | am more concerned about other ways an individual could
             | spend ~$300k with a conviction of harming others. Besides,
             | looking up how to make napalm on Google and then actually
             | doing it and using it to harm others doesn't make Google
             | the one responsible imo.
        
         | marcosdumay wrote:
         | > What if we ask the SI-based sequence predictor how to cure
         | cancer? Is the "most probable" answer to that question, given
         | its internet-scraped training data, an answer that humans find
         | satisfying?
         | 
          | You defined your predictor as being able to minimize
          | mathematical definitions following some unspecified algebra;
          | why didn't you define it as being able to run chemical and
          | pharmacological simulations through some unspecified model
          | too?
        
           | Xcelerate wrote:
           | I don't follow--what do you mean by unspecified algebra?
           | Solomonoff induction is well-defined. I'm just asking how the
           | responses of a chatbot using Solomonoff induction for
           | sequence prediction would differ from those using a
           | transformer model, given the same training data. I can
           | specify mathematically if that makes it clearer...
        
         | JoshuaDavid wrote:
         | >But how different would SI and GPT4 appear in response to
         | everyday chit-chat? What if we ask the SI-based sequence
         | predictor how to cure cancer?
         | 
         | I suspect that a lot of LLM prompts that elicit useful
         | capabilities out of imperfect sequence predictors like GPT-4
         | are in fact most likely to show up in the context of "prompting
         | an LLM" rather than being likely to show up "in the wild".
         | 
         | As such, to predict the token after a prompt like that, an SI-
         | based sequence predictor would want to predict the output of
         | whatever language model was most likely to be prompted,
         | conditional on the prompt/response pair making it into the
         | training set.
         | 
         | If the answer to "what model was most likely to be prompted"
         | was "the SI-based sequence predictor", then it needs to predict
         | which of its own likely outputs are likely to make it into the
         | training set, which requires it to have a probability
         | distribution over its own output. I think the "did the model
         | successfully predict the next token" reward function is
         | underspecified in that case.
         | 
         | There are many cases like this where the behavior of the system
         | in the limit of perfect performance at the objective is
         | undesirable. Fortunately for us, we live in a finite universe
         | and apply finite amounts of optimization power, and lots of
         | things that are useless or malign in the limit are useful in
         | the finite-but-potentially-quite-large regime.
        
       | cs702 wrote:
       | Indeed. The reward function we're using in RLHF today induces AI
       | models to behave in ways that superficially seem better to human
       | beings on average, but what we actually want is to induce them to
       | _solve_ cognitive tasks, with human priorities.
       | 
       | The _multi-trillion dollar question_ is: What is the objective
       | reward that would induce AI models like LLMs to behave like AGI
       | -- while adhering to all the limits we human beings wish to
        | impose on AGI behavior?
       | 
       | I don't think anyone has even a faint clue of the answer yet.
        
         | Xcelerate wrote:
         | > The multi-trillion dollar question is: What is the objective
         | reward that would induce AI models like LLMs to behave like AGI
         | 
         | No, the reward for finding the right objective function is a
         | good future for all of humanity, given that we already have an
         | algorithm for AGI.
         | 
         | The objective function to acquire trillions of dollars is
         | trivial: it's the same minimization of cross-entropy that we
         | already use for sequence prediction. What's missing is a better
         | algorithm, which is probably a good thing at the moment,
         | because otherwise someone could trivially drain all value from
         | the stock market.
        
           | cs702 wrote:
           | You misunderstand.
           | 
           | The phrase "the multi-trillion dollar question" has _nothing_
           | to do with acquiring trillions of dollars.
           | 
           | The phrase is _idiomatic_ , indicating a crucial question,
           | like "the $64,000 question," but implying much bigger
           | stakes.[a]
           | 
           | ---
           | 
           | [a] https://idioms.thefreedictionary.com/The+%2464%2c000+Ques
           | tio...
        
             | Xcelerate wrote:
             | Ah, I see. Apologies.
        
               | cs702 wrote:
               | Thank you.
               | 
               | By the way, I agree with you that "a good future for all
               | of humanity" would be a fantastic goal :-)
               | 
               | The multi-trillion dollar question is: How do you specify
               | that goal as an objective function?
        
         | HarHarVeryFunny wrote:
         | You can't just take an arbitrary neural network architecture,
         | and make it do anything by giving it an appropriate loss
         | function, and in particular you can't take a simple feed
         | forward model like a Transformer and train it to be something
         | other than a feed forward model... If the model architecture
         | doesn't have feedback paths (looping) or memory that persists
         | from one input to the next, then no reward function is going to
         | make it magically sprout those architectural modifications!
         | 
         | Today's Transformer-based LLMs are just what the name says -
         | (Large) Language Models - fancy auto-complete engines. They are
         | not a full blown cognitive architecture.
         | 
         | I think many people do have a good idea of how to build
         | cognitive architectures and of which missing pieces are needed
         | for AGI, and some people are working on that, but for now all
         | the money and news cycles are going into LLMs. As Chollet says,
         | they have sucked all the oxygen out of the room.
        
       | Kim_Bruning wrote:
       | Ceterum censeo install AI in cute humanoid robot.
       | 
       | Robot because physical world provides a lot of RL for free (+ and
       | -).
       | 
       | Humanoid because known quantity.
       | 
       | Cute because people put up with cute a lot more, and lower
       | expectations.
        
       | bilsbie wrote:
       | @mods. I submitted this exact link yesterday. Shouldn't this post
       | have shown it was already submitted?
       | 
       | https://news.ycombinator.com/item?id=41184948
        
         | defrost wrote:
         | Maybe.
         | 
         | There _might_ be logic that says an 'old' link (> 12 hours,
         | say) with no comments doesn't need to be cross-linked if
         | submitted later (or some other rule).
         | 
         | In any case, @mods and @dang do not work (save by chance) .. if
         | you think it's worth bringing to attention then there's
         | generally no downside to simply emailing directly to hn #
         | ycombinator dot com from your login email.
        
       | NalNezumi wrote:
       | While I agree with Karpathy, and I also had a "wut? They call
       | this RL?" reaction when RLHF was presented as a method for
       | ChatGPT training, I'm a bit surprised by the insight he makes,
       | because the same method and insight were already laid out in
       | "Learning from Human Preferences" [1] from none other than
       | OpenAI, published in 2017.
       | 
       | Sometimes judging a "good enough" policy is orders of magnitude
       | easier than formulating an exact reward function, but this is
       | very much domain and scope dependent. Trying to estimate a reward
       | function in those situations can often be counterproductive,
       | because the reward might even steer your search in the wrong
       | direction. The authors (researchers) of the book "The Myth of the
       | Objective" [2] made the same observation with their Picbreeder
       | example. (The authors also happen to work for OpenAI.)
       | 
       | When you have a well-defined reward function with no bad local
       | optima _and no cost to rolling out faulty policies_ , RL works
       | remarkably well. (Alex Irpan described this well in his widely
       | cited blog post [3].)
       | 
       | The problem is that these are hard requirements to meet for most
       | problems that interact with the real world (rather than the
       | internet, the artificial world). Either the local optima get in
       | the way (LLMs and text), or the rollout cost does (running a Go
       | game a billion times just to beat humans is not a feasible
       | requirement for a lot of real-world applications).
       | 
       | Tangentially, this is also why I suspect LLMs for planning (and
       | understanding the world) in the real world have been lacking.
       | Robot Transformer and SayCan approaches are cool, but if you look
       | past the fancy demos the performance is indeed lackluster.
       | 
       | It will be interesting to see how these observations, and
       | Karpathy's, will be tested by the current humanoid-robot hype,
       | which IMO is partially fueled by a misunderstanding of LLMs'
       | capabilities, including what Karpathy mentioned. (Shameless plug:
       | [4])
       | 
       | [1] https://openai.com/index/learning-from-human-preferences/
       | 
       | [2] https://www.lesswrong.com/posts/pi4owuC7Rdab7uWWR/book-
       | revie...
       | 
       | [3] https://www.alexirpan.com/2018/02/14/rl-hard.html
       | 
       | [4] https://harimus.github.io//2024/05/31/motortask.html
        
       | bubblyworld wrote:
       | The SPAG paper is an interesting example of true reinforcement
       | learning using language models that improves their performance on
       | a number of hard reasoning benchmarks.
       | https://arxiv.org/abs/2404.10642
       | 
       | The parts that are missing from Karpathy's rant are "at scale"
       | (the researchers only ran 3 iterations of the algorithm on small
       | language models) and "open domains" (I could be wrong about this,
       | but IIRC they ran their games on a small number of common English
       | words). But adversarial language games seem promising, at least.
        
         | textlapse wrote:
         | That's a cool paper - but it seems like it produces better
         | debaters, not better content? To truly use RL's strengths, it
         | would have to be a battle over content (the model or world
         | representation), not mere token-level battles.
         | 
         | I am not sure how that works at the prediction stage, as
         | language isn't the problem here.
        
           | bubblyworld wrote:
           | I think the hypothesis is that "debating" via the right
           | adversarial word game may naturally select for better
           | reasoning skills. There's some evidence for that in the
           | paper, namely that it (monotonically!) improves the model's
           | performance on seemingly unrelated reasoning stuff like the
           | ARC dataset. Which is mysterious! But yeah, it's much too
           | early to tell, although IIRC the results have been replicated
           | already so that's something.
           | 
           | (by the way, I don't think "debating" is the right term for
           | the SPAG game - it's quite subtle and isn't about arguing for
           | a point, or rhetoric, or anything like that)
        
       | rossdavidh wrote:
       | The problem of various ML algorithms "gaming" the reward function
       | is rather similar to a number of financial and economic problems.
       | If people are not trying to do something useful and then
       | expecting $$ in return for it, but are just trying to get $$
       | without knowing or caring what is productive, you get a lot of
       | non-productive activity (spam, scams, pyramid schemes, high-
       | frequency trading, etc.) that isn't actually producing anything
       | but does take over a larger and larger percentage of the economy.
       | 
       | To mitigate this, you have to have a system outside of that,
       | which penalizes "gaming" the reward function. This system has to
       | have some idea of what real value is, to be able to spot cases
       | where the reward function is high but the value is low. We have a
       | hard enough time of this in the money economy, where we've been
       | learning for centuries. I do not think we are anywhere close in
       | neural networks.
        
         | csours wrote:
         | Commenting to follow this.
         | 
         | There is a step like this in ML. I think it's pretty
         | interesting that topics from things like economics pop up in ML
         | - although perhaps it's not too surprising as we are doing ML
         | for humans to use.
        
           | layer8 wrote:
           | > Commenting to follow this.
           | 
           | You can "favorite" comments on HN to bookmark them.
        
         | bob1029 wrote:
         | > This system has to have some idea of what real value is
         | 
         | This is probably the most cursed problem ever.
         | 
         | Assuming you could develop such a system, why wouldn't you just
         | incorporate its logic into the original fitness function and be
         | done with it?
         | 
         | I think the answer is that such a system can probably never be
         | developed. At some level humans must be involved, adapting the
         | function over time to meet expectations as training progresses.
         | 
         | The information used to train on is beyond critical, but
         | heuristics about which information matters more than other
         | information in a given context might be even more important.
        
       | voiceblue wrote:
       | > Except this LLM would have a real shot of beating humans in
       | open-domain problem solving.
       | 
       | At some point we need to start recognizing LLMs for what they are
       | and stop making outlandish claims like this. A moment of
       | reflection ought to reveal that "open domain problem solving" is
       | not what an LLM does.
       | 
       | An LLM could not, for example, definitively come up with the
       | three laws of planetary motion like Kepler did (he looked at the
       | data), in the absence of a prior formulation of these laws in the
       | training set.
       | 
       | TFA describes a need for scoring qualitative responses to human
       | queries at scale. Certainly that's important (it's what Google is
       | built upon), but we don't need to make outlandish claims about
       | LLM capabilities to achieve it.
       | 
       | Or maybe we do if our next round of funding depends upon it.
        
         | textlapse wrote:
         | In terms of energy, it's provably impossible for a next-word
         | predictor that spends constant energy per token to come up with
         | anything that's not in its training data. (I think Yann LeCun
         | came up with this argument?)
         | 
         | It seems to me RL was quite revolutionary (especially with
         | protein folding/AlphaGo) - but using a minimal form of it to
         | solve a training (not prediction) problem seems rather like
         | bringing a bazooka to a banana fight.
         | 
         | Using explore/exploit methods to search potential problem
         | spaces might really help propel this field forward. But the
         | energy requirements do not favor the incumbents, since
         | everything is now scaled to the current classic LLM format.
        
         | visarga wrote:
         | > An LLM, could not, for example, definitively come up with the
         | three laws of planetary motion like Kepler did (he looked at
         | the data)
         | 
         | You could use symbolic regression instead, and the LLM could
         | write the code. Under the hood it would use a genetic
         | programming library that provides something like a
         | SymbolicRegressor.
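         | 
         | A rough sketch of the kind of thing I mean, assuming gplearn's
         | SymbolicRegressor and the textbook orbital data (not anything
         | from the article):
         | 
         |     # Recovering Kepler's third law (T^2 = a^3 in AU / years)
         |     # with genetic-programming symbolic regression.
         |     import numpy as np
         |     from gplearn.genetic import SymbolicRegressor
         | 
         |     a = np.array([0.39, 0.72, 1.0, 1.52, 5.20, 9.54])    # AU
         |     T = np.array([0.24, 0.62, 1.0, 1.88, 11.86, 29.46])  # yr
         | 
         |     est = SymbolicRegressor(population_size=2000,
         |                             generations=20,
         |                             function_set=('mul', 'div', 'sqrt'),
         |                             parsimony_coefficient=0.01,
         |                             random_state=0)
         |     est.fit(a.reshape(-1, 1), T)
         |     # expect something equivalent to sqrt(x0 * x0 * x0)
         |     print(est._program)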
         | 
         | Found a reference:
         | 
         | > AI-Descartes, an AI scientist developed by researchers at IBM
         | Research, Samsung AI, and the University of Maryland, Baltimore
         | County, has reproduced key parts of Nobel Prize-winning work,
         | including Langmuir's gas behavior equations and Kepler's third
         | law of planetary motion. Supported by the Defense Advanced
         | Research Projects Agency (DARPA), the AI system utilizes
         | symbolic regression to find equations fitting data, and its
         | most distinctive feature is its logical reasoning ability. This
         | enables AI-Descartes to determine which equations best fit with
         | background scientific theory. The system is particularly
         | effective with noisy, real-world data and small data sets. The
         | team is working on creating new datasets and training computers
         | to read scientific papers and construct background theories to
         | refine and expand the system's capabilities.
         | 
         | https://scitechdaily.com/ai-descartes-a-scientific-renaissan...
        
       | nothrowaways wrote:
       | AlphaGo is not a good example in this case.
        
       | nickpsecurity wrote:
       | It sounds really negative about RLHF. Yet, if I've read
       | correctly, that's a big part of how ChatGPT and Claude got so
       | effective. There are companies collecting quality human responses
       | to many prompts. Companies making models buy them. Even the
       | synthetic examples come from models that largely extrapolate what
       | humans wrote in their pre-training data.
       | 
       | So I'm defaulting to "RLHF is great in at least those ways" until
       | an alternative is empirically proven to be better. I also hope
       | for larger, better, open-source collections of RLHF training
       | data.
        
         | dontwearitout wrote:
         | Claude notably does _not_ use RLHF, but RLAIF, using an LLM to
         | generate the preferences based on a "constitution" instead of
         | human preferences. It's remarkable that it can bootstrap itself
         | up to such high quality. See https://arxiv.org/pdf/2212.08073
         | for more.
        
       | visarga wrote:
       | I agree RLHF is not full RL - it's more like contextual bandits,
       | because there is always just a single decision and no credit-
       | assignment difficulty. But there is one great thing about RLHF
       | compared to supervised training: it updates the model on the
       | whole sequence instead of only the next token. This is
       | fundamentally different from pre-training, where the model learns
       | to be myopic and doesn't learn to address the "big picture".
       | 
       | So there are 3 levels of optimization in discussion here:
       | 
       | 1. for the next token (NTP)
       | 
       | 2. for a single turn response (RLHF)
       | 
       | 3. for actual task completion or long-term objectives (RL)
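       | 
       | A rough sketch of the difference between (1) and (2), assuming a
       | PyTorch-style setup (the function names are just illustrative):
       | 
       |     import torch.nn.functional as F
       | 
       |     def pretrain_loss(logits, targets):
       |         # (1) next-token prediction: a separate, local
       |         # cross-entropy term at every token position
       |         logits = logits.view(-1, logits.size(-1))
       |         return F.cross_entropy(logits, targets.view(-1))
       | 
       |     def rlhf_response_loss(token_logprobs, reward):
       |         # (2) single-turn RLHF: one scalar reward for the whole
       |         # sampled response, pushed back onto every token of it
       |         # via a REINFORCE-style policy gradient
       |         return -(reward * token_logprobs.sum())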
        
       | islewis wrote:
       | Karpathy is _much_ more knowledgeable about this than I am, but I
       | feel like this post is missing something.
       | 
       | Go is a game that is fundamentally too complex for humans to
       | solve. We've known this since way back before AlphaGo. Since
       | humans were not perfect Go players, we didn't use them to teach
       | the model - we wanted the model to be able to beat humans.
       | 
       | I don't see language being comparable. The "perfect" LLM imitates
       | humans perfectly, presumably to the point where you can't tell
       | the difference between LLM-generated text and human-generated
       | text. Maybe it's just as flexible as the human mind too, and can
       | context-switch quickly, swapping between formalities, tones, and
       | slang. But the concept of "beating" a human doesn't really make
       | much sense.
       | 
       | AlphaGo and Stockfish can push forward our understanding of
       | their respective games, but an LLM can't push forward our
       | boundary of language, because it's fundamentally a copy-cat
       | model. That makes RLHF make much more sense in the LLM realm
       | than in the Go realm.
        
         | Miraste wrote:
         | One of the problems lies in the way RLHF is often performed:
         | presenting a human with several different responses and having
         | them choose one. The goal here is to create the most human-like
         | output, but the process is instead creating outputs humans like
         | the most, which can seriously limit the model. For example,
         | most recent diffusion-based image generators use the same
         | process to improve their outputs, relying on volunteers to
         | select which outputs are preferable. This has led to models
         | that are comically incapable of generating ugly or average-
         | looking people, because the volunteers systematically rate
         | those outputs lower.
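         | 
         | Concretely, the usual pairwise objective only cares about which
         | response the raters preferred. A minimal sketch (PyTorch-style
         | Bradley-Terry reward-model loss, not any lab's actual code):
         | 
         |     import torch.nn.functional as F
         | 
         |     def preference_loss(r_chosen, r_rejected):
         |         # r_* are scalar reward-model scores for the two
         |         # responses; training pushes "what raters like" up,
         |         # not "what is most human-like"
         |         return -F.logsigmoid(r_chosen - r_rejected).mean()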
        
         | will-burner wrote:
         | This is a great comment. Another important distinction, I
         | think, is that in the AlphaGo case there's no equivalent to the
         | generalized predict next token pretraining that happens for
         | LLMs (at least I don't think so, this is what I'm not sure of).
         | For LLMs, RLHF teaches the model to be conversational, but the
         | model has already learned language and how to talk like a human
         | from the predict next token pretraining.
        
       | taeric wrote:
       | I thought the entire point of the human/expert feedback was for
       | domains where you cannot exhaustively search the depth of the
       | space? Yes, if you can go deeper in the search space, you should
       | do so, regardless of how bad the score is at the current spot.
       | You only branch to other options when it is exhausted.
       | 
       | And if you don't have a way to say that something has been
       | exhausted, then you look for heuristics to choose more profitable
       | places to search. Hence the HF is added.
        
         | HarHarVeryFunny wrote:
         | Searching for what ?!
         | 
         | Human expectation/preference alignment is the explicit goal,
         | not a way to achieve something else. RLHF (or an alternative
         | such as ORPO) is used to take a raw pre-trained foundation
         | model, which by itself is only going to try to match training
         | set statistics, and finetune it to follow human expectations
         | for uses such as chat (incl. Q&A).
        
           | taeric wrote:
           | Learning is always exploring a search space. Literally
           | deciding which choice would be most likely to get to an
           | answer. If you have a set of answers, deciding what
           | alterations would make for a better answer.
           | 
           | Like, I don't know what you think is wrong with that? The
           | human/expert feedback is there to provide scores that we
           | don't know how to fully codify, yet. It's effectively an
           | acknowledgement that we don't know how to codify the "I know
           | it when I see it" rule. And based on those scores, the model
           | updates and new things can be scored.
           | 
           | What is not accurate in that description?
        
             | HarHarVeryFunny wrote:
             | Direct human feedback - from an actual human - is the gold
             | standard here, since it is an actual human who will be
             | evaluating how well they like your deployed model.
             | 
             | Note that using codified-HF (as is in fact already done -
             | the actual HF being first used to train a proxy reward
             | model) doesn't change things here - tuning the model to
             | maximize this metric of human usability _IS_ the goal. The
             | idea of using RL is to do the search at training time
             | rather than inference time when it 'd be massively more
             | expensive. You can think of all the multi-token model
             | outputs being evaluated by the reward model as branches of
             | the search tree, and the goal of RL is to influence earlier
             | token selection to lead towards these preferred outputs.
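             | 
             | As a rough sketch of the search that RL amortizes (the
             | names here are placeholders, not a real API): at inference
             | time you could brute-force it with best-of-n against the
             | proxy reward model, whereas RLHF tunes the policy so the
             | high-reward continuation tends to come out on the first
             | sample.
             | 
             |     def best_of_n(prompt, generate, reward_model, n=16):
             |         # explicit inference-time search: sample n full
             |         # responses, keep the one the reward model likes
             |         candidates = [generate(prompt) for _ in range(n)]
             |         return max(candidates, key=reward_model)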
        
               | taeric wrote:
               | This is no different from any other ML situation, though?
               | Famously, people found out that Amazon's hands-free
               | checkout was being offloaded to people in the cases where
               | the system couldn't give a high-confidence answer. I
               | would be shocked if those judgements were not then
               | labeled and used in automated training later.
               | 
               | And I should say that I said "codified" but I don't mean
               | just code. Labeled training samples are fine here. That
               | doesn't change that finding a model that will give good
               | answers is ultimately something that can be
               | conceptualized as a search.
               | 
               | You are also blurring the reinforcement/scoring done at
               | inference time with the work that is done at training
               | time? The idea of using RL at training time is not just
               | that doing it at inference time is expensive. The goal is
               | to find the policies that are best to use at inference
               | time.
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:02 UTC)