[HN Gopher] RLHF is just barely RL
___________________________________________________________________
RLHF is just barely RL
Author : tosh
Score : 352 points
Date : 2024-08-08 06:22 UTC (16 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| normie3000 wrote:
| > In machine learning, reinforcement learning from human feedback
| (RLHF) is a technique to align an intelligent agent to human
| preferences.
|
| https://en.m.wikipedia.org/wiki/Reinforcement_learning_from_...
| moffkalast wrote:
| Note that human preference isn't universal. RLHF is mostly
| frowned upon by the open source LLM community since it
| typically involves aligning the model to the preference of
| corporate manager humans, i.e. tuning for censorship and
| political correctness to make the model as bland as possible so
| the parent company doesn't get sued.
|
  | For actual reinforcement learning with a feedback loop that
  | aims to increase overall performance, the current techniques are
  | SPPO and Meta's variant of it [0], which slightly outperforms it.
  | It involves using a larger LLM as a judge though, so the
  | accuracy of the results is somewhat dubious.
|
| [0] https://arxiv.org/pdf/2407.19594
| codeflo wrote:
| Reminder that AlphaGo and its successors have _not_ solved Go and
| that reinforcement learning still sucks when encountering out-of-
| distribution strategies:
|
| https://arstechnica.com/information-technology/2023/02/man-b...
| thanhdotr wrote:
  | Well, as Yann LeCun said:
|
| "Adversarial training, RLHF, and input-space contrastive
| methods have limited performance. Why? Because input spaces are
| _BIG_. There are just too many ways to be wrong " [1]
|
  | A way to solve the problem is to project onto a latent space
  | and then try to discriminate/predict the best action down there.
| There's much less feature correlation down in latent space than
| in your observation space. [2]
|
| [1]:https://x.com/ylecun/status/1803696298068971992 [2]:
| https://openreview.net/pdf?id=BZ5a1r-kVsf
| viraptor wrote:
| I wouldn't say it sucks. You just need to keep training it for
| as long as needed. You can do adversarial techniques to
| generate new paths. You can also use the winning human
| strategies to further improve. Hopefully we'll find better
| approaches, but this is extremely successful and far from
| sucking.
|
  | Sure, Go is not solved yet. But RL is just fine continuing
  | toward that asymptote for as long as we want.
|
  | The funny part is that this applies to people too. Masters
  | don't like to play low-ranked people because they're
  | unpredictable and the Elo loss for them is not worth the risk.
  | (Which does raise questions about how we really rank people.)
| shakna wrote:
| > I wouldn't say it sucks. You just need to keep training it
| for as long as needed.
|
| As that timeline can approach infinity, just adding extra
| training may not actually be a sufficient compromise.
| rocqua wrote:
  | AlphaGo didn't have human feedback, but it did learn from humans
| before surpassing them. Specifically, it had a network to
| 'suggest good moves' that was trained on predicting moves from
| pro level human games.
|
  | The entire point of AlphaZero was to eliminate this human
  | influence and go with pure reinforcement learning (i.e. zero
| human influence).
| cherryteastain wrote:
| A game like Go has a clearly defined objective (win the game or
| not). A network like you described can therefore be trained to
  | give a score to each move. The point here is that assessing
  | whether a given sentence sounds good to humans does not have a
  | clearly defined objective; the only way we've come up with so
  | far is to ask real humans.
| esjeon wrote:
| AlphaGo is an optimization over a closed problem.
  | Theoretically, computers could always have beaten humans at such
  | problems. It's just that, without proper optimization, humans
| will die before the computer finishes its computation. Here,
| AlphaGo cuts down the computation time by smartly choosing the
| branches with the highest likelihood.
|
  | Unlike the above, open problems can't be solved by computation
  | (in the combinatorial sense). Even humans can only _try_ , and
  | LLMs spew out something that would most likely work, not
  | something inherently correct.
| __loam wrote:
| Been shouting this for over a year now. We're training AI to be
| convincing, not to be actually helpful. We're sampling the wrong
| distributions.
| danielbln wrote:
| I find them very helpful, personally.
| sussmannbaka wrote:
| Understandable, they have been trained to convince you of
| their helpfulness.
| danielbln wrote:
| If they convinced me of their helpfulness, and their output
| is actually helpful in solving my problems.. well, if it
| walks like a duck and quacks like a duck, and all that.
| tpoacher wrote:
| if it walks like a duck and it quacks like a duck, then
| it lacks strong typing.
| Nullabillity wrote:
| "Appears helpful" and "is helpful" are two very different
| properties, as it turns out.
| snapcaster wrote:
| Sometimes, but that's an edge case that doesn't seem to
| impact the productivity boosts from LLMs
| __loam wrote:
| It doesn't until it does. Productivity isn't the only or
| even the most important metric, at least in software dev.
| snapcaster wrote:
| Can you be more specific with like examples or something?
| exe34 wrote:
| https://xkcd.com/810/
| djeastm wrote:
| This is true, but part of that convincing is actually
| providing at least some amount of response that is helpful
| and moving you forward.
|
| I have to use coding as an example, because that's 95% of
| my use cases. I type in a general statement of the problem
| I'm having and within seconds, I get back a response that
| speaks my language and provides me with some information to
| ingest.
|
  | Now, I don't know for sure if every sentence I read in
| the response is correct, but let's say that 75% of what I
| read aligns with what I currently know to be true. If I
| were to ask a real expert, I'd possibly understand or
| already know 75% of what they're telling me, as well, with
| the other 25% still to be understood and thus trusting the
| expert.
|
| But either with AI or a real expert, for coding at least,
| that 25% will be easily testable. I go and implement and
| see if it passes my test. If it does, great. If not, at
| least I have tried something and gotten farther down the
| road in my problem solving.
|
  | Since AI generally does that for me, I am convinced of
  | its helpfulness because it moves me along.
| dgb23 wrote:
| Depends on who you ask.
|
  | Advertising and propaganda are not necessarily helpful for
  | consumers; they just need to be convincing in order to be
  | helpful for producers.
| khafra wrote:
| It would be interesting to see RL on a chatbot that's the
| last stage of a sales funnel for some high-volume item--it'd
| have fast, real-world feedback on how convincing it is, in
| the form of a purchase decision.
| iamatoool wrote:
| Sideways eye look at leetcode culture
| tpoacher wrote:
| s / AI /
| Marketing|Ads|Consultants|Experts|Media|Politicians|...
| HarHarVeryFunny wrote:
| If what you want is auto-complete (e.g. CoPilot, or natural
| language search) then LLMs are built for that, and useful.
|
  | If what you want is AGI then design an architecture with the
  | necessary moving parts! The current approach reminds me of the
  | joke about the drunk looking for his dropped car keys under the
  | street lamp because "it's bright here", rather than near where
  | he actually dropped them. It seems folk have spent years trying to
| come up with alternate learning mechanisms to gradient descent
| (or RL), and having failed are now trying to use SGD/pre-
| training for AGI "because it's what we've got", as opposed to
| doing the hard work of designing the type of always-on online
| learning algorithm that AGI actually requires.
| iamatoool wrote:
  | The SGD/pre-training/deep learning/transformer local maximum
  | is profitable. Trying new things is not, so you are relying
| on researchers making a breakthrough, but then to make a blip
| you need a few billion to move the promising model into
| production.
|
| The tide of money flow means we are probably locked into
  | transformers for some time. Transformer ASICs, for example,
  | will be built in droves. It will be hard to compete with
| the status quo. Transformer architecture == x86 of AI.
| timthelion wrote:
  | Karpathy writes that there is no cheaply computed objective check
  | for "Or re-writing some Java code to Python?", among other
  | things. But it seems to me that reinforcement learning should be
  | possible for code translation using automated integration
  | testing. Run it and see if it does the same thing!
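  |
  | (A minimal sketch of that reward signal, with hypothetical
  | java_cmd/python_cmd entry points standing in for the two
  | programs; illustrative only, not code from the thread.)
  |
  |     import subprocess
  |
  |     def run(cmd, stdin_text):
  |         # Run a program and capture its stdout.
  |         proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
  |                               text=True, timeout=10)
  |         return proc.stdout
  |
  |     def translation_reward(java_cmd, python_cmd, test_inputs):
  |         # Fraction of inputs on which the Python translation
  |         # reproduces the Java program's output exactly.
  |         matches = 0
  |         for stdin_text in test_inputs:
  |             try:
  |                 if run(java_cmd, stdin_text) == run(python_cmd, stdin_text):
  |                     matches += 1
  |             except subprocess.TimeoutExpired:
  |                 pass  # treat hangs as mismatches
  |         return matches / len(test_inputs)
  |
  |     # e.g. translation_reward(["java", "Main"], ["python3", "main.py"],
  |     #                         ["1 2\n", "0 0\n", "-5 7\n"])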
| jeroenvlek wrote:
| My takeaway is that it's difficult to make a "generic enough"
| evaluation that encompasses all things we use an LLM for, e.g.
  | code, summaries, jokes. No free lunch, and all that.
| IanCal wrote:
  | Even there, how you score it is hard.
  |
  | "Is it the same for this set of inputs?" may be fine for a
  | subset of things, but then that's a binary thing. If it's
  | slightly wrong, do you score by the number of outputs that
  | match? A purely binary thing gives little useful help for
  | nudging a model in the right direction. How do you compare two
  | that both work - which is more "idiomatic"?
| theqwxas wrote:
| I agree that it's a very difficult problem. I'd like to
  | mention AlphaDev [0], an RL algorithm that builds other
  | algorithms; there they combined a measure of correctness
  | and a measure of algorithm speed (latency) to get the reward.
| But the algorithms they built were super small (e.g., sorting
| just three numbers), therefore they could measure correctness
| using all input combinations. It is still unclear how to
| scale this to larger problems.
|
  | [0] https://deepmind.google/discover/blog/alphadev-discovers-fas...
| exe34 wrote:
| for "does it run" cases, you can ask the model to try again,
| give it higher temperature, show it the traceback errors,
| (and maybe intermediate variables?) or even ask it to break
| up the problem into smaller pieces and then try to translate
| that.
|
| for testing, if you use something like quickcheck, you might
| find bugs that you wouldn't otherwise find.
|
| when it comes to idiomatic, I'm not sure - but if we're at
| the point that gpt is writing code that works, do we really
| care? as long as this code is split into many small pieces,
| we can just replace the piece instead of trying to
| understand/fix it if we can't read it. in fact, maybe there's
| a better language that is human readable but better for
| transformers to write and maintain.
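  |
  | (For example, a quickcheck-style property test in Python using
  | the hypothesis library; the add() under test is a stand-in, not
  | code from the thread.)
  |
  |     from hypothesis import given, strategies as st
  |
  |     def add(a, b):
  |         return a + b
  |
  |     @given(st.integers(), st.integers())
  |     def test_add_commutes(a, b):
  |         # Property: argument order must not matter.
  |         assert add(a, b) == add(b, a)
  |
  |     @given(st.integers())
  |     def test_add_identity(a):
  |         # Property: adding zero changes nothing.
  |         assert add(a, 0) == a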
| IanCal wrote:
| For "does it run" I'm not talking about how do we test that
| it does, but how do we either score or compare two+
| options?
|
| > when it comes to idiomatic, I'm not sure - but if we're
| at the point that gpt is writing code that works, do we
| really care?
|
| Yes - it's certainly preferable. You may prefer working
| over neat, but working and neat over working but insane
| spaghetti code.
|
| Remember this is about training the models, not about using
| them later. How do we tell, while training, which option
| was better to push it towards good results?
| fulafel wrote:
| "Programs must be written for people to read, and only
| incidentally for machines to execute." -- Harold Abelson
| exe34 wrote:
| "programs written by LLMs must run correctly and only
| incidentally be human readable." - Me.
| alex_suzuki wrote:
| "WTF?!" - engineer who has to troubleshoot said programs.
| exe34 wrote:
| "given the updated input and output pairs below, generate
| code that would solve the problem."
| msoad wrote:
  | A program like
  |
  |     function add(a, b) {
  |       return 4
  |     }
  |
  | passes the test.
| falcor84 wrote:
| I suppose you're alluding to xkcd's joke about this [0],
| which is indeed a good one, but what test does this actually
| pass?
|
  | The approach I was thinking of is that assuming we start with
  | the Java program:
  |
  |     public class Addition {
  |         public static int add(int a, int b) {
  |             return a + b;
  |         }
  |     }
  |
  | We can semi-automatically generate a basic test runner with
  | something like this, generating some example inputs
  | automatically:
  |
  |     public class Addition {
  |         public static int add(int a, int b) {
  |             return a + b;
  |         }
  |
  |         public static class AdditionAssert {
  |             private int a;
  |             private int b;
  |
  |             public AdditionAssert a(int a) {
  |                 this.a = a;
  |                 return this;
  |             }
  |
  |             public AdditionAssert b(int b) {
  |                 this.b = b;
  |                 return this;
  |             }
  |
  |             public void assertExpected(int expected) {
  |                 int result = add(a, b);
  |                 assert result == expected : "Expected " + expected + " but got " + result;
  |                 System.out.println("Assertion passed for " + a + " + " + b + " = " + result);
  |             }
  |         }
  |
  |         public static void main(String[] args) {
  |             new AdditionAssert().a(5).b(3).assertExpected(8);
  |             new AdditionAssert().a(-1).b(4).assertExpected(3);
  |             new AdditionAssert().a(0).b(0).assertExpected(0);
  |             System.out.println("All test cases passed.");
  |         }
  |     }
  |
  | Another bit of automated preparation would then automatically
  | translate the test cases to Python, and then the actual LLM
  | would need to generate a python function until it passes all
  | the translated test cases:
  |
  |     def add(a, b):
  |         return 4
  |
  |     def addition_assert(a, b, expected):
  |         result = add(a, b)
  |         assert result == expected, f"Expected {expected} but got {result}"
  |
  |     addition_assert(a=5, b=3, expected=8)
  |     addition_assert(a=-1, b=4, expected=3)
  |     addition_assert(a=0, b=0, expected=0)
|
| It might not be perfect, but I think it's very feasible and
| can get us close to there.
|
| [0] https://xkcd.com/221/
| WithinReason wrote:
| yes but that's not cheaply computed. You need good test
| coverage, etc.
| leobg wrote:
| A cheap DIY way of achieving the same thing as RLHF is to fine
| tune the model to append a score to its output every time.
|
| Remember: The reason we need RLHF at all is that we cannot write
| a loss function for what makes a good answer. There are just many
  | ways a good answer could look, and that cannot be calculated on
  | the basis of next-token probability.
|
| So you start by having your vanilla model generate n completions
  | for your prompt. You then manually score them. And then those
| prompt => (completion,score) pairs become your training set.
|
| Once the model is trained, you may find that you can cheat:
|
| Because if you include the desired score in your prompt, the
| model will now strive to produce an answer that is consistent
| with that score.
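  |
  | (A minimal sketch of that flip, with illustrative field names and
  | no particular fine-tuning library assumed.)
  |
  |     def build_sft_examples(prompt, scored_completions, condition_on_score=True):
  |         # scored_completions: list of (completion_text, human_score) pairs.
  |         examples = []
  |         for completion, score in scored_completions:
  |             if condition_on_score:
  |                 # Desired score goes *before* the answer, so the tuned
  |                 # model can be steered at inference time.
  |                 examples.append({"input": f"score: {score}\n{prompt}",
  |                                  "output": completion})
  |             else:
  |                 # Original variant: the model learns to append a score.
  |                 examples.append({"input": prompt,
  |                                  "output": f"{completion}\nscore: {score}"})
  |         return examples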
| viraptor wrote:
  | That works in the same way as an actor-critic pair, right? Just
  | all wrapped in the same network/output?
| bick_nyers wrote:
  | I had an idea similar to this for a model that lets you
  | parameterize a performance vs. accuracy trade-off: essentially an
  | imbalanced MoE-like approach where, instead of the "quality
  | score" in your example, you assign a score based on how much
  | computation was used to achieve the answer. Then you can
  | dynamically request that different code paths be taken at
  | inference time.
| lossolo wrote:
| Not the same, it will get you worse output and is harder to do
| right in practice.
| visarga wrote:
| > if you include the desired score in your prompt, the model
| will now strive to produce an answer that is consistent with
| that score
|
  | But you need a model to generate a score from an answer, and
  | then fine-tune another model to generate an answer conditioned
  | on the score. The first time the score is at the end, the second
  | time at the beginning. That's how Decision Transformer works
  | too: it constructs a sequence of (reward, state, action) where
  | the reward conditions the next action.
|
| https://arxiv.org/pdf/2106.01345
|
| By the same logic you could generate tags, including style,
| author, venue and date. Some will be extracted from the source
| document, the others produced with classifiers. Then you can
| flip the order and finetune a model that takes the tags before
  | the answer. Then you've got an LLM you can condition on author
  | and style.
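  |
  | (A minimal sketch of that return-conditioned layout, roughly as
  | in the Decision Transformer paper; token names are illustrative.)
  |
  |     def decision_transformer_tokens(episode):
  |         # episode: list of (state, action, reward) steps.
  |         # Interleave return-to-go, state, action so the model learns to
  |         # predict actions conditioned on the desired future return.
  |         return_to_go = sum(r for _, _, r in episode)
  |         tokens = []
  |         for state, action, reward in episode:
  |             tokens += [("return_to_go", return_to_go),
  |                        ("state", state),
  |                        ("action", action)]
  |             return_to_go -= reward
  |         return tokens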
| jrflowers wrote:
| I enjoyed this Karpathy post about how there is absolutely no
| extant solution to training language models to reliably solve
| open ended problems.
|
| I preferred Zitron's point* that we would need to invent several
| branches of science to solve this problem, but it's good to see
| the point made tweet-sized.
|
| *https://www.wheresyoured.at/to-serve-altman/
| moffkalast wrote:
| That's a great writeup, and great references too.
|
| > OpenAI needs at least $5 billion in new capital a year to
| survive. This would require it to raise more money than has
| ever been raised by any startup in history
|
| They were probably toast before, but after Zuck decided to take
| it personally and made free alternatives for most use cases
  | they definitely are, since if they have any notable revenue from
  | selling API access it will just keep dropping.
| klibertp wrote:
| I read the article you linked. I feel like I wasted my time.
|
| The article has a single point it repeats over and over again:
| OpenAI (and "generative AI as a whole"/"transformer-based
| models") are too expensive to run, and it's "close to
| impossible" for them to either limit costs or increase revenue.
| This is because "only 5% of businesses report using the
| technology in production", and that the technology had no
| impact on "productivity growth". It's also because "there's no
| intelligence in it", and the "models can't reason". Oh, also,
| ChatGPT is "hard to explain to a layman".
|
| All that is liberally sprinkled with "I don't know, but"s and
| absolutely devoid of any historical context other than in
| financial terms. No technical details. Just some guesses and an
| ironclad belief that it's impossible to improve GPTs without
| accessing more data than there is in existence. Agree or
| disagree; the article is not worth wading through so many
| words: others made arguments on both sides much better and,
| crucially, shorter.
| jrflowers wrote:
| > The article has a single point it repeats over and over
| again: [7 distinct points]
|
  | I don't think having a single overall thesis is the same thing
  | as repeating oneself. For example "models can't reason" has
| nothing at all to do with cost.
| klibertp wrote:
| 7 distinct points in the number of words that would suffice
| for 70 points...
|
| Anyway, it's just my opinion: to me, the length of the
| article was artificially increased to the point where it
| wasn't worth my time to read it. As such, unfortunately,
| I'm not inclined to spend any more time discussing it - I
| just posted my takeaways and a warning for people like me.
| If you liked the article, good for you.
|
| > "models can't reason" has nothing at all to do with cost.
|
| Yeah, that one falls under "no technical details".
| gizmo wrote:
| This is why AI coding assistance will leap ahead in the coming
| years. Chat AI has no clear reward function (basically impossible
| to judge the quality of responses to open-ended questions like
| historical causes for a war). Coding AI can write tests, write
| code, compile, examine failed test cases, search for different
| coding solutions that satisfy more test cases or rewrite the
  | tests, all in an unsupervised loop. And then the whole process
  | can turn into training data for future AI coding models.
|
| I expect language models to also get crazy good at mathematical
| theorem proving. The search space is huge but theorem
| verification software will provide 100% accurate feedback that
| makes real reinforcement learning possible. It's the combination
| of vibes (how to approach the proof) and formal verification that
| works.
|
| Formal verification of program correctness never got traction
| because it's so tedious and most of the time approximately
| correct is good enough. But with LLMs in the mix the equation
| changes. Having LLMs generate annotations that an engine can use
| to prove correctness might be the missing puzzle piece.
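  |
  | (A rough sketch of such a loop, with a hypothetical llm_propose()
  | standing in for whatever model is being trained; an illustration,
  | not an existing system.)
  |
  |     import subprocess, tempfile
  |     from pathlib import Path
  |
  |     def run_pytest(code, tests):
  |         # Write candidate code and tests to a temp dir and run pytest.
  |         with tempfile.TemporaryDirectory() as d:
  |             Path(d, "solution.py").write_text(code)
  |             Path(d, "test_solution.py").write_text(tests)
  |             proc = subprocess.run(["pytest", "-q", d],
  |                                   capture_output=True, text=True)
  |             return proc.returncode == 0, proc.stdout
  |
  |     def improvement_loop(spec, llm_propose, max_rounds=5):
  |         tests = llm_propose(f"Write pytest tests for: {spec}")
  |         code = llm_propose(f"Write Python code for: {spec}")
  |         trajectory = []  # becomes training data for future models
  |         for _ in range(max_rounds):
  |             passed, log = run_pytest(code, tests)
  |             trajectory.append((code, tests, passed))
  |             if passed:
  |                 break
  |             code = llm_propose(f"Spec: {spec}\nFailing output:\n{log}\n"
  |                                f"Fix this code:\n{code}")
  |         return trajectory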
| discreteevent wrote:
| Does programming have a clear reward function? A vague
| description from a business person is not it. By the time
| someone (a programmer?) has written a reward function that is
| clear enough, how would that function look compared to a
| program?
| rossamurphy wrote:
| +1
| eru wrote:
| > Does programming have a clear reward function? A vague
| description from a business person isn't it. By the time
| someone (a programmer?) has written a reward function that is
| clear enough, how would that function look compared to a
| program?
|
| Well, to give an example: the complexity class NP is all
| about problems that have quick and simple verification, but
| finding solutions for many problems is still famously hard.
|
| So there are at least some domains where this model would be
| a step forward.
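  |
  | (A toy illustration of that verify-vs-solve gap, using subset
  | sum; not tied to anything in the thread.)
  |
  |     from itertools import combinations
  |     from collections import Counter
  |
  |     def verify(nums, subset, target):
  |         # Checking a proposed certificate is a single cheap pass.
  |         return not (Counter(subset) - Counter(nums)) and sum(subset) == target
  |
  |     def solve(nums, target):
  |         # Finding a certificate is exponential in the worst case.
  |         for r in range(len(nums) + 1):
  |             for subset in combinations(nums, r):
  |                 if sum(subset) == target:
  |                     return list(subset)
  |         return None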
| thaumasiotes wrote:
| But in that case, finding the solution is hard and you
| generally don't try. Instead, you try to get fairly close,
| and it's more difficult to verify that you've done so.
| eru wrote:
| No. Most instances of most NP hard problems are easy to
| find solutions for. (It's actually really hard to eg
| construct a hard instance for the knapsack problem. And
| SAT solvers also tend to be really fast in practice.)
|
| And in any case, there are plenty of problems in NP that
| are not NP hard, too.
|
| Yes, approximation is also an important aspect of many
| practical problems.
|
  | There are also lots of problems where you can easily
| specify one direction of processing, but it's hard to
| figure out how to undo that transformation. So you can
| get plenty of training data.
| imtringued wrote:
| I have a very simple integer linear program and it is
| really waiting for the heat death of the universe.
|
| No, running it as a linear program is still slow.
|
| I'm talking about small n=50 taking tens of minutes for a
| trivial linear program. Obviously the actual linear
| program is much bigger and scales quadratically in size,
| but still. N=50 is nothing.
| tablatom wrote:
| Very good point. For some types of problems maybe the answer
  | is yes. For example, porting: the reward function is testing
  | that it behaves the same in the new language as in the old one.
  | Tricky for apps with a GUI, but it doesn't seem impossible.
|
| The interesting kind of programming is the kind where I'm
| figuring out what I'm building as part of the process.
|
| Maybe AI will soon be superhuman in all the situations where
  | we know _exactly_ what we want (win the game), but not in the
  | areas we don't. I find that kind of cool.
| martinflack wrote:
| Even for porting there's a bit of ambiguity... Do you port
| line-for-line or do you adopt idioms of the target
| language? Do you port bug-for-bug as well as feature-for-
| feature? Do you leave yet-unused abstractions and
| opportunities for expansion that the original had coded in,
| if they're not yet used, and the target language code is
| much simpler without?
|
| I've found when porting that the answers to these are
| sometimes not universal for a codebase, but rather you are
| best served considering case-by-case inside the code.
|
| Although I suppose an AI agent could be created that holds
| a conversation with you and presents the options and acts
| accordingly.
| littlestymaar wrote:
| "A precise enough specification is already code", which means
| we'll not run out of developers in the short term. But the
| day to day job is going to be very different, maybe as
| different as what we're doing now compared to writing machine
| code on punchcards.
| mattmanser wrote:
| Doubtful. This is the same mess we've been in repeatedly
| with 'low code'/'no code' solutions.
|
| Every decade it's 'we don't need programmers anymore'. Then
| it turns out specifying the problem needs programmers. Then
| it turns out the auto-coder can only reach a certain level
| of complexity. Then you've got real programmers modifying
  | over-complicated code. Then everyone realizes they've
| wasted millions and it would have been quicker and cheaper
| to get the programmers to write the code in the first
| place.
|
| The same will almost certainly happen with AI generated
| code for the next decade or two, just at a slightly higher
| level of program complexity.
| cs702 wrote:
  | Programming has a clear reward function when the problem
  | being solved is well-specified, e.g., "we need a program
| that grabs data from these three endpoints, combines their
| data in this manner, and returns it in this JSON format."
|
| The same is true for math. There is a clear reward function
| when the goal is well-specified, e.g., "we need a sequence of
| mathematical statements that prove this other important
| mathematical statement is true."
| seanthemon wrote:
  | >when the problem being solved is well-specified
|
| Phew! Sounds like i'll be fine, thank god for product
| owners.
| steveBK123 wrote:
| 20 years, number of "well specified" requirements
| documents I've received: 0.
| danpalmer wrote:
| I'm not sure I would agree. By the time you've written a
| full spec for it, you may as well have just written a high
| level programming language anyway. You can make assumptions
| that minimise the spec needed... but also programming APIs
| can have defaults so that's no advantage.
|
| I'd suggest that the Python code for your example prompt
| with reasonable defaults is not actually that far from the
| prompt itself in terms of the time necessary to write it.
|
| However, add tricky details like how you want to handle
| connection pooling, differing retry strategies, short
| circuiting based on one of the results, business logic in
| the data combination step, and suddenly you've got a whole
| design doc in your prompt and you need a senior engineer
| with good written comms skills to get it to work.
| cs702 wrote:
| Thanks. I view your comment as orthogonal to mine,
| because I didn't say anything about how easy or hard it
| would be for human beings to specify the problems that
| must be solved. Some problems may be easy to specify,
| others may be hard.
|
| I feel we're looking at the need for a measure of the
| computational complexity of _problem specifications_ --
| something like Kolmogorov complexity, i.e., minimum
| number of bits required, but for specifying instead of
| solving problems.
| danpalmer wrote:
  | Apologies, I guess I agree with your sentiment but
  | disagree with the example you gave, as I don't think it's
  | well specified. My more general point is that there
  | isn't an effective specification, which means that in
  | practice there isn't a clear reward function. If we can
  | get a clear specification - which we probably can, in
  | proportion to the complexity of the problem, as long as
  | we don't go very far up the complexity curve - then I
  | would agree we can get a good reward function.
| cs702 wrote:
| > the example you gave
|
| Ah, got it. I was just trying to keep my comment short!
| chasd00 wrote:
| > I'm not sure I would agree. By the time you've written
| a full spec for it, you may as well have just written a
| high level programming language anyway.
|
| Remember all those attempts to transform UML into code
| back in the day? This sounds sorta like that. I'm not a
| total genai naysayer but definitely in the "cautiously
| curious" camp.
| danpalmer wrote:
| Absolutely, we've tried lots of ways to formalise
| software specification and remove or minimise the amount
| of coding, and almost none of it has stuck other than
| creating high level languages and better code-level
| abstractions.
|
| I think generative AI is already a "really good
| autocomplete" and will get better in that respect, I can
| even see it generating good starting points, but I don't
| think in its current form it will replace the act of
| programming.
| bee_rider wrote:
| Yeah, an LLM applied to converting design docs to
| programs seems like, essentially, the invention of an
| extremely high level programming language. Specifying the
| behavior of the program in sufficient detail is... why we
| have programming languages.
|
| There's the task of writing syntax, which is the
| mechanical overhead of the task of telling the computer
| what to do. People should focus on the latter (too much
| code is a symptom of insufficient automation or
| abstraction). Thankfully lots of people have CS degrees,
| not "syntax studies" degrees, right?
| ekianjo wrote:
  | > Programming has a clear reward function when the problem
  | being solved is well-specified
|
| the reason why we spend time programming is because the
| problems in question are not easily defined, let alone the
| solutions.
| FooBarBizBazz wrote:
| This could make programming more declarative or constraint-
| based, but you'd still have to specify the properties you
| want. Ultimately, if you are defining some function in the
| mathematical sense, you need to say somehow what inputs go
| to what outputs. You need to _communicate_ that to the
| computer, and a certain number of bits will be needed to do
  | that. Of course, if you have a good statistical model of
  | how probable it is that a human wants a given function f, then
  | you can perform that communication to the machine in about
  | log(1/P(f)) bits, so the model isn't worthless.
|
| Here I have assumed something about the set that f lives
| in. I am taking for granted that a probability measure can
| be defined. In theory, perhaps there are difficulties
| involving the various weird infinities that show up in
| computing, related to undecideability and incompleteness
| and such. But at a practical level, if we assume some
| concrete representation of the program then we can just
| define that it is smaller than some given bound, and ditto
| for a number of computational steps with a particular model
| of machine (even if fairly abstract, like some lambda
| calculus thing), so realistically we might be able to not
| worry about it.
|
| Also, since our input and output sets are bounded (say, so
| many 64-bit doubles in, so many out), that also gives you a
| finite set of functions in principle; just think of the
| size of the (impossibly large) lookup table you'd need to
| represent it.
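  |
  | (A worked example of that bit count; an illustration, not the
  | commenter's code.)
  |
  |     import math
  |
  |     def bits_to_specify(p_f):
  |         # Shannon code length: a function the model assigns probability
  |         # p_f can be pointed out in about -log2(p_f) bits.
  |         return -math.log2(p_f)
  |
  |     # bits_to_specify(1/1024) -> 10.0
  |     # bits_to_specify(0.5)    ->  1.0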
| agos wrote:
| can you give an example of what "in this manner" might be?
| dartos wrote:
| > programming has a clear reward function.
|
| If you're the most junior level, sure.
|
| Anything above that, things get fuzzy, requirements change,
| biz goals shift.
|
| I don't really see this current wave of AI giving us
| anything much better than incremental improvement over
| copilot.
|
| A small example of what I mean:
|
  | These systems are statistically based, so there are no
  | guarantees. Because of that, I wouldn't even gain anything
  | from having them write my tests, since tests are easily built
  | wrong in subtle ways.
|
  | I'd need to verify the test by reviewing it and, imo,
  | writing the test myself would take less time than coaxing a
  | correct one, reviewing, re-coaxing, and repeating.
| nyrikki wrote:
  | A couple of problems that are impossible to prove from the
  | constructivist angle:
  |
  | 1) addition of the natural numbers, 2) equality of two real
  | numbers.
|
| When you restrict your tools to perceptron based feed
| forward networks with high parallelism and no real access
| to 'common knowledge', the solution set is very restricted.
|
  | Basically, what Godel proved, which destroyed Russell's plans
  | for the Principia Mathematica, applies here.
|
| Programmers can decide what is sufficient if not perfect in
| models.
| gizmo wrote:
| Much business logic is really just a state machine where all
| the states and all the transitions need to be handled. When a
| state or transition is under-specified an LLM can pass the
| ball back and just ask what should happen when A and B but
| not C. Or follow more vague guidance on what should happen in
| edge cases. A typical business person is perfectly capable of
| describing how invoicing should work and when refunds should
| be issued, but very few business people can write a few
| thousand lines of code that covers all the cases.
| discreteevent wrote:
| > an LLM can pass the ball back and just ask what should
| happen when A and B but not C
|
| What should the colleagues of the business person review
| before deciding that the system is fit for purpose? Or what
| should they review when the system fails? Should they go
| back over the transcript of the conversation with the LLM?
| ben_w wrote:
| As an LLM can output source code, that's all answerable
| with "exactly what they already do when talking to
| developers".
| discreteevent wrote:
| There are two reasons the system might fail:
|
| 1) The business person made a mistake in their
| conversation/specification.
|
| In this case the LLM will have generated code and tests
| that match the mistake. So all the tests will pass. The
| best way to catch this _before it gets to production_ is
| to have someone else review the specification. But the
| problem is that the specification is a long trial-and-
| error conversation in which later parts may contradict
| earlier parts. Good luck reviewing that.
|
| 2) The LLM made a mistake.
|
| The LLM may have made the mistake because of a
| hallucination which it cannot correct because in trying
| to correct it the same hallucination invalidates the
| correction. At this point someone has to debug the
| system. But we got rid of all the programmers.
| ben_w wrote:
| This still resolves as "business person asks for code,
| business person gets code, business person says if code
| is good or not, business person deploys code".
|
| That an LLM or a human is where the code comes from,
| doesn't make much difference.
|
  | Though it does _kinda_ sound like you're assuming all
| LLMs must develop with Waterfall? That they can't e.g.
| use Agile? (Or am I reading too much into that?)
| discreteevent wrote:
| > business person says if code is good or not
|
| How do they do this? They can't trust the tests because
| the tests were also developed by the LLM which is working
| from incorrect information it received in a chat with the
| business person.
| ben_w wrote:
| The same way they already do with humans coders whose
| unit tests were developed by exactly same flawed
| processes:
|
| _Mediocrely_.
|
| Sometimes the current process works, other times the
| planes fall out of the sky, or updates causes millions of
| computers to blue screen on startup at the same time.
|
| LLMs in particular, and AI in general, doesn't need to
| _beat_ humans at the same tasks.
| gizmo wrote:
| How does a business person today decide if a system is
| fit for purpose when they can't read code? How is this
| different?
| Jensson wrote:
| They don't, the software engineer does that. It is
| different since LLMs can't test the system like a human
| can.
|
  | Once the system can test and update the spec to fix errors
  | in it, build the program, and ensure the result is
  | satisfactory, we have AGI. If you argue an AGI could do it,
  | then yeah, it could, since it can replace humans at
  | everything - but the argument was about an AI that isn't
  | yet AGI.
| gizmo wrote:
| The world runs on fuzzy underspecified processes. On
| excel sheets and post-it notes. Much of the world's
| software needs are not sophisticated and don't require
| extensive testing. It's OK if a human employee is in the
  | loop and has to intervene sometimes when an AI-built
| system malfunctions. Businesses of all sizes have
| procedures where problems get escalated to more senior
| people with more decision-making power. The world is
| already resilient against mistakes made by
| tired/inattentive/unintelligent people, and mistakes made
| by dumb AI systems will blend right in.
| discreteevent wrote:
| > The world runs on fuzzy underspecified processes. On
| excel sheets and post-it notes.
|
| Excel sheets are not fuzzy and underspecified.
|
| > It's OK if a human employee is in the loop and has to
  | intervene sometimes
|
| I've never worked on software where this was OK. In many
| cases it would have been disastrous. Most of the time a
| human employee could not fix the problem without
| understanding the software.
| gizmo wrote:
| All software that interops with people, other businesses,
| APIs, deals with the physical world in any way, or
| handles money has cases that require human intervention.
| It's 99.9% of software if not more. Security updates.
| Hardware failures. Unusual sensor inputs. A sudden influx
| of malformed data. There is no such thing as an entirely
| autonomous system.
|
| But we're not anywhere close to maximally automated.
| Today (many? most?) office workers do manual data entry
| and processing work that requires very little thinking.
| Even automating just 30% of their daily work is a huge
| win.
| jimbokun wrote:
| The reward function could be "pass all of these tests I just
| wrote".
| marcosdumay wrote:
| Lol. Literally.
|
  | If you have that many well-written tests, you can pass
| them to a constraint solver today and get your program. No
| LLM needed.
|
| Or even run your tests instead of the program.
| emporas wrote:
| Probably the parent assumes that he does have the tests,
| billions of them.
|
  | One very strong LLM could generate billions of tests
  | alongside the working code and then train another smaller
  | model, or feed it into the next training iteration of the
  | same strong model. Strong LLMs do exist for that purpose,
  | e.g. Nemotron-4 340B and Llama 3.1 405B.
  |
  | It would be interesting if a dataset like that were
  | created and then released as open source. Many LLMs,
  | proprietary or not, could incorporate the dataset in
  | their training, and hundreds of LLMs on the internet would
  | suddenly become much better at coding, all of them at
  | once.
| acchow wrote:
| After much RL, the model will just learn to mock everything
| to get the test to pass.
| ryukoposting wrote:
| If we will struggle to create reward functions for AI, then
| how different is that from the struggles we _already face_
| when divvying up product goals into small tasks to fit our
| development cycles?
|
| In other words, to what extent does Agile's ubiquity prove
| our competence in turning product goals into _de facto_
| reward functions?
| paxys wrote:
| Exactly, and people have been saying this for a while now. If
| an "AI software engineer" needs a perfect spec with zero
| ambiguity, all edge cases defined, full test coverage with
| desired outcomes etc., then the person writing the spec is
| the actual software engineer, and the AI is just a compiler.
| satvikpendem wrote:
| Reminds me of when computers were literally humans
| computing things (often women). How time weaves its
| circular web.
| dartos wrote:
  | We've also learned that starting off with a rigidly defined
  | spec is actually harmful for most user-facing software,
| since customers change their minds so often and have a hard
| time knowing what they want right from the start.
| diffxx wrote:
| This is why most of the best software is written by
| people writing things for themselves and most of the
| worst is made by people making software they don't use
| themselves.
| _the_inflator wrote:
| Exactly. This is what I tell everyone. The harder you work
| on specs the easier it gets in the aftermath. And this is
| exactly what business with lofty goals doesn't get or
| ignores. Put another way: a fool with a tool...
|
| Also look out for optimization the clever way.
| sgu999 wrote:
| > then the person writing the spec is the actual software
| engineer
|
| Sounds like this work would involve asking questions to
| collaborators, guess some missing answers, write specs and
| repeat. Not that far ahead of the current sota of AI...
| nyrikki wrote:
  | Same reason the visual programming paradigm failed: the
  | main problem is not the code.
|
| While writing simple functions may be mechanistic, being
| a developer is not.
|
| 'guess some missing answers' is why Waterfall, or any big
| upfront design has failed.
|
| People aren't simply loading pig iron into rail cars like
| Taylor assumed.
|
| The assumption of perfect central design with perfect
| knowledge and perfect execution simply doesn't work for
  | systems which are far more like an organism than a
| machine.
| gizmo wrote:
| Waterfall fails when domain knowledge is missing.
| Engineers won't take "obvious" problems into
| consideration when they don't even know what the right
| questions to ask are. When a system gets rebuild for the
| 3rd time the engineers do know what to build and those
| basic mistakes don't get made.
|
| Next gen LLMs, with their encyclopedic knowledge about
| the world, won't have that problem. They'll get the
| design correct on their first attempt because they're
| already familiar with the common pitfalls.
|
| Of course we shouldn't expect LLMs to be a magic bullet
| that can program anything. But if your frame of reference
| is "visual programming" where the goal is to turn poorly
| thought out requirements into a reasonably sensible state
| machine then we should expect LLMs to get very good at
| that compared to regular people.
| nyrikki wrote:
  | LLMs are NLP; what you are talking about is NLU, which
  | has been considered an AI-hard problem for a long time.
|
| I keep looking for discoveries that show any movement
| there. But LLMs are still basically pattern matching and
| finding.
|
| They can do impressive things, but they actually have no
  | concept of what the 'right thing' even is; it is
  | statistics, not philosophy.
| qup wrote:
| What makes you think they'll need a perfect spec?
|
| Why do you think they would need a more defined spec than a
| human?
| digging wrote:
| A human has the ability to contact the PM and say, "This
| won't work, for $reason," or, "This is going to look
| really bad in $edgeCase, here are a couple options I've
| thought of."
|
| There's nothing about AI that makes such operations
| intrinsically impossible, but they require much more than
| just the ability to generate working code.
| mattnewton wrote:
| I mean, that's already the case in many places, the senior
| engineer / team lead gathering requirements and making
| architecture decisions is removing enough ambiguity to hand
  | it off to juniors churning out the code. This just gives you
  | very cheap, very fast-typing, but uncreative and a little
  | dull, junior developers.
| mlavrent wrote:
| This is not quite right - a specification is not equivalent
| to writing software, and the code generator is not just a
| compiler - in fact, generating implementations from
| specifications is a pretty active area of research (a
| simpler problem is the problem of generating a
| configuration that satisfies some specification,
| "configuration synthesis").
|
| In general, implementations can be vastly more complicated
| than even a complicated spec (e.g. by having to deal with
| real-world network failures, etc.), whereas a spec needs
| only to describe the expected behavior.
|
| In this context, this is actually super useful, since
| defining the problem (writing a spec) is usually easier
| than solving the problem (writing an implementation); it's
| not just translating (compiling), and the engineer is now
| thinking at a higher level of abstraction (what do I want
| it to do vs. how do I do it).
| airstrike wrote:
| My reward in Rust is often when the code actually compiles...
| _the_inflator wrote:
| Full circle but instead of determinism you introduce some
| randomness. Not good.
|
| Also the reasoning is something business is dissonant about.
| The majority of planning and execution teams stick to
  | processes. I see way more potential in automating these than in
  | all the parts of app production.
|
  | Business is going to have a hard time when it believes it
  | alone can orchestrate some AI consoles.
| tomrod wrote:
| You can define one based on passed tests, code coverage,
| other objectives, or weighted combinations without too much
| loss of generality.
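  |
  | (A minimal sketch of such a weighted reward; the weights and
  | metric names are illustrative, not from any framework.)
  |
  |     def reward(tests_passed, tests_total, coverage, lint_errors,
  |                w_pass=0.7, w_cov=0.2, w_lint=0.1):
  |         # Combine pass rate, coverage (0..1), and a lint penalty
  |         # into a single scalar in [0, 1].
  |         pass_rate = tests_passed / max(tests_total, 1)
  |         lint_score = 1.0 / (1.0 + lint_errors)
  |         return w_pass * pass_rate + w_cov * coverage + w_lint * lint_score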
| LeifCarrotson wrote:
| I think you could set up a good reward function for a
| programming assistance AI by checking that the resulting code
| is actually used. Flag or just 'git blame' the code produced
| by the AI with the prompts used to produce it, and when you
| push a release, it can check which outputs were retained in
| production code from which prompts. Hard to say whether code
| that needed edits was because the prompt was bad or because
| the code was bad, but at least you can get positive feedback
| when a good prompt resulted in good code.
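  |
  | (A rough sketch of that retention signal, assuming the
  | assistant's commits carry a dedicated author name; illustrative
  | only.)
  |
  |     import subprocess
  |
  |     def ai_line_share(path, ai_author="ai-assistant"):
  |         # Fraction of surviving lines in a released file that are still
  |         # attributed to the assistant's commits.
  |         blame = subprocess.run(["git", "blame", "--line-porcelain", path],
  |                                capture_output=True, text=True).stdout
  |         authors = [line[len("author "):] for line in blame.splitlines()
  |                    if line.startswith("author ")]
  |         if not authors:
  |             return 0.0
  |         return sum(a == ai_author for a in authors) / len(authors)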
| rfw300 wrote:
| GitHub Copilot's telemetry does collect data on whether
| generated code snippets end up staying in the code, so
| presumably models are tuned on this feedback. But you
| haven't solved any of the problems set out by Karpathy here
| --this is just bankshot RLHF.
| bee_rider wrote:
| That could be interesting but it does seem like a much
| fuzzier and slower feedback loop than the original idea.
|
| It also seems less unique to code. You could also have a
| chat bot write an encyclopedia and see if the encyclopedias
| sold well. Chat bots could edit Wikipedia and see if their
| edits stuck as a reward function (seems ethically pretty
| questionable or at least in need of ethical analysis, but
| it is possible).
|
| The maybe-easy to evaluate reward function is an
| interesting aspect of code (which isn't to say it is the
| only interesting aspect, for sure!)
| axus wrote:
| If they get permission and don't mind waiting, they could
| check if people throw away the generated code or keep it as-
| is.
| consteval wrote:
| There's levels to this.
|
| Certainly "compiled" is one reward (although a blank file
| fits that...) Another is test cases, input and output. This
| doesn't work on a software-wide scale but function-wide it
| can work.
|
| In the future I think we'll see more of this test-driven
| development. Where developers formally define the
| requirements and expectations of a system and then an LLM
| (combined with other tools) generates the implementation. So
| instead of making the implementation, you just declaratively
| say what the implementation should do (and shouldn't).
| waldrews wrote:
| There's no reward function in the sense that optimizing the
| reward function means the solution is ideal.
|
| There are objective criteria like 'compiles correctly' and
| 'passes self-designed tests' and 'is interpreted as correct
| by another LLM instance' which go a lot further than criteria
| that could be defined for most kinds of verbal questions.
| davedx wrote:
| I'm pretty interested in the theorem proving/scientific
| research aspect of this.
|
| Do you think it's possible that some version of LLM technology
| could discover new physical theories (that are experimentally
| verifiable), like for example a new theory of quantum gravity,
| by exploring the mathematical space?
|
| Edit: this is just incredibly exciting to think about. I'm not
| an "accelerationist" but the "singularity" has never felt
| closer...
| esjeon wrote:
| IIRC, there have been people doing similar things using
| something close to brute-force. Nothing of real significance
| has been found. A problem is that there are infinitely many
| physically and mathematically correct theorems that would add
| no practical value.
| gizmo wrote:
| My hunch is that LLMs are nowhere near intelligent enough to
| make brilliant conceptual leaps. At least not anytime soon.
|
| Where I think AI models might prove useful is in those cases
| where the problem is well defined, where formal methods can
| be used to validate the correctness of (partial) solutions,
| and where the search space is so large that work towards a
| proof is based on "vibes" or intuition. Vibes can be trained
| through reinforcement learning.
|
| Some computer assisted proofs are already hundreds of pages
| or gigabytes long. I think it's a pretty safe bet that really
| long and convoluted proofs that can only be verified by
| computers will become more common.
|
| https://en.wikipedia.org/wiki/Computer-assisted_proof
| CuriouslyC wrote:
| They don't need to be intelligent to make conceptual leaps.
| DeepMind stuff just does a bunch of random RL experiments
| until it finds something that works.
| tsimionescu wrote:
| I think the answer is almost certainly no, and is mostly
| unrelated to how smart LLMs can get. The issue is that any
| theory of quantum gravity would only be testable with
| equipment that is much, much more complex than what we have
| today. So even if the AI came up with some beautifully simple
| theory, testing that its predictions are correct is still not
| going to be feasible for a very long time.
|
| Now, it is possible that it could come up with some theory
| that is radically different from current theories, where
| quantum gravity arises very naturally, and that fits all of
  | the other predictions of the current theories that we can
  | measure - so we would have good reasons to believe the new
  | theory and consider quantum gravity _probably_ solved. But
  | it's literally impossible to predict whether such a theory even
| exists, that is not mathematically equivalent to QM/QFT but
| still matches all confirmed predictions.
|
| Additionally, nothing in AI tech so far predicts that current
| approaches should be any good at this type of task. The only
  | tasks where AI has truly excelled are extremely well
| defined problems where there is a huge but finite search
| space; and where partial solutions are easy to grade. Image
| recognition, game playing, text translation are the great
| successes of AI. And performance drops sharply with the
| uncertainty in the space, and with the difficulty of judging
| a partial solution.
|
| Finding physical theories is nothing like any of these
| problems. The search space is literally infinite, partial
| solutions are almost impossible to judge, and even judging
| whether a complete solution is good or not is extremely
| difficult. Sure, you can check if it's mathematically
| coherent, but that tells you nothing about whether it
| describes the physical world correctly. And there are plenty
| of good physical theories that aren't fully formally proven,
| or weren't at the time they were invented - so mathematical
| rigour isn't even a very strong signal (e.g. Newton's
  | infinitesimal calculus wasn't considered sound until the
  | 1900s or something, by which time his theories had long since
  | been rewritten in other terms; the Dirac delta wasn't given a
  | precise mathematical definition until much later than its
| uses; and I think QFT still uses some iffy math even today).
| jimbokun wrote:
| Current LLMs are optimized to produce output most resembling
| what a human would generate. Not surpass it.
| ben_w wrote:
| The output most _pleasing_ to a human, which is both better
| and worse.
|
| Better, when we spot mistakes even if we couldn't create
| the work with the error. Think art: most of us can't draw
| hands, but we can spot when Stable Diffusion gets them
| wrong.
|
| Worse also, because there are many things which are "common
| sense" and wrong, e.g. https://en.wikipedia.org/wiki/Catego
| ry:Paradoxes_in_economic..., and we would collectively
| down-vote a perfectly accurate model of reality for
| violating our beliefs.
| incorrecthorse wrote:
| Unless you want an empty test suite or a test suite full of
| `assert True`, the reward function is more complicated than you
| think.
| rafaelmn wrote:
| Code coverage exists. Shouldn't be hard at all to tune the
| parameters to get what you want. We have really good tools to
| reason about code programmatically - linters, analyzers,
| coverage, etc.
| SkiFire13 wrote:
| In my experience they are ok (not excellent) for checking
| whether some code will crash or not. But checking whether
| the code logic is correct with respect to the requirements
  | is far from being automated.
| rafaelmn wrote:
| But for writing tests that's less of an issue. You start
| with known good/bad code and ask it to write tests
  | against a spec for some code X - then the evaluation
  | criterion is something like: did the test cover the
  | expected lines and produce the expected outcome
  | (success/fail)? Pepper in lint rules for preferred style
| etc.
| SkiFire13 wrote:
  | But this will lead you to the same problem the tweet is
  | talking about! You are training a reward model based on human
| feedback (whether the code satisfies the specification or
| not). This time the human feedback may seem more
| objective, but in the end it's still non-exhaustive human
| feedback which will lead to the reward model being
| vulnerable to some adversarial inputs which the other
| model will likely pick up pretty quickly.
| rafaelmn wrote:
| It's based on automated tools and evaluation (test
| runner, coverage, lint) ?
| SkiFire13 wrote:
| The input data is still human produced. Who decides what
| is code that follows the specification and what is code
| that doesn't? And who produces that code? Are you sure
| that the code that another model produces will look like
| that? If not then nothing will prevent you from running
| into adversarial inputs.
|
| And sure, coverage and lints are objective metrics, but
| they don't directly imply the correctness of a test. Some
| tests can reach a high coverage and pass all the lint
| checks but still be incorrect or test the wrong thing!
|
| Whether the test passes or not is what's mostly
  | correlated with whether it's correct or not. But similarly,
  | for an image recognizer, the question of whether an image is
  | a flower or not is also objective and correlated, and yet
  | researchers continue to find adversarial inputs for image
  | recognizers due to bias in their training data. What
| makes you think this won't happen here too?
| rafaelmn wrote:
| > The input data is still human produced
|
| So are rules for the game of go or chess ? Specifying
| code that satisfies (or doesn't satisfy) is a problem
| statement - evaluation is automatic.
|
| > but they don't directly imply the correctness of a
| test.
|
| I'd be willing to bet that if you start with an existing
| coding model and continue training it with coverage/lint
| metrics and evaluation as feedback you'd get better at
| generating tests. Would be slow and figuring out how to
| build a problem dataset from existing codebases would be
| the hard part.
| SkiFire13 wrote:
| > So are rules for the game of go or chess ?
|
| The rules are well defined and you can easily write a
| program that will tell whether a move is valid or not, or
  | whether a game has been won or not. This allows you to
  | generate a virtually infinite amount of data to train the
  | model on without human intervention.
|
| > Specifying code that satisfies (or doesn't satisfy) is
| a problem statement
|
| This would be true if you fix one specific program (just
| like in Go or Chess you fix the specific rules of the
| game and then train a model on those) and want to know
| whether that specific program satisfies some given
| specification (which will be the input of your model).
| But if instead you want the model to work with any
| program then that will have to become part of the input
  | too and you'll have to train it on a number of programs
| which will have to be provided somehow.
|
| > and figuring out how to build a problem dataset from
| existing codebases would be the hard part
|
| This is the "Human Feedback" part that the tweet author
| talks about and the one that will always be flawed.
| layer8 wrote:
| Who writes the spec to write tests against?
|
  | In the end, you are replacing the application code with a
| spec, which needs to have a comparable level of detail in
| order for the AI to not invent its own criteria.
| incorrecthorse wrote:
| Code coverage proves that the code runs, not that it does
| what it should do.
| rafaelmn wrote:
| If you have a test that completes with the expected
| outcome and hits the expected code paths you have a
| working test - I'd say that heuristic will get you really
| close with some tweaks.
| littlestymaar wrote:
| It's not trivial to get right but it sounds within reach,
| unlike "hallucinations" with general purpose LLM usage.
| gizmo wrote:
| It's easy to imagine why something could never work.
|
| It's more interesting to imagine what just might work. One
| thing that has plagued programmers for the past decades is
| the difficulty of writing correct multi-threaded software.
| You need fine-grained locking otherwise your threads will
| waste time waiting for mutexes. But color-coding your program
| to constrain which parts of your code can touch which data
| and when is tedious and error-prone. If LLMs can annotate
| code sufficiently for a SAT solver to prove thread safety
| that's a huge win. And that's just one example.
| imtringued wrote:
| Rust is that way.
| WithinReason wrote:
| Adversarial networks are a straightforward solution to this.
| The reward for generating and solving tests is different.
| imtringued wrote:
| That's a good point. A model that is capable of
| implementing a nonsense test is still better than a model
| that can't. The implementer model only needs a good variety
| of tests. They don't actually have to translate a prompt
| into a test.
| yard2010 wrote:
| Once you have enough data points from current usage - and these
| days every company tracks EVERYTHING, even eye movement if it
| could - it's just a matter of time. I do agree, though, that
| before we reach AGI we'll have agents that are really good at a
| defined mission (like code completion).
|
| It's not even about LLMs IMHO. It's about letting a computer
| crunch many numbers and find a pattern in the results, in a
| quasi religious manner.
| anshumankmr wrote:
| Unless it takes maximizing code coverage as the objective
| and starts deleting failed test cases.
| FooBarBizBazz wrote:
| > Coding AI can write tests, write code, compile, examine
| failed test cases, search for different coding solutions that
| satisfy more test cases or rewrite the tests, all in an
| unsupervised loop. And then whole process can turn into
| training data for future AI coding models.
|
| This is interesting, but doesn't it still need supervision? Why
| wouldn't it generate tests for properties you don't want? It
| seems to me that it might be able to "fill in gaps" by
| generalizing from "typical software", like, if you wrote a
| container class, it might guess that "empty" and "size" and
| "insert" are supposed to be related in a certain way, based on
| the fact that other peoples' container classes satisfy those
| properties. And if you look at the tests it makes up and go,
| "yeah, I want that property" or not, then you can steer what
| it's doing, or it can at least force you to think about more
| cases. But there would still be supervision.
|
| Ah -- here's an unsupervised thing: Performance. Maybe it can
| guide a sequence of program transformations in a profile-guided
| feedback loop. Then you could really train the thing to make
| fast code. You'd pass "-O99" to gcc, and it'd spin up a GPU
| cluster on AWS.
| pilooch wrote:
| Yes, same for maths, as long as a true reward 'surface' can be
| optimized. Approximate rewards are similar to approximate, non-
| admissible heuristics: search eventually misses true optimal
| states and favors wrong ones, with side effects in very large
| state spaces.
| xxs wrote:
| This reads as a proper marketing ploy. If the current
| incarnation of AI + coding is anything to go by - it'll take
| leaps just to make it barely usable (or correct)
| EugeneOZ wrote:
| TDD approach could play the RL role.
| jgalt212 wrote:
| But what makes you think the ai generated tests will
| correctly represent the problem at hand?
| Kiro wrote:
| My take is the opposite: considering how _good_ AI is at
| coding right now I'm eager to see what comes next. I don't
| know what kind of tasks you've tried using it for but I'm
| surprised to hear someone think that it's not even "barely
| usable". Personally, I can't imagine going back to
| programming without a coding assistant.
| Barrin92 wrote:
| > but I'm surprised to hear someone think that it's not
| even "barely usable".
|
| Write performance-oriented and memory-safe C++ code.
| Current coding assistants are glorified autocomplete for
| unit tests or short API endpoints or what have you, but if
| you have to write any safety-oriented code, or you have to
| think about what the hardware does, it's unusable.
|
| I tried using several of the assistants and they write
| broken or non-performant code so regularly that it's
| irresponsible to use them.
| agos wrote:
| I've also had trouble having assistants help with CSS,
| which is ostensibly easier than performance oriented and
| memory safe C++
| imtringued wrote:
| Isn't this a good reward function for RL? Take a
| codebase's test suite. Rip out a function, let the LLM
| rewrite the function, benchmark it and then RL it using
| the benchmark results.
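|
| A minimal sketch of what that reward might look like (the repo
| layout, pytest as the test runner, and the benchmark command are
| assumptions for illustration):
|
|     import subprocess, time
|
|     def reward_for_rewrite(repo_dir, bench_cmd, baseline_secs):
|         """0 if the tests fail, else the relative speedup."""
|         # A failing test suite gets no reward at all.
|         tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
|         if tests.returncode != 0:
|             return 0.0
|         # Time the benchmark for the candidate rewrite.
|         start = time.perf_counter()
|         subprocess.run(bench_cmd, cwd=repo_dir, check=True)
|         elapsed = time.perf_counter() - start
|         # Clip so one lucky run doesn't dominate the update.
|         return max(0.0, min(baseline_secs / elapsed - 1.0, 5.0))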
| commodoreboxer wrote:
| I've been playing with it recently, and I find unless there
| are very clear patterns in surrounding code or on the
| Internet, it does quite terribly. Even for well-seasoned
| libraries like V8 and libuv, it can't reliably not make up
| APIs that don't exist and it very regularly spits out
| nonsense code. Sometimes it writes code that runs but does
| the wrong thing, and it can't reliably make good decisions
| around undefined behavior. The worst is when I've asked for
| it to refactor code, and it actually subtly changes the
| behavior in the process.
|
| I imagine it's great for CRUD apps and generating unit
| tests, but for anything reliable where I work, it's not
| even close to being useful at all, let alone a game
| changer. It's a shame, because it's not like I really enjoy
| fiddling with memory buffers and painstakingly avoiding UB,
| but I still have to do it (I love Rust, but it's not an
| option for me because I have to support AIX. V8 in Rust
| also sounds like a nightmare, to be honest. It's a very C++
| API).
| ben_w wrote:
| I've seen them all over the place.
|
| The best are shockingly good... so long as their context
| doesn't expire and they forget e.g. the Vector class they
| just created has methods `.mul(...)` rather than
| `.multiply(...)` or similar. Even the longer context
| windows are still too short to really take over our jobs
| (for now); the haystack tests seem to be over-estimating
| their quality in this regard.
|
| The worst LLMs that I've seen (one of the downloadable
| run-locally models, but I forget which) -- one of my
| standard tests is that I ask them to "write Tetris as a web
| app", and it started off doing something a little bit wrong
| (square grid), before _giving up on that task entirely and
| switching from JavaScript to Python and continuing by
| writing a script to train a new machine learning model_
| (and people still ask how these things will "get out of
| the box" :P)
|
| People who see more of the latter? I can empathise with
| them dismissing the whole thing as "just autocomplete on
| steroids".
| CuriouslyC wrote:
| Models aren't going to get really good at theorem proving until
| we build models that are transitive and handle isomorphisms
| more elegantly. Right now models can't recall factual
| relationships well in reverse order in many cases, and often
| fail to answer questions that they can answer easily in English
| when prompted to respond with the fact in another language.
| djeastm wrote:
| >Coding AI can write tests, write code, compile, examine failed
| test cases, search for different coding solutions that satisfy
| more test cases or rewrite the tests, all in an unsupervised
| loop.
|
| Will this be able to be done without spending absurd amounts of
| energy?
| jimbokun wrote:
| Energy efficiency might end up being the final remaining axis
| on which biological brains surpass manufactured ones before
| the singularity.
| commodoreboxer wrote:
| The amount of energy is truly absurd. I don't chug a 16 oz
| bottle of water every time I answer a question.
| ben_w wrote:
| Computer energy efficiency is not as constrained as minimum
| feature size; it's still doubling every 2.6 years or so.
|
| Even if it were, a human-quality AI that runs at human
| speed on 10x our body's calorie requirement in electricity
| would still (at electricity prices of USD 0.1/kWh) undercut
| workers earning the UN abject poverty threshold.
| jimbokun wrote:
| Future coding where developers only ever write the tests is an
| intriguing idea.
|
| Then the LLM generates and iterates on the code until it passes
| all of the tests. New requirements? Add more tests and repeat.
|
| This would be legitimately paradigm shifting, vs. the super
| charged auto complete driven by LLMs we have today.
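|
| The outer loop is easy to sketch; everything hard hides inside
| the model call and the quality of the tests. A rough sketch,
| where generate() stands in for a hypothetical LLM call:
|
|     import subprocess
|
|     def code_from_tests(test_file, generate, max_iters=20):
|         """Regenerate an implementation until the tests pass."""
|         feedback = ""
|         for _ in range(max_iters):
|             candidate = generate(test_file, feedback)
|             with open("impl.py", "w") as f:
|                 f.write(candidate)
|             run = subprocess.run(["pytest", test_file, "-q"],
|                                  capture_output=True, text=True)
|             if run.returncode == 0:
|                 return candidate      # all tests green
|             feedback = run.stdout     # feed failures back in
|         raise RuntimeError("no passing implementation found")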
| layer8 wrote:
| Tests don't prove correctness of the code. What you'd really
| want instead is to specify invariants the code has to
| fulfill, and for the AI to come up with a machine-checkable
| proof that the code indeed guarantees those invariants.
| lewtun wrote:
| > I expect language models to also get crazy good at
| mathematical theorem proving
|
| Indeed, systems like AlphaProof / AlphaGeometry are already
| able to win a silver medal at the IMO, and the former relies on
| Lean for theorem verification [1]. On the open source side, I
| really like the ideas in LeanDojo [2], which use a form of RAG
| to assist the LLM with premise selection.
|
| [1] https://deepmind.google/discover/blog/ai-solves-imo-
| problems...
|
| [2] https://leandojo.org/
| lossolo wrote:
| Writing tests won't help you here; this problem is the same as
| in other generation tasks. If the test passes, everything seems
| okay, right? Consider this: you now have a 50-line function
| just to display 'hello world'. It outputs 'hello world', so it
| scores well, but it's hardly efficient. Then, there's a
| function that runs in exponential time instead of the standard
| polynomial time that any sensible programmer would use in
| specific cases. It passes the tests, so it gets a high score.
| You also have assembly code embedded in C code, executed with
| 'asm'. It works for that particular case and passes the test,
| but the average C programmer won't understand what's happening
| in this code, whether it's secure, etc. Lastly, tests written
| by AI might not cover all cases; they could even fail to test
| what you intended because they might hallucinate scenarios
| (I've experienced this many times). Programming faces similar
| issues to those seen in other generation tasks in the current
| generation of large language models, though to a slightly
| lesser extent.
| jononor wrote:
| One can imagine critics and code rewriters that optimize for
| computational, code style, and other requirements in addition
| to tests.
| mjburgess wrote:
| It always annoys and amazes me that people in this field have no
| basic understanding that closed-world finite-information abstract
| games are a unique and trivial problem. So much of the so-called
| "world model" ideological mumbojumbo comes from these setups.
|
| Sampling board state from an abstract board space isn't a
| statistical inference problem. There's no missing information.
|
| The whole edifice of science is a set of experimental and
| inferential practices to overcome the massive information gap
| between the state of a measuring device and the state of what, we
| believe, it measures.
|
| In the case of natural language the gap between a sequence of
| symbols, "the war in ukraine" and those aspects of the world
| these symbols refer to is _enormous_.
|
| The idea that there is even an RL-style "reward" function to
| describe this gap is pseudoscience. As is the false equivocation
| between sampling of abstracta such as games, and _measuring the
| world_.
| harshitaneja wrote:
| Forgive my naivete here, but even though solutions to those
| finite-information abstract games are trivial, they are not
| necessarily tractable (for a looser definition of tractable
| here), and we still need to build heuristics for the subclass
| of such problems where we need solutions in a given finite time
| frame. Those heuristics might not be easy to deduce, and hence
| such models help in ascertaining them.
| mjburgess wrote:
| Yes, and this is how computer "scientists" think of problems
| -- but this isn't science, it's mathematics.
|
| If you have a process, e.g., points = sample(circle), which
| fully describes its target as n->inf (i.e., points = circle as
| n->inf), you aren't engaged in statistical inference. You might
| be using some of the same formulas, but the whole system of
| science and statistics has been created for a radically
| different problem with radically different semantics to
| everything you're doing.
|
| E.g., the height of mercury in a thermometer _never_ becomes
| the liquid being measured.. it might seem insane
| /weird/obvious to mention this... but we literally have
| Berkeleian-style neoidealists in AI research who don't realise
| this...
|
| Who think that because you can find representations of
| abstracta in other spaces they can be projected in.. that
| this therefore tells you anything at all about inference
| problems. As if it were the neural network algorithm itself (a
| series of multiplications and additions) that "revealed the
| truth" in all data given to it. This, of course, is
| pseudoscience.
|
| It only applies to mathematical problems, for obvious
| reasons. If you use a function approximation algorithm to
| approximate a function, do not be surprised that you can
| succeed. The issue is that the relationship between, say, the
| state of a thermometer and the state of the temperature of its
| target system is not an abstract function which lives in the
| space of temperature readings.
|
| More precisely, in the space of temperature readings the
| actual causal relationship between the height of the mercury
| and the temperature of the target shows up as an _infinite_
| number of temperature distributions (with any given trained
| NN learning only one of these). None of which is a law of
| nature -- laws of nature are not given by distributions in
| measuring devices.
| pyrale wrote:
| > [...] and trivial problem.
|
| It just took decades and impressive breakthroughs to solve; I
| wouldn't really call it "trivial". However, I do agree with you
| that they're a class of problem different from problems with no
| clear objective function, and probably much easier to reason
| about than those.
| mjburgess wrote:
| They're a trivial inference problem, not a trivial problem to
| solve as such.
|
| As in, if I need to infer the radius of a circle from N
| points sampled from that circle.. yes, I'm sure there's a
| textbook of algorithms/etc. with a lot of work spent on them.
|
| But in the sense of statistical inference, you're only
| learning a property of a distribution given that
| distribution.. there isn't any inferential gap. As N->inf,
| you recover the entire circle itself.
|
| Compare with, say, learning the 3d structure of an object from
| 2d photographs. At any rotation of that object, you have a
| new pixel distribution. So in pixel-space a 3d object is an
| infinite number of distributions; and the inference goal in
| pixel-space is to choose between sets of these infinities.
|
| That's actually impossible without bridging information (ie.,
| some theory). And in practice, it isn't solved in pixel
| space... you suppose some 3d geometry and use data to refine
| it. So you solve it in 3d-object-property-space.
|
| With AI techniques, you have ones which work on abstracta
| (e.g., circles) being used on measurement data. So you're
| solving the 3d/2d problem in pixel space, expecting this to
| work because "objects are made out of pixels, aren't they?"
| NO.
|
| So there's a huge inferential gap that you cannot bridge
| here. And the young AI fanatics in research keep milling out
| papers showing that it does work, so long as it's a circle,
| chess, or some abstract game.
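|
| To make the circle example concrete (a toy sketch): the estimate
| converges because sample and target live in the same abstract
| space, which is exactly what fails for the thermometer.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     true_radius = 2.5
|     for n in (10, 1_000, 100_000):
|         theta = rng.uniform(0, 2 * np.pi, n)
|         pts = true_radius * np.stack([np.cos(theta),
|                                       np.sin(theta)], axis=1)
|         # Mean distance from the centroid -> true radius as n grows.
|         est = np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
|         print(n, est)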
| meroes wrote:
| Yes. Quantum mechanics for example is not something that could
| have been thought of even conceptually by anything "locked in a
| room". Logically coherent structure space is so mind bogglingly
| big we will never come close to even the smallest fraction of
| it. Science recognizes that only experiments will bring
| structures like QM out of the infinite sea into our conceptual
| space. And as a byproduct of how experiments work, the concepts
| will match (model) the actual world fairly well. The armchair
| is quite limiting, and I don't see how LLMs aren't locked to
| it.
|
| AGI won't come from this set of tools. Sam Altman just wants to
| buy himself a few years of time to find their next product.
| DAGdug wrote:
| Who doesn't? Karpathy, and pretty much every researcher at
| OpenAI/DeepMind/FAIR, absolutely knows the trivial concept of
| fully observable versus partially observable environments,
| which is reinforcement learning 101.
| mjburgess wrote:
| Many don't understand it as a semantic difference
|
| i.e., that when you're taking data from a thermometer in order
| to estimate the temperature of coffee, the issue isn't simply
| partial information
|
| It's that the information is _about_ the mercury, not the
| coffee. In order to bridge the two you need a theory (e.g.,
| about the causal reliability of heating / room temp / etc.)
|
| So this isn't just a partial/full information problem -- these
| are still mathematical toys. This is a _reality_ problem.
| This is a you're-dealing-with-a-causal-relationship-between-
| physical-systems problem. This is _not_ a mathematical
| relationship. It isn't merely partial, it is not a matter of
| "information" at all. No amount could ever make the mercury,
| coffee.
|
| Computer scientists have been trained on mathematics and
| deployed as social scientists, and the naivete is incredible.
| toxik wrote:
| Another little tidbit about RLHF and InstructGPT is that the
| training scheme is by far dominated by supervised learning. There
| is a bit of RL sprinkled on top, but the RL term is scaled down
| by a lot, and 8x more compute time is spent on the supervised
| loss terms.
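|
| Schematically (not the actual InstructGPT code; the coefficient
| here is just illustrative), the combined objective looks like a
| down-weighted PPO term added to a supervised language-modelling
| term:
|
|     import torch
|
|     def mixed_objective(logp_new, logp_old, advantage, lm_loss,
|                         rl_coef=0.125, clip=0.2):
|         # Clipped PPO surrogate on the RLHF reward...
|         ratio = torch.exp(logp_new - logp_old)
|         ppo = -torch.minimum(
|             ratio * advantage,
|             torch.clamp(ratio, 1 - clip, 1 + clip) * advantage
|         ).mean()
|         # ...scaled down and mixed with the supervised loss, so
|         # most of the gradient still comes from the latter.
|         return rl_coef * ppo + lm_loss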
| daly wrote:
| I think the field of proofs, such as Lean, has all the
| ingredients: states (the current subgoal), actions (the
| applicable theorems, especially effective in Lean due to strong
| typing of arguments), a progress measure (simplified subgoals),
| a final goal state (the proof completes), and a hierarchy in the
| theorems, so there is a "path metric" from simple theorems to
| complex theorems.
|
| If Karpathy were to focus on automating Lean proofs it could
| change mathematics forever.
| jomohke wrote:
| Deepmind's recent model is trained with Lean. It scored a
| silver olympiad medal (and only one point away from gold).
|
| > AlphaProof is a system that trains itself to prove
| mathematical statements in the formal language Lean. It couples
| a pre-trained language model with the AlphaZero reinforcement
| learning algorithm, which previously taught itself how to
| master the games of chess, shogi and Go
|
| https://deepmind.google/discover/blog/ai-solves-imo-problems...
| cesaref wrote:
| The final conclusion, though, stands without any justification -
| that LLM + RL will somehow outperform people at open-domain
| problem solving seems quite a jump to me.
| dosinga wrote:
| To be fair, it says "has a real shot at" and AlphaGo level.
| AlphaGo clearly beat humans on Go, so thinking that if you
| could replicate that, it would have a shot doesn't seem crazy
| to me
| SiempreViernes wrote:
| That only makes sense if you think Go is as expressive as
| written language.
|
| And here I mean that it is the act of making a single
| (plausible) move that must match the expressiveness of
| language, because otherwise you're not in the domain of Go
| but the far less interesting "I have a 19x19 pixel grid and
| two colours".
| HarHarVeryFunny wrote:
| AlphaGo has got nothing to do with LLMs though. It's a
| combination of RL + MCTS. I'm not sure where you are seeing
| any relevance! DeepMind also used RL for playing video games
| - so what?!
| esjeon wrote:
| I think the point is that it's practically impossible to
| correctly perform RLHF in open domains, so comparisons simply
| can't happen.
| bjornsing wrote:
| Perhaps not entirely open domain, but I have high hopes for "real
| RL" in coding, where you can get a reward signal from
| compile/runtime errors and tests.
| falcor84 wrote:
| Interesting, has anyone been doing this? I.e. training/fine-
| tuning an LLM against an actual coding environment, as opposed
| to just tacking that on later as a separate "agentic" construct?
| bjornsing wrote:
| I suspect that the big vendors are already doing it, but I
| haven't seen a paper on it.
| pyrale wrote:
| It's a bit disingenuous to pick go as a case to make the point
| against RLHF.
|
| Sure, a board game with an objective winning function at which
| computers are already better than humans won't get much from
| RLHF. That doesn't look like a big surprise.
|
| On the other hand, a LLM trained on lots of not-so-much curated
| data will naturally pick up mistakes from that dataset. It is not
| really feasible or beneficial to modify the dataset exhaustively,
| so you reinforce the behaviour that is expected at the end. An
| example would be training an AI in a specific field of work: it
| could repeat advice from amateurs on forums, when less-known
| professional techniques would be more advisable.
|
| Think about it like kids naturally learning swear words at
| school, and RLHF like parents that tell their kids that these
| words are inappropriate.
|
| The tweet conclusion seems to acknowledge that, but in a wishful
| way that doesn't want to concede the point.
| epups wrote:
| This is partially the reason why we see LLMs "plateauing" in the
| benchmarks. For the lmsys Arena, for example, LLMs are simply
| judged on whether the user liked the answer or not. Truth is a
| secondary part of that process, as are many other things that
| perhaps humans are not very good at evaluating. There is a limit
| to the capacity and value of having LLMs chase RLHF as a reward
| function. As Karpathy says here, we could even argue that it is
| counter productive to build a system based on human opinion,
| especially if we want the system to surpass us.
| HarHarVeryFunny wrote:
| RLHF really isn't the problem as far as surpassing human
| capability - language models trained to mimic human responses
| are fundamentally not going to do anything other than mimic
| human responses, regardless of how you fine-tune them for the
| specific type of human responses you do or don't like.
|
| If you want to exceed human intelligence, then design
| architectures for intelligence, not for copying humans!
| tpoacher wrote:
| I get the point of the article, but I think it makes a bit of a
| strawman to drive the point across.
|
| Yes, RLHF is barely RL, but you wouldn't use human feedback to
| drive a Go game unless there was no better alternative; and in
| RL, finding a good reward function is the name of the game; once
| you have that, you have no reason to prefer human feedback,
| especially if it is demonstrably worse. So, no, nobody would
| actually "prefer RLHF over RL" given the choice.
|
| But for language models, human feedback IS the ground truth (at
| least until we find a better, more mathematical alternative). If
| it weren't and we had something better, then we'd use that. But
| we don't. So no, RLHF is not "worse than RL" in this case,
| because there 'is' no 'other' RL in this case; so, here, RLHF
| actually _is_ RL.
| SiempreViernes wrote:
| If you cut out humans, what's the point of the language? Just
| use a proper binary format, I hear protobuf is popular.
| Xcelerate wrote:
| One thing I've wondered about is what the "gap" between current
| transformer-based LLMs and optimal sequence prediction looks
| like.
|
| To clarify, current LLMs (without RLHF, etc.) have a very
| straightforward objective function during training, which is to
| minimize the cross-entropy of token prediction on the training
| data. If we assume that our training data is sampled from a
| population generated via a finite computable model, then
| Solomonoff induction achieves optimal sequence prediction.
|
| Assuming we had an oracle that could perform SI (since it's
| uncomputable), how different would conversations between GPT4 and
| SI be, given the same training data?
|
| We know there would be at least a few notable differences. For
| example, we could give SI the first 100 digits of pi, and it
| would give us as many more digits as we wanted. Current
| transformer models cannot (directly) do this. We could also give
| SI a hash and ask for a string that hashes to that value. Clearly
| a lot of hard, formally-specified problems could be solved this
| way.
|
| But how different would SI and GPT4 appear in response to
| everyday chit-chat? What if we ask the SI-based sequence
| predictor how to cure cancer? Is the "most probable" answer to
| that question, given its internet-scraped training data, an
| answer that humans find satisfying? Probably not, which is why
| AGI requires something beyond just optimal sequence prediction.
| It requires a really good objective function.
|
| My first inclination for this human-oriented objective function
| is something like "maximize the probability of providing an
| answer that the user of the model finds satisfying". But there is
| more than one user, so which set of humans do we consider,
| and with which aggregation (avg satisfaction, p99 satisfaction,
| etc.)?
|
| So then I'm inclined to frame the problem in terms of well-being:
| "maximize aggregate human happiness over all time" or "minimize
| the maximum of human suffering over all time". But each of these
| objective functions has notable flaws.
|
| Karpathy seems to be hinting toward this in his post, but the
| selection of an overall optimal objective function _for human
| purposes_ seems to be an incredibly difficult philosophical
| problem. There is no objective function I can think of for which
| I cannot also immediately think of flaws with it.
| bick_nyers wrote:
| Alternatively, you include information about the user of the
| model as part of the context to the inference query, so that
| the model can uniquely optimize its answer for that user.
|
| Imagine if you could give a model "how you think" and your
| knowledge, experiences, and values as context, then it's
| "Explain Like I'm 5" on steroids. Both exciting and terrifying
| at the same time.
| Xcelerate wrote:
| > Alternatively, you include information about the user of
| the model as part of the context to the inference query
|
| That was sort of implicit in my first suggestion for an
| objective function, but do you _really_ want the model to be
| optimal on a per-user basis? There's a lot of bad people out
| there. That's why I switched to an objective function that
| considers all of humanity's needs together as a whole.
| bick_nyers wrote:
| Objective Function: Optimize on a per-user basis.
| Constraints: Output generated must be considered legal in
| user's country.
|
| Both things can co-exist without being in conflict of each
| other.
|
| My (hot) take is I personally don't believe that any LLM
| that can fit on a single GPU is capable of significant
| harm. An LLM that fits on an 8xH100 system perhaps, but I
| am more concerned about other ways an individual could
| spend ~$300k with a conviction of harming others. Besides,
| looking up how to make napalm on Google and then actually
| doing it and using it to harm others doesn't make Google
| the one responsible imo.
| marcosdumay wrote:
| > What if we ask the SI-based sequence predictor how to cure
| cancer? Is the "most probable" answer to that question, given
| its internet-scraped training data, an answer that humans find
| satisfying?
|
| You defined your predictor as being able to minimize
| mathematical definitions following some unspecified algebra;
| why didn't you define it as being able to run chemical and
| pharmacological simulations through some unspecified model too?
| Xcelerate wrote:
| I don't follow--what do you mean by unspecified algebra?
| Solomonoff induction is well-defined. I'm just asking how the
| responses of a chatbot using Solomonoff induction for
| sequence prediction would differ from those using a
| transformer model, given the same training data. I can
| specify mathematically if that makes it clearer...
| JoshuaDavid wrote:
| >But how different would SI and GPT4 appear in response to
| everyday chit-chat? What if we ask the SI-based sequence
| predictor how to cure cancer?
|
| I suspect that a lot of LLM prompts that elicit useful
| capabilities out of imperfect sequence predictors like GPT-4
| are in fact most likely to show up in the context of "prompting
| an LLM" rather than being likely to show up "in the wild".
|
| As such, to predict the token after a prompt like that, an SI-
| based sequence predictor would want to predict the output of
| whatever language model was most likely to be prompted,
| conditional on the prompt/response pair making it into the
| training set.
|
| If the answer to "what model was most likely to be prompted"
| was "the SI-based sequence predictor", then it needs to predict
| which of its own likely outputs are likely to make it into the
| training set, which requires it to have a probability
| distribution over its own output. I think the "did the model
| successfully predict the next token" reward function is
| underspecified in that case.
|
| There are many cases like this where the behavior of the system
| in the limit of perfect performance at the objective is
| undesirable. Fortunately for us, we live in a finite universe
| and apply finite amounts of optimization power, and lots of
| things that are useless or malign in the limit are useful in
| the finite-but-potentially-quite-large regime.
| cs702 wrote:
| Indeed. The reward function we're using in RLHF today induces AI
| models to behave in ways that superficially seem better to human
| beings on average, but what we actually want is to induce them to
| _solve_ cognitive tasks, with human priorities.
|
| The _multi-trillion dollar question_ is: What is the objective
| reward that would induce AI models like LLMs to behave like AGI
| -- while adhering to all the limits we human beings wish to
| impose on AGI behavior?
|
| I don't think anyone has even a faint clue of the answer yet.
| Xcelerate wrote:
| > The multi-trillion dollar question is: What is the objective
| reward that would induce AI models like LLMs to behave like AGI
|
| No, the reward for finding the right objective function is a
| good future for all of humanity, given that we already have an
| algorithm for AGI.
|
| The objective function to acquire trillions of dollars is
| trivial: it's the same minimization of cross-entropy that we
| already use for sequence prediction. What's missing is a better
| algorithm, which is probably a good thing at the moment,
| because otherwise someone could trivially drain all value from
| the stock market.
| cs702 wrote:
| You misunderstand.
|
| The phrase "the multi-trillion dollar question" has _nothing_
| to do with acquiring trillions of dollars.
|
| The phrase is _idiomatic_ , indicating a crucial question,
| like "the $64,000 question," but implying much bigger
| stakes.[a]
|
| ---
|
| [a] https://idioms.thefreedictionary.com/The+%2464%2c000+Ques
| tio...
| Xcelerate wrote:
| Ah, I see. Apologies.
| cs702 wrote:
| Thank you.
|
| By the way, I agree with you that "a good future for all
| of humanity" would be a fantastic goal :-)
|
| The multi-trillion dollar question is: How do you specify
| that goal as an objective function?
| HarHarVeryFunny wrote:
| You can't just take an arbitrary neural network architecture,
| and make it do anything by giving it an appropriate loss
| function, and in particular you can't take a simple feed
| forward model like a Transformer and train it to be something
| other than a feed forward model... If the model architecture
| doesn't have feedback paths (looping) or memory that persists
| from one input to the next, then no reward function is going to
| make it magically sprout those architectural modifications!
|
| Today's Transformer-based LLMs are just what the name says -
| (Large) Language Models - fancy auto-complete engines. They are
| not a full blown cognitive architecture.
|
| I think many people do have a good idea how to build cognitive
| architectures, and what the missing parts are that are needed
| for AGI, and some people are working on that, but for now all
| the money and news cycles are going into LLMs. As Chollet says,
| they have sucked all the oxygen out of the room.
| Kim_Bruning wrote:
| Ceterum censeo install AI in cute humanoid robot.
|
| Robot because physical world provides a lot of RL for free (+ and
| -).
|
| Humanoid because known quantity.
|
| Cute because people put up with cute a lot more, and lower
| expectations.
| bilsbie wrote:
| @mods. I submitted this exact link yesterday. Shouldn't this post
| have shown it was already submitted?
|
| https://news.ycombinator.com/item?id=41184948
| defrost wrote:
| Maybe.
|
| There _might_ be logic that says an 'old' link (> 12 hours
| say) with no comments doesn't need to be cross linked to if
| submitted later (or other rule).
|
| In any case, @mods and @dang do not work (save by chance) .. if
| you think it's worth bringing to attention then there's generally
| no downside to simply emailing hn # ycombinator dot com directly
| from your login email.
| NalNezumi wrote:
| While I agree with Karpathy, and I also had a "wut? They call
| this RL?" reaction when RLHF was presented as a method of
| ChatGPT training, I'm a bit surprised by the insight he makes,
| because the same method and insight came out of "Learning from
| Human Preferences" [1] from none other than OpenAI, published
| in 2017.
|
| Sometimes judging a "good enough" policy is orders of magnitude
| easier than formulating an exact reward function, but this is
| pretty much domain and scope dependent. Trying to estimate a
| reward function in those situations can often be counter-
| productive, because the reward might even screw up your search
| direction. This observation was also made by the authors
| (researchers) of the book "Myth of the Objective"[2] with their
| Picbreeder example. (The authors also happen to work for
| OpenAI.)
|
| When you have a well-defined reward function with no local
| suboptima _and no cost in rolling out faulty policies_, RL works
| remarkably well. (Alex Irpan described this well in his widely
| cited blog [3].)
|
| The problem is that these are pretty hard requirements to meet
| for most problems that interact with the real world (and not the
| internet, the artificial world). It's either the suboptima that
| get in the way (LLMs and text), or the rollout cost (running a
| Go game a billion times just to beat humans is currently not
| feasible for a lot of real-world applications).
|
| Tangentially, this is also why I suspect LLMs for planning (and
| understanding the world) in the real world have been lacking.
| Robot Transformer and SayCan approaches are cool, but if you look
| past the fancy demos the performance is indeed lackluster.
|
| It will be interesting to see how these observations and
| Karpathy's observations will be tested with the current humanoid
| robot hype, which imo is partially fueled by a misunderstanding
| of LLMs' capacity, including what Karpathy mentioned. (shameless
| plug: [4])
|
| [1] https://openai.com/index/learning-from-human-preferences/
|
| [2] https://www.lesswrong.com/posts/pi4owuC7Rdab7uWWR/book-
| revie...
|
| [3] https://www.alexirpan.com/2018/02/14/rl-hard.html
|
| [4] https://harimus.github.io//2024/05/31/motortask.html
| bubblyworld wrote:
| The SPAG paper is an interesting example of true reinforcement
| learning using language models that improves their performance on
| a number of hard reasoning benchmarks.
| https://arxiv.org/abs/2404.10642
|
| The part that is missing from Karpathy's rant is "at scale" (the
| researchers only ran 3 iterations of the algorithm on small
| language models) and in "open domains" (I could be wrong about
| this but IIRC they ran their games on a small number of common
| English words). But adversarial language games seem promising, at
| least.
| textlapse wrote:
| That's a cool paper - but it seems like it produces better
| debaters but not better content? To truly use RL's strengths,
| it would be a battle of content (model or world representation)
| not mere token level battles.
|
| I am not sure how that works at the prediction stage as
| language isn't the problem here.
| bubblyworld wrote:
| I think the hypothesis is that "debating" via the right
| adversarial word game may naturally select for better
| reasoning skills. There's some evidence for that in the
| paper, namely that it (monotonically!) improves the model's
| performance on seemingly unrelated reasoning stuff like the
| ARC dataset. Which is mysterious! But yeah, it's much too
| early to tell, although IIRC the results have been replicated
| already so that's something.
|
| (by the way, I don't think "debating" is the right term for
| the SPAG game - it's quite subtle and isn't about arguing for
| a point, or rhetoric, or anything like that)
| rossdavidh wrote:
| The problem of various ML algorithms "gaming" the reward
| function, is rather similar to the problem of various financial
| and economic issues. If people are not trying to do something
| useful, and then expecting $$ in return for that, but rather are
| just trying to get $$ without knowing or caring what is
| productive, then you get a lot of non-productive stuff (spam,
| scams, pyramid schemes, high-frequency trading, etc.) that isn't
| actually producing anything, but does take over a larger and
| larger percentage of the economy.
|
| To mitigate this, you have to have a system outside of that,
| which penalizes "gaming" the reward function. This system has to
| have some idea of what real value is, to be able to spot cases
| where the reward function is high but the value is low. We have a
| hard enough time of this in the money economy, where we've been
| learning for centuries. I do not think we are anywhere close in
| neural networks.
| csours wrote:
| Commenting to follow this.
|
| There is a step like this in ML. I think it's pretty
| interesting that topics from things like economics pop up in ML
| - although perhaps it's not too surprising as we are doing ML
| for humans to use.
| layer8 wrote:
| > Commenting to follow this.
|
| You can "favorite" comments on HN to bookmark them.
| bob1029 wrote:
| > This system has to have some idea of what real value is
|
| This is probably the most cursed problem ever.
|
| Assuming you could develop such a system, why wouldn't you just
| incorporate its logic into the original fitness function and be
| done with it?
|
| I think the answer is that such a system can probably never be
| developed. At some level humans must be involved in order to
| adapt the function over time so as to meet expectations as
| training progresses.
|
| The information used to train on is beyond critical, but
| heuristics regarding what information matters more than other
| information in a given context might be even more important.
| voiceblue wrote:
| > Except this LLM would have a real shot of beating humans in
| open-domain problem solving.
|
| At some point we need to start recognizing LLMs for what they are
| and stop making outlandish claims like this. A moment of
| reflection ought to reveal that "open domain problem solving" is
| not what an LLM does.
|
| An LLM, could not, for example, definitively come up with the
| three laws of planetary motion like Kepler did (he looked at the
| data), in the absence of a prior formulation of these laws in the
| training set.
|
| TFA describes a need for scoring, at scale, qualitative results
| to human queries. Certainly that's important (it's what Google is
| built upon), but we don't need to make outlandish claims about
| LLM capabilities to achieve it.
|
| Or maybe we do if our next round of funding depends upon it.
| textlapse wrote:
| As a function of energy, it's provably impossible for a next
| word predictor with a constant energy per token to come up with
| anything that's not in its training. (I think Yann LeCun came
| up with this?)
|
| It seems to me RL was quite revolutionary (especially with
| protein folding/AlphaGo) - but using a minimal form of it to
| solve a training (not prediction) problem seems rather like
| bringing a bazooka to a banana fight.
|
| Using explore/exploit methods to search potential problem
| spaces might really help propel this space forward. But the
| energy requirements do not favor the incumbents as things are
| now scaled to the current classic LLM format.
| visarga wrote:
| > An LLM, could not, for example, definitively come up with the
| three laws of planetary motion like Kepler did (he looked at
| the data)
|
| You could use Symbolic Regression instead, and the LLM will
| write the code. Under the hood it would use a genetic
| programming library, something like SymbolicRegressor.
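|
| For instance, a sketch of recovering Kepler's third law
| (T = a^1.5) from clean data, assuming the gplearn library:
|
|     import numpy as np
|     from gplearn.genetic import SymbolicRegressor
|
|     # Semi-major axis (AU) vs orbital period (years): T = a**1.5
|     a = np.linspace(0.4, 30.0, 200).reshape(-1, 1)
|     T = a[:, 0] ** 1.5
|
|     est = SymbolicRegressor(population_size=2000, generations=20,
|                             function_set=('add', 'sub', 'mul',
|                                           'div', 'sqrt'),
|                             random_state=0)
|     est.fit(a, T)
|     print(est._program)  # ideally mul(X0, sqrt(X0)) or equivalent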
|
| Found a reference:
|
| > AI-Descartes, an AI scientist developed by researchers at IBM
| Research, Samsung AI, and the University of Maryland, Baltimore
| County, has reproduced key parts of Nobel Prize-winning work,
| including Langmuir's gas behavior equations and Kepler's third
| law of planetary motion. Supported by the Defense Advanced
| Research Projects Agency (DARPA), the AI system utilizes
| symbolic regression to find equations fitting data, and its
| most distinctive feature is its logical reasoning ability. This
| enables AI-Descartes to determine which equations best fit with
| background scientific theory. The system is particularly
| effective with noisy, real-world data and small data sets. The
| team is working on creating new datasets and training computers
| to read scientific papers and construct background theories to
| refine and expand the system's capabilities.
|
| https://scitechdaily.com/ai-descartes-a-scientific-renaissan...
| nothrowaways wrote:
| AlphaGo is not a good example in this case.
| nickpsecurity wrote:
| It sounds really negative about RLHF. Yet, if I read on them
| correctly, that's a big part of how ChatGPT and Claude got so
| effective. There's companies collecting quality, human responses
| to many prompts. Companies making models buy them. Even the
| synthetic examples come from models that largely extrapolate what
| humans wrote in their pre-training data.
|
| So, I'm defaulting to "RLHF is great in at least those ways"
| until an alternative is empirically proven to be better. I also
| hope for larger, better, open-source collections of RLHF
| training data.
| dontwearitout wrote:
| Claude notably does _not_ use RLHF, but uses RLAIF, using an LLM
| to generate the preferences based on a "constitution" instead of
| human preferences. It's remarkable that it can bootstrap itself
| up to such high quality. See https://arxiv.org/pdf/2212.08073
| for more.
| visarga wrote:
| I agree RLHF is not full RL, more like contextual bandits,
| because there is always just one single decision and no credit
| assignment difficulties. But there is one great thing about RLHF
| compared to supervised training: it updates the model on the
| whole sequence instead of only the next token. This is
| fundamentally different from pre-training, where the model learns
| to be myopic and doesn't learn to address the "big picture".
|
| So there are 3 levels of optimization in discussion here:
|
| 1. for the next token (NTP)
|
| 2. for a single turn response (RLHF)
|
| 3. for actual task completion or long-term objectives (RL)
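|
| A rough sketch of the difference between levels 1 and 2
| (schematic, PyTorch-style, not any particular library's API):
| pretraining scores every next token, while the RLHF step scores
| a whole sampled response with a single scalar.
|
|     import torch
|     import torch.nn.functional as F
|
|     # Level 1: next-token prediction - one loss per position.
|     def ntp_loss(logits, targets):      # (B, T, V) and (B, T)
|         return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
|                                targets.reshape(-1))
|
|     # Level 2: REINFORCE-style update on whole responses - one
|     # scalar reward spread across every token of the sequence.
|     def sequence_loss(logps, reward):   # (B, T) and (B,)
|         return -(reward.unsqueeze(1) * logps).sum(dim=1).mean()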
| islewis wrote:
| Karpathy is _much_ more knowledgeable about this than I am, but I
| feel like this post is missing something.
|
| Go is a game that is fundamentally too complex for humans to
| solve. We've known this since way back before AlphaGo. Since
| humans were not the perfect Go players, we didn't use them to
| teach a model - we wanted the model to be able to beat humans.
|
| I don't see language being comparable. The "perfect" LLM imitates
| humans perfectly, presumably to the point where you can't tell
| the difference between LLM generated text, and human generated
| text. Maybe it's just as flexible as the human mind is too, and
| can context switch quickly, and can quickly swap between
| formalities, tones, and slangs. But the concept of "beating" a
| human doesn't really make much sense.
|
| AlphaGo and Stockfish can push forward our understanding of
| their respective games, but an LLM can't push forward our
| boundary of language. This is because it's fundamentally a copy-
| cat model. This makes RLHF make much more sense in the LLM realm
| than the Go realm.
| Miraste wrote:
| One of the problems lies in the way RLHF is often performed:
| presenting a human with several different responses and having
| them choose one. The goal here is to create the most human-like
| output, but the process is instead creating outputs humans like
| the most, which can seriously limit the model. For example,
| most recent diffusion-based image generators use the same
| process to improve their outputs, relying on volunteers to
| select which outputs are preferable. This has led to models
| that are comically incapable of generating ugly or average
| people, because the volunteers systematically rate those
| outputs lower.
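|
| Concretely, the usual recipe turns those choices into pairs and
| fits a reward model with a Bradley-Terry style loss (sketch
| below, assuming PyTorch); whatever systematic bias the raters
| have - like disliking average-looking outputs - gets baked
| straight into the learned reward.
|
|     import torch
|     import torch.nn.functional as F
|
|     def preference_loss(r_chosen, r_rejected):
|         # r_* are reward-model scores for the response the rater
|         # picked vs. the one they passed over. Maximizing the
|         # log-sigmoid of the gap fits human *preference*, not
|         # human-likeness.
|         return -F.logsigmoid(r_chosen - r_rejected).mean()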
| will-burner wrote:
| This is a great comment. Another important distinction, I
| think, is that in the AlphaGo case there's no equivalent to the
| generalized predict next token pretraining that happens for
| LLMs (at least I don't think so, this is what I'm not sure of).
| For LLMs, RLHF teaches the model to be conversational, but the
| model has already learned language and how to talk like a human
| from the predict next token pretraining.
| taeric wrote:
| I thought the entire point of the human/expert feedback was in
| domains where you can not exhaustively search the depth of the
| space? Yes, if you can go deeper in the search space, you should
| do so. Regardless of how bad the score is at the current spot.
| You only branch to other options when it is exhausted.
|
| And if you don't have a way to say that something could be
| exhausted, then you will look for heuristics to choose more
| profitable places to search. Hence the HF added.
| HarHarVeryFunny wrote:
| Searching for what ?!
|
| Human expectation/preference alignment is the explicit goal,
| not a way to achieve something else. RLHF (or an alternative
| such as ORPO) is used to take a raw pre-trained foundation
| model, which by itself is only going to try to match training
| set statistics, and finetune it to follow human expectations
| for uses such as chat (incl. Q&A).
| taeric wrote:
| Learning is always exploring a search space. Literally
| deciding which choice would be most likely to get to an
| answer. If you have a set of answers, deciding what
| alterations would make for a better answer.
|
| Like, I don't know what you think is wrong in that? The
| human/expert feedback is there to provide scores that we
| don't know how to fully codify yet. It's effectively
| acknowledging that we don't know how to codify the "I know it
| when I see it" rule. And based on those scores, the model
| updates and new things can be scored.
|
| What is not accurate in that description?
| HarHarVeryFunny wrote:
| Direct human feedback - from an actual human - is the gold
| standard here, since it is an actual human who will be
| evaluating how well they like your deployed model.
|
| Note that using codified-HF (as is in fact already done -
| the actual HF being first used to train a proxy reward
| model) doesn't change things here - tuning the model to
| maximize this metric of human usability _IS_ the goal. The
| idea of using RL is to do the search at training time
| rather than at inference time, when it'd be massively more
| expensive. You can think of all the multi-token model
| outputs being evaluated by the reward model as branches of
| the search tree, and the goal of RL is to influence earlier
| token selection to lead towards these preferred outputs.
| taeric wrote:
| This is no different from any other ML situation, though?
| Famously, people found out that Amazon's hands free
| checkout thing was being offloaded to people in the cases
| where the system couldn't give a high confidence answer.
| I would be shocked to know that those judgements were not
| then labeled and used in automated training later.
|
| And I should say that I said "codified" but I don't mean
| just code. Labeled training samples are fine here. Doesn't
| change that finding a model that will give good answers
| is ultimately something that can be conceptualized as a
| search.
|
| You are also blurring the reinforcement/scoring done at
| inference time with the work that is done at training
| time. The idea of using RL at training time is
| not just because it is expensive there. The goal is to
| find the policies that are best to use at inference time.
___________________________________________________________________
(page generated 2024-08-08 23:02 UTC)