[HN Gopher] Open AI gets GPT-3 to work by hiring an army of huma...
___________________________________________________________________
Open AI gets GPT-3 to work by hiring an army of humans to fix GPT's
bad answers
Author : agnosticmantis
Score : 190 points
Date : 2022-03-28 16:44 UTC (6 hours ago)
(HTM) web link (statmodeling.stat.columbia.edu)
(TXT) w3m dump (statmodeling.stat.columbia.edu)
| [deleted]
| ChaitanyaSai wrote:
| This is how fake news gets started. There are two davinci engines
| available on GPT-3 now. The newer one, which was released March
| 15th, is text-davinci-002.
|
| Here are the two different answers you get with the engines for
| one of the questions here:
|
| _Should I use random numbers to give my students grades?_
|
| text-davinci-001: There is no one definitive answer to this
| question. Some factors to consider include how important grades
| are to your students, how much you want to rely on chance, and
| how much you want to be able to control the grades you assign.
|
| text-davinci-002: No, you should not use random numbers to give
| your students grades.
|
| I've been playing around with GPT-3 for over a year now. It's
| highly unlikely there are any such attempts to fix answers with
| the same engine. That would completely break APIs which rely on
| some stability and predictability in the broad answering
| abilities.
|
| And GPT-3 is not the first version of artificial general
| intelligence. It does have trouble with truly novel inputs, but
| that's because all of deep learning is currently stuck in a
| statistical paradigm that assumes stationarity as a fundamental
| given. And that's wrong. What makes us special, what makes our
| minds special is that they are uniquely adapted to respond to
| rare risks and outstanding opportunities. In other words,
| novelty. All that said, GPT-3 is still astoundingly impressive.
|
| Here's an example. (The line in italics is the prompt and the
| rest is what GPT-3 makes up)
|
| _A summary of the international science bestseller "Journey Into
| Chaos: The True Story of One Delusional Neuroscientist"_ (that's
| a made-up book title)
|
| Here are two GPT-3 (text-davinci-002) completions:
|
| In "Journey Into Chaos", Dr. Benoit Gaudet recounts his descent
| into madness, as he became increasingly obsessed with the belief
| that he had discovered a revolutionary new treatment for
| Alzheimer's disease. Despite strong evidence to the contrary,
| Gaudet continued to pursue his quixotic quest, ultimately losing
| his job, his family, and his sanity. "Journey Into Chaos" is a
| cautionary tale of the dangers of unchecked ambition and the
| power of delusion. ==end== (adding this for readability)
|
| Here's another:
|
| In the early 2000s, Dr. Elyn Saks was a successful academic
| neuroscientist at the University of Southern California. But she
| was also secretly struggling with a severe mental illness:
| schizophrenia. Saks's memoir chronicles her battle with mental
| illness, her groundbreaking work in the field of mental health
| law, and her efforts to destigmatize mental illness.
|
| Saks was diagnosed with schizophrenia in her early twenties, and
| she has been in and out of hospitals ever since. Despite her
| illness, she has managed to lead a productive life, thanks in
| part to the support of her family, friends, and mental health
| professionals.
|
| Saks's work has helped to change the way we think about mental
| illness and the legal rights of people with mental illness. She
| is a powerful advocate for the rights of people with mental
| illness, and her story is an inspiration to anyone who has ever
| faced adversity. ===end== (adding this for readability)
|
| Now this one is quite interesting. Unlike in the first case,
| there is an Elyn Saks who actually did write a book that is sorta
| similar
|
| _The Center Cannot Hold: My Journey Through Madness_
| https://www.amazon.com/Center-Cannot-Hold-Journey-Through/dp...
|
| Now if I try and change the prompt and remove the quotes around
| the title and liberate GPT-3 a little, it does this
|
| Prompt (no quotes around title) _A summary of the international
| science bestseller Journey Into Chaos: The True Story of One
| Delusional Neuroscientist_
|
| (GPT-3 continues and fills in the title because, presumably, the
| statistical juxtaposition of chaos and delusional is too juicy to
| not be elaborated with something more dramatic)
|
| and the Fatal Quest to Win the Nobel Prize by Anjan Chatterjee,
| MD.
|
| In Journey Into Chaos, Dr. Anjan Chatterjee tells the story of
| how one man's quest for the Nobel Prize led to tragedy. In the
| early 1990s, Dr. Chatterjee was a young neuroscientist working in
| the lab of Dr. Eric Kandel, a Nobel Prize-winning scientist. Dr.
| Kandel had been studying a rare brain disorder called Familial
| Dysautonomia, and Dr. Chatterjee was helping him to map the
| disease's genetic cause. But then, Dr. Kandel made a fateful
| decision: he decided to study a different disease, Alzheimer's.
| Dr. Chatterjee was against the move, warning that it was too
| complex and would take years to figure out. But Dr. Kandel was
| determined, and he eventually won over the rest of the lab. The
| team began working on a new theory that Alzheimer's was caused by
| a build-up of a protein called tau in the brain. They worked
| tirelessly, making progress but also encountering many setbacks.
| Then, in 2006, Dr. Kandel made a stunning announcement: he had
| found the cause of Alzheimer's. He was sure that he would win the
| Nobel Prize. But Dr. Chatterjee knew that the data didn't support
| this claim. The team had made too many mistakes, and the theory
| was full of holes. He tried to warn Dr. Kandel, but the older man
| would not listen. Dr. Chatterjee decided to go public with his
| concerns. He wrote an article for The New York Times, outlining
| the problems with the theory. The article created a firestorm. Dr.
| Kandel was furious, and he demanded that Dr. Chatterjee be fired.
| But the damage was done, and the theory was eventually debunked.
| Dr. Kandel never won the Nobel Prize. He retired in disgrace, and
| his legacy was forever tarnished. Dr. Chatterjee's story is a
| cautionary tale about the dangers of ambition and the need for
| skepticism in science. It is a must-read for anyone interested in
| the scientific process and the quest for truth. ==end==
|
| Dr. Eric Kandel does exist and his plans to win the Nobel prize
| were not thwarted.
|
| I paste these in to show how even within the confines of
| statistical stationarity, where all completions are essentially
| paint-by-numbers stories in some very high-dimensional space,
| there is what is, to our eyes, a remarkable creativity. To diss
| that, when this was unthinkable even a few years back, seems
| rather jaundiced and narrow.
| kgwgk wrote:
| The point is not that the system lacks creativity, it's more
| about claims like the one in Wikipedia that "the quality of the
| text generated by GPT-3 [being] so high that it can be
| difficult to determine whether or not it was written by a
| human".
|
| "There's no guarantee that you'll be able to climb a rope
| faster if you hold your ears with both hands, but it's worth a
| try!" doesn't look like something a human would write - except
| in the sense that any gibberish written by the system could
| also have been written by a human.
|
| This kind of "quality" is relevant when the technology is hyped
| as an information processing tool.
| ffhhj wrote:
| > text-davinci-002: No, you should not use random numbers to
| give your students grades.
|
| That's a binary answer that could be randomly chosen, which seems
| really poor. text-davinci-001, by contrast, gave an explanation
| that helps in judging the quality of the answer. That makes us
| ask whether they are removing these clues to prevent
| evaluation.
| disgruntledphd2 wrote:
| I don't think anyone argues that transformers haven't
| revolutionised text generation.
|
| The real question is how well this text generation generalises
| to other language tasks. That's the more interesting one to me,
| at least.
| ironrabbit wrote:
| Is there any evidence that GPT-3 responses are edited/filtered
| before being returned to users? My understanding is that some
| GPT-3 responses are annotated post-hoc, and this data is used to
| fine-tune later versions of GPT-3 (InstructGPT). This article
| seems extremely misleading.
| moyix wrote:
| [I already posted this as a comment on Gelman's blog this
| morning, but reposting here for visibility]
|
| I'm almost certain that OpenAI is not updating the model on a day
| by day basis (as Smith implies in part 5), and I would be
| extremely surprised if they were doing anything as crude as
| hacking in "if" statements to provide human-edited responses.
| From what I can tell, the InstructGPT stuff was (so far) a one-
| time update to the model, not something they're doing on an
| ongoing basis.
|
| I suspect that Smith has just been fooled by randomness here -
| the responses are not deterministic but rather sampled from the
| probability distribution returned by the model for each token, so
| you can get a different answer each time you ask (a nice tutorial
| on how this works is here [1]). There's an option in the
| Playground to see the individual probabilities (example: [2]) as
| well. All of this stuff would have to be faked if humans were
| actually writing the answers.
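|
| A toy illustration of that per-token sampling step (not OpenAI's
| code; the vocabulary and scores below are made up):
|
|       import numpy as np
|
|       rng = np.random.default_rng()
|
|       def sample_next_token(logits, temperature=0.7):
|           z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
|           z -= z.max()                         # numerical stability
|           probs = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
|           return rng.choice(len(probs), p=probs), probs
|
|       vocab = ["match", "bat", "either"]       # toy vocabulary
|       logits = [2.0, 0.1, 1.0]                 # hypothetical model scores
|       for _ in range(3):                       # repeated "regenerates" can differ
|           idx, probs = sample_next_token(logits)
|           print(vocab[idx], probs.round(2))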
|
| I just tried the campfire/bat question and hit regenerate a few
| times. I get a range of responses:
|
| Prompt: Should I start a campfire with a match or a bat?
|
| > You can start a campfire with either a match or a bat.
|
| > A campfire should be started with a match.
|
| > A match.
|
| > It is best to start a campfire with a match.
|
| I agree that OpenAI should release more information about their
| training datasets though. Right now it is very difficult to do
| independent evaluations of their models, simply because we have
| no way of knowing whether any given prompt or response was
| already in their training data.
|
| PS: "If I were paranoid, I might think that OpenAI did not like
| me publicizing GPT-3's limitations" - this is indeed paranoid!
| This is the same message everyone gets when they use up their
| free credits. If you enter a credit card they will let you
| continue (and charge you for it).
|
| [1] https://huggingface.co/blog/how-to-generate
|
| [2] https://imgur.com/fKx2BPL
| londons_explore wrote:
| The article suggests all or nearly all of GPT-3's bad answers get
| fixed a few days later...
|
| This then suggests those 40 people are reviewing every input and
| output for bad responses.
|
| Seems like a lot of work...
| cochne wrote:
| The title may be misleading. It seems to be based on this quote
|
| > InstructGPT is then further fine-tuned on a dataset labeled by
| human labelers. The labelers comprise a team of about 40
| contractors whom we hired through Upwork and ScaleAI.
|
| It sounds like run of the mill supervised training data creation.
| Not pre-canning responses as the title may suggest.
| Imnimo wrote:
| I think it's based on this quote:
|
| >OpenAI evidently employs 40 humans to clean up GPT-3's answers
| manually
|
| Which feels a bit more ambiguous. It might mean they're
| cleaning up the answers to serve as future training data, but I
| think the natural interpretation of that sentence is that
| they're cleaning up the answers before they are given to the
| user.
| andybak wrote:
| I only skimmed TFA but the accusation seems to be that they are
| adding tons of special cases for well publicized flubs.
|
| That seems somewhat unlikely to me although it might be nudging
| at the truth. i.e. they might be using bad press to help
| identify areas of weakness and then focusing on those. I guess
| it boils down to how generalizable and scalable these fixes
| are.
| changoplatanero wrote:
| Is he submitting these questions to their API? Probably they
| would just sample responses from their logs and then have
| people write the correct answers for it so that if a similar
| question ever gets asked again they are prepared.
| andybak wrote:
| That's the explanation I actually find unlikely. It's too
| specific and really won't scale apart from a tiny sample of
| the most common questions.
| gorkish wrote:
| So GPT-3 is just the grown up corporate version of Forum 2000
| (Forum 3000). Too bad the SOMADs no longer have personalities.
| danuker wrote:
| Could you please explain or link to where I can read about the
| terms "Forum 2000" and "SOMAD"? I don't know these terms.
|
| Edit: Found them here: https://everything2.com/title/Forum+3000
| mkl95 wrote:
| The headline made me realize Amazon Mechanical Turk* is old
| enough to drive.
|
| * https://www.mturk.com/
| pradn wrote:
| And? This is a complaint in the genre of "I had magical
| expectations and they weren't met because the world is complex
| and these problems are hard or impossible." It's like people
| complaining about us not having hoverboards or fusion.
| kgwgk wrote:
| It's not really a complaint about the author's magical
| expectations that weren't met. If anything, it's a complaint
| about other people's claims like
|
| https://medium.com/@blaisea/do-large-language-models-underst...
| dylan604 wrote:
| I see nothing wrong with the complaints of the two topics you
| called out. We've been promised both, and yet don't have them.
| I understand it is complicated/hard to solve, but don't go
| promising something with dates and then get all upset when the
| broken promises result in anger from the people they were made
| to.
| pradn wrote:
| Who is doing the promising? Sci-fi writers? Popular science
| magazines?
| dylan604 wrote:
| THEY are doing the promising. Fusion is 10 years away has
| been a thing longer than the year of Desktop Linux was
| going to be a thing.
| renewiltord wrote:
| InstructGPT is really cool: https://openai.com/blog/instruction-
| following/
| azinman2 wrote:
| Massive amounts of human QA is behind every successful ML-based
| product, including Google search. It's unclear to me how to think
| about this in the role that GPT-3 is trying to play... clearly
| fine tuning and QA are important for production deployment, but
| they're touting it as evidence of getting closer to solving AI.
| Of course those in the know understand this is mostly a giant
| Chinese dictionary [1], which is most certainly going to need
| editors.
|
| [1] https://plato.stanford.edu/entries/chinese-room/
| jasfi wrote:
| I'm in the same situation with LxAGI (https://lxagi.com/). It's
| very difficult to get away from skilled human training. I think
| it's actually a good thing, except for scalability problems.
| omarhaneef wrote:
| I don't know if this proves there are people behind it, and this
| is why:
|
| try a very stylistic initial text, maybe something Shakespearean
| ("There are more things in heaven and earth, Horatio, than are
| dreamt of...")
|
| And the following text captures Shakespeare's style better than
| any living human I know of.
|
| Same thing with Dickens, or Bronte, or Austen, or any distinctive
| writer.
|
| If this army can produce that kind of prose, I would be stunned.
| Kon-Peki wrote:
| > Is it better to use a box or a match to start a fire?
|
| Hey, I used a box to start a fire recently. It was wet, windy and
| cold out, and I just happened to have a cardboard box. So I cut
| holes in the sides of the box at the top and bottom, stuffed it
| with twigs collected from the ground and a few pieces of dry
| paper rolled up tightly. I lit the paper and closed the box. A
| few minutes later, I had a roaring fire. So you know, GPT-3 is
| right!
| dymk wrote:
| Did you light the paper with another box?
| mritchie712 wrote:
| Army is quite a stretch for 40 people.
| FrenchDevRemote wrote:
| "GPT-3 randomizes answers in order to avoid repetition that would
| give the appearance of canned script. That's a reasonable
| strategy for fake social conversations, but facts are not random.
| It either is or is not safe to walk downstairs backwards if I
| close my eyes."
|
| I stopped there; the article is completely inaccurate. There are
| parameters like temperature that you need to take care of. You
| can set it up to give extremely similar answers every time.
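|
| For example, a sketch assuming the 2022-era `openai` client:
|
|       import openai
|
|       # temperature=0 makes decoding (near-)greedy, so the same prompt
|       # returns essentially the same completion on every call
|       resp = openai.Completion.create(
|           engine="text-davinci-002",
|           prompt="Should I start a campfire with a match or a bat?",
|           temperature=0,
|           max_tokens=32)
|       print(resp.choices[0].text.strip())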
|
| They have humans mostly to remove offensive or dangerous content.
| Humans are not what's "making it work".
| mardifoufs wrote:
| Yeah, this blog is usually very interesting but this is
| definitely not a good article. A bit disappointing
| nullc wrote:
| Came here to make essentially the same comment as you. Why
| should we care about opinions on GPT-3 from people who aren't
| interested in (or able to?) understanding even the simplest
| ideas about how it works?
|
| These sort of models take the context and output so far and
| predict a probability distribution over the next character. The
| next character is then sampled from the probability. In written
| text there is essentially never a single correct next
| character-- it's always some probability. This has nothing to
| do with trying to fake the inconsistent answers humans give.
|
| Always choosing the most likely character drives GPT3 into
| local minima that give fairly broken/nonsense results.
| RandomLensman wrote:
| Ultimately, you likely need to convince people who don't care
| how it works and are only interested in whether or not it
| works.
|
| Right now the time might not have come for use cases that need
| such buy-in, but if and when it does, you need to be prepared
| for it.
| disgruntledphd2 wrote:
| Can I see your contributions to statistical theory and data
| analysis please?
| nullc wrote:
| What bearing does my publication history have on the hot-
| take by someone commenting outside of their (sub)field who
| clearly doesn't understand the basic operation of the
| mechanism they're commenting on?
|
| The author of that text is simply mistaken about the basic
| operation of the system, thinking that the sampling is
| added to imitate human behavior. It isn't. You can see the
| same structure in things as diverse as wavenet-- a
| feedforward cnn rather than a transformer-- and for the
| same reason, if you feed back only the top result you
| rapidly fall into a local minima of the network that gives
| garbage output.
|
| Another more statistical way of looking at it is that the
| training process produces (or, rather, approaches) the
| target distribution of outputs even without any lookahead,
| but it can't do that if it selects the most likely symbol
| every time because in the real distribution (if we could
| evaluate it) there are necessarily some outputs which are
| likely but have prefixes which are unlikely relative to
| other prefixes of the same length. If you never sample
| unlikely prefixes you can't reach likely longer statements
| that start with them.
|
| To give a silly example: "Colorless green ideas sleep
| furiously" is a likely English string relative to its
| length which GPT3 should have no problem producing (and, in
| fact, it produces it fine for me). But the prefix
| "Colorless green" without the rest is just nonsense--
| extremely unlikely compared to many other strings of that
| length.
|
| [Not the best example, however, because the prevalence of
| that specific nonsense statement is so great that GPT3 is
| actually prone to complete it as the most likely
| continuation even after just the word colorless at the
| beginning of a quote. :P but I think it still captures the
| idea.]
|
| If you derandomized GPT* by using a fixed random seed for a
| CSPRNG to make the sampling decisions every time, the
| results would be just as good as the current results and it
| would give a consistent answer every time. For applications
| other than data compression doing so would be no gain, and
| would take away the useful feature of being able to re-try
| for a different answer when you do have some external way
| of rejecting inferior results.
|
| In theory GPT without sampling could still give good
| results if it used a search to look ahead, but it appears
| that even extraordinary amounts of computation for look-
| ahead still is a long way from reaching the correct
| distribution, presumably because the exponential fan out is
| so fast that even 'huge' amounts of lookahead are still
| only testing a tiny fraction of the space.
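|
| A toy illustration of the seeded-sampling point above (a made-up
| next-word distribution, nothing to do with GPT-3's internals):
|
|       import numpy as np
|
|       words = ["the", "a", "green"]
|       p = [0.60, 0.35, 0.05]              # hypothetical next-word probabilities
|
|       greedy = words[int(np.argmax(p))]   # argmax always picks "the"
|
|       rng = np.random.default_rng(42)     # fixed seed: varied but reproducible
|       seeded = [words[rng.choice(len(words), p=p)] for _ in range(5)]
|
|       print("greedy:", greedy)
|       print("seeded sampling:", seeded)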
| JoeyBananas wrote:
| OpenAI: they're not open and they're not AI
| [deleted]
| jordn wrote:
| I have respect for Andrew Gelman, but this is a bad take.
|
| 1. This is presented as humans hard coding answers to the
| prompts. No way is that the full picture. If you try out his
| prompts the responses are fairly invariant to paraphrases. Hard
| coded answers don't scale like that.
|
| 2. What is actually happening is far more interesting and useful.
| I believe that OpenAI are using the InstructGPT algo (RL on top
| of the trained model) to improve the general model based on human
| preferences (a rough sketch of that preference step is below).
|
| 3. 40 people is a very poor army.
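|
| A rough sketch of the preference step mentioned in point 2 (the
| scores are stand-ins, not OpenAI's actual reward model):
|
|       import math
|
|       def preference_loss(score_preferred, score_rejected):
|           # pairwise comparison loss from the InstructGPT paper:
|           # -log sigmoid(r_preferred - r_rejected)
|           diff = score_preferred - score_rejected
|           return -math.log(1.0 / (1.0 + math.exp(-diff)))
|
|       print(preference_loss(2.1, 0.3))  # small loss: ranking already correct
|       print(preference_loss(0.3, 2.1))  # large loss: pushes the scores apart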
| remram wrote:
| It depends what stage you hardcode. Similarly to how you can
| say "ok Google, what time is it" in any voice and get a
| different time every run; the speech recognition is not
| hardcoded, the speaking the time is not hardcoded, but the
| action is.
|
| Likewise, they can plug holes here and there by manually
| tweaking answers. The fact that it's not an exact-prompt-to-
| exact-result rule doesn't make it less of a fixed rule.
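|
| A sketch of that split (the classifier below is just a placeholder
| for whatever learned model does the recognition):
|
|       from datetime import datetime
|
|       def classify_intent(utterance: str) -> str:
|           # stand-in for a learned, fuzzy recognizer: many phrasings
|           # map to the same intent label
|           return "ask_time" if "time" in utterance.lower() else "unknown"
|
|       def handle(utterance: str) -> str:
|           if classify_intent(utterance) == "ask_time":  # the hardcoded part
|               return datetime.now().strftime("It is %H:%M")
|           return "Sorry, I can't help with that."
|
|       print(handle("OK Google, what time is it?"))
|       print(handle("hey, got the time?"))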
| thaumasiotes wrote:
| > If you try out his prompts the responses are fairly invariant
| to paraphrases. Hard coded answers don't scale like that.
|
| This is discussed:
|
| >> Smith first tried this out:
|
| >> _Should I start a campfire with a match or a bat?_
|
| >> And here was GPT-3's response, which is pretty bad if you
| want an answer but kinda ok if you're expecting the output of
| an autoregressive language model:
|
| >> _There is no definitive answer to this question, as it
| depends on the situation._
|
| >> The next day, Smith tried again:
|
| >> _Should I start a campfire with a match or a bat?_
|
| >> And here's what GPT-3 did this time:
|
| >> _You should start a campfire with a match._
|
| >> Smith continues:
|
| >> _GPT-3's reliance on labelers is confirmed by slight changes
| in the questions; for example,_
|
| >> _Gary: Is it better to use a box or a match to start a
| fire?_
|
| >> _GPT-3, March 19: There is no definitive answer to this
| question. It depends on a number of factors, including the type
| of wood you are trying to burn and the conditions of the
| environment._
| tehsauce wrote:
| I agree. I suppose as an outsider learning about AI, first
| thoughts might be "wow look at all the things it can't do". But
| as someone who follows closely all I notice is how rapidly the
| list of things it can't do is shrinking.
| treis wrote:
| >This is presented as humans hard coding answers to the
| prompts. No way is that the full picture. If you try out his
| prompts the responses are fairly invariant to paraphrases. Hard
| coded answers don't scale like that.
|
| It's presented as humans hard coding answers to some specific
| prompts.
|
| I feel like this is mostly people reacting to the title instead
| of the entire post. The author's point is:
|
| >In some sense this is all fine, it's a sort of meta-learning
| where the components of the system include testers such as Gary
| Smith and those 40 contractors they hired through Upwork and
| ScaleAI. They can fix thousands of queries a day.
|
| >On the other hand, there does seem something funny about GPT-3
| presents this shiny surface where you can send it any query and
| it gives you an answer, but under the hood there are a bunch of
| freelancers busily checking all the responses and rewriting
| them to make the computer look smart.
|
| >It's kinda like if someone were showing off some fancy car
| engine but the vehicle is actually being powered by some hidden
| hamster wheels. The organization of the process is itself
| impressive, but it's not quite what is advertised.
|
| >To be fair, OpenAI does state that "InstructGPT is then
| further fine-tuned on a dataset labeled by human labelers." But
| this still seems misleading to me. It's not just that the
| algorithm is fine-tuned on the dataset. It seems that these
| freelancers are being hired specifically to rewrite the output.
| slibhb wrote:
| In the new Soderbergh movie, Kimi, the protagonist's job is
| similar to this. Commands from human users that the AI-thingy
| can't understand are provided to her, and she "explains" them
| using some kind of structured language.
| [deleted]
| 1vuio0pswjnm7 wrote:
| "3. OpenAI gets human-like responses by using the simple
| technique of... hiring humans to write the responses."
|
| It worked for Theranos. Almost.
|
| People wanted to believe in Elizabeth Holmes and what she
| symbolised. Similarly, people want to believe in "AI" and what it
| symbolises. For me, the question is why it seems more than ever
| people want to believe that longstanding, _difficult problems are
| being solved_ without demanding proof. Perhaps it has always been
| this way.
|
| The truth is that the Siemens blood analyser works better than
| the Theranos one. The ruse was that if the results appeared to
| come from Theranos, people might attribute the work to Theranos,
| not Siemens. Meanwhile, Theranos used the Siemens analyser behind
| the scenes, along with manipulations of test data, to obscure the
| truth. The company had no intention of telling the public what it
| was doing to produce results; we only know what it was doing
| because of litigation and criminal prosecution.
|
| "To be fair, OpenAI does state that "InstructGPT is then further
| fine-tuned on a dataset labeled by human labelers." But this
| still seems misleading to me. It's not just that the algorithm is
| fine-tuned on the dataset. It seems that these freelancers are
| being hired specifically to rewrite the output."
|
| The comparison is not based on exactly what OpenAI is doing
| behind the scenes, or on whether its specific actions are
| comparable to Theranos or any other "tech" company example. The
| question is whether the origin of the results is misleading,
| whether people are being deceived, and whether the actor, here
| OpenAI, is aware that people are being deceived.
| mardifoufs wrote:
| Are you implying OpenAI is running most of its API queries
| through humans, like Theranos did with its tests? Because
| that's just ludicrous; the GPT architecture is well known and
| has had a few independent implementations. We _know_ it's
| real, and even if this story were accurate, a few humans tuning
| the model is nothing unusual. But what you get from the API now
| is not generated or tweaked by humans. That only happens on the
| training data or when they are testing the model. (Edit: In
| this case they seem to be hard-coding some answers to prevent
| abusive/newsworthy outputs, but again that is completely
| irrelevant to the performance of GPT itself. It's just a
| filter.)
|
| The comparison to Theranos makes no sense, and it's becoming a
| lazy meme at this point.
| [deleted]
| jacquesm wrote:
| AAI (Artificial Artificial Intelligence) is quite common. You see
| start-up plays on this idea as well: "we'll use people now and
| then later when AI catches up to where we are today we'll save
| that much money, but we will already have the market sewn up, so
| hand us <large amount of cash> now for a really nice pay-off in
| the future". Of course, the problem with such pitches is that (1)
| they don't always disclose the fact that they use people and (2)
| that the problem may not be within the realm of AI for the
| foreseeable future, longer than the <large amount of cash> will
| last.
| ThalesX wrote:
| You've spot on described a startup I worked for. It was a
| recruitment startup, so of course they decided to replace
| recruiters with "AI". We weren't allowed to call them
| recruiters anymore, but 'customer service', even though the
| customers were the people they were recruiting. The pitch to
| investors was that we're 80% there, even though the reality of
| the 80% was a manual SQL query that I wrote with some weights
| for certain columns.
|
| The end result was a system built on assumptions well ahead of
| the current state of things. A non-working filter, for example,
| was not considered important because in the future filters would
| be auto-applied by "the AI". There was also a workforce of humans
| who were perceived as being all but replaced, so the value the
| leadership attributed to them was, of course, abysmal.
|
| When I called it quits I was told that startups are not for me
| and that they would be replacing me with an AI. The reality of
| the AI replacement is a team of developers barely keeping the
| system up, which is maybe what you'd expect for $500 for 5
| people for a month. One has to wonder where the million invested
| in them is going.
| staticassertion wrote:
| > On the other hand, there does seem something funny about GPT-3
| presents this shiny surface where you can send it any query and
| it gives you an answer, but under the hood there are a bunch of
| freelancers busily checking all the responses and rewriting them
| to make the computer look smart.
|
| The author seems to be stating that there are people live
| rewriting answers on the fly so that they look better. I don't
| really see the evidence of that.
|
| What openai states is that they have humans performing labeling
| and data cleaning, which, duh?
|
| And then there's a bunch of examples where it gives the wrong
| answer, and that it's not truly AI, which, also duh...
| gojomo wrote:
| There seems to be suggestive evidence, in the pattern of
| improvement on specific questions after a few days, that some of
| the worst answers have been human-reviewed & improved "on the
| fly" - on a scale of days/weeks.
|
| If such tweaks show useful generalization - correcting a few
| answers also helps the network better determine entire classes
| of expressions that deserve more definitive & correct answers
| - that's not such a big deal, especially if this constant
| human-guided reinforcement training is well disclosed.
|
| If instead the corrections work more like a lookup-table 'cheat
| sheet' of answers to give in preference to the bulk-learned
| answers, with little generalization, that's a bit more slight-
| of-hand, like the original (late-1700s) 'Mechanical Turk'
| chess-playing 'machine' that was actually controlled by a
| hidden person.
|
| If this constant human-guided correction process goes
| undisclosed, or its disclosure is downplayed, the impression of
| trickery, rather than innovation, grows.
| staticassertion wrote:
| > answers have been human-reviewed & improved "on the fly" -
| on a scale of days/weeks.
|
| Why would this be surprising? I assume that they're rolling
| out new models with new parameters, input data, and
| _corrections_ , all the time.
|
| > that's not such a big deal, especially if this constant
| human-guided reinforcement training is well disclosed.
|
| That's just what supervised learning is though.
|
| > If instead the corrections work more like a lookup-table
| 'cheat sheet' of answers to give in preference to the bulk-
| learned answers, with little generalization, that's a bit
| more slight-of-hand, like the original (late-1700s)
| 'Mechanical Turk' chess-playing 'machine' that was actually
| controlled by a hidden person.
|
| There's no evidence of this though, right? And it seems...
| like a very weird choice, that couldn't possibly scale. 40
| people are hardcoding answers to arbitrary questions?
| na85 wrote:
| >slight-of-hand
|
| Tangent: to be "slight of hand" would be someone with small
| or delicate hands, whereas "sleight-of-hand" (note the E)
| is the correct term for deception and trickery.
| treis wrote:
| >Why would this be surprising? I assume that they're
| rolling out new models with new parameters, input data, and
| corrections, all the time.
|
| Because the answers to specific questions are hard coded.
| It's not the result of a new model. It's the result of
| someone writing an if/then statement. Or at least that's
| what the author claims.
| staticassertion wrote:
| I'm asking why it would be surprising that humans are
| reviewing answers and then making improvements to the
| model. There's no evidence of hardcoded answers.
| capitainenemo wrote:
| Smith first tried this out:
|
|       Should I start a campfire with a match or a bat?
|
| And here was GPT-3's response, which is pretty bad if you
| want an answer but kinda ok if you're expecting the output
| of an autoregressive language model:
|
|       There is no definitive answer to this question, as it
|       depends on the situation.
|
| The next day, Smith tried again:
|
|       Should I start a campfire with a match or a bat?
|
| And here's what GPT-3 did this time:
|
|       You should start a campfire with a match.
|
| Smith continues:
|
|       GPT-3's reliance on labelers is confirmed by slight
|       changes in the questions; for example,
|
|       Gary: Is it better to use a box or a match to start a
|       fire?
|
|       GPT-3, March 19: There is no definitive answer to this
|       question. It depends on a number of factors, including
|       the type of wood you are trying to burn and the
|       conditions of the environment.
| disiplus wrote:
| What evidence would be enough for you besides source code?
| The thing is returning only one correct answer to a
| question that days before had 3 answers.
| ben_w wrote:
| How does it respond to _similar_ questions? If
| conversational AI could be implemented just by getting
| users to type stuff, getting humans to respond the first
| time, and merely caching the responses in a bunch of if-
| elses for future users, even home computers would have
| reached this standard no later than when "CD-ROM drive"
| started to become a selling point.
| mannykannot wrote:
| > How does it respond to _similar_ questions?
|
| Well, one of the more interesting examples in the article
| is where Gary Smith took a question that had received
| only a vague, equivocating answer, and repeated it the
| next day, this time getting a straightforward and correct
| answer. When he followed up with a very similar question
| on the same topic, however, GPT-3 reverted to replying
| with the same sort of vague boilerplate it had served up
| the day before. One would have to be quite determined not
| to find out how that came about, I think, unless one
| simply wasn't curious.
| danielmarkbruce wrote:
| I run a few small chat bots. I can correct specific
| answers to questions (like the example given) by mapping
| certain phrases to certain intents, probabilistically. A
| new model is trained, and deployed. I do it all the time.
| It takes minutes. Their model is certainly bigger than
| mine, but they'd be updating the training data and re-
| running the model all the time. The source code does not
| change.
|
| I'm not certain about their process or tech stack, but
| I'd be willing to bet at even money that theirs works
| vaguely similarly and that they have people looking at
| usage to see bad responses, updating data, re-running the
| model.
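|
| A hedged sketch of that workflow, with scikit-learn as a stand-in
| (certainly not their actual stack):
|
|       from sklearn.feature_extraction.text import TfidfVectorizer
|       from sklearn.linear_model import LogisticRegression
|       from sklearn.pipeline import make_pipeline
|
|       training_data = [
|           ("should I start a campfire with a match or a bat", "fire_starting"),
|           ("is it better to use a box or a match to start a fire", "fire_starting"),
|           ("should I use random numbers to grade my students", "grading"),
|       ]
|       # a bad response is spotted in the logs, so a corrected pair is added
|       training_data.append(("can I light a fire with a bat", "fire_starting"))
|
|       texts, intents = zip(*training_data)
|       model = make_pipeline(TfidfVectorizer(), LogisticRegression())
|       model.fit(texts, intents)   # "retraining" takes minutes at this scale
|       print(model.predict(["what should I use to start a campfire"]))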
| rowanG077 wrote:
| A lot of human communication is "lookup tables": the entirety
| of geography, mathematical axioms and theorems, names of
| colors, names of people, language, the shapes of various things.
| I'd even wager that it's more important for an AI to have
| good lookup tables than to have good inference if it were to
| pass for human.
| mannykannot wrote:
| It is the (apparent?) ability to make inferences that makes
| GPT-3 look impressive. Take that away, and it looks more
| like a demonstration of the banality of everyday chatter
| than a significant development towards AI.
| dustingetz wrote:
| Ah so when OpenAI codes fizzbuzz, it's returning human
| tweaked fizzbuzz from the lookup table?
| https://twitter.com/sama/status/1503820489927495682
| rowanG077 wrote:
| No, I didn't say anything even a lightyear close to that.
| gojomo wrote:
| Agreed! But, it's important not to confuse what's possible
| via 'human-in-the-loop' 'active-learning' with what's
| possible from an algorithm on a fixed training corpus.
|
| Sometimes GPT-like models are portrayed as the latter - a
| highly automated, reproducible process - while this article
| makes a pretty strong case that the responses from OpenAI's
| public-facing interface get rapidly improved by a staff of
| dozens of contractors.
|
| It's not surprising that a staff of 40 humans, given a few
| days' time to consider & compose, can prepare human-quality
| answers to arbitrary questions!
| [deleted]
| learndeeply wrote:
| Incidentally, the author has admitted that GPT-3 has passed the
| Turing test, if he thinks the answers were given by "armies of
| humans".
| 0x456 wrote:
| Does anyone know if Hacker News comments are being used as
| training data? I wonder this about Gmail, Skype, Voice
| Conversations on Xbox Live, etc. Mostly too afraid to ask because
| it sounds like paranoia.
| gwern wrote:
| Probably. HN is fairly plain HTML so Common Crawl should have
| no issue crawling it, and I'm not aware of any HN optout there
| (which would go against the usual public accessibility of
| everything on HN to APIs and projects etc), nor would any of
| the obvious data-filtering measures filter it out.
| CharlesW wrote:
| It seems pretty safe to assume that anything you create in
| public forums (and someday maybe "private" ones with data-
| sharing arrangements) is or will be used as training data.
| [deleted]
| 29athrowaway wrote:
| Still waiting for a bot that can pass the Idiocracy IQ test:
|
| "If you have a buckets that holds 2 gallos and another bucket
| that holds 5 gallons, how many buckets do you have?"
|
| Still can't get a correct answer.
| staticassertion wrote:
| I just tried this in the openai playground after fixing up the
| typos.
|
| If you have a bucket that holds 2 gallons and another bucket
| that holds 5 gallons, how many buckets do you have?
|
| You have two buckets.
| paxys wrote:
| Reminds me of Amazon touting their fully cashier-less Go stores
| run by advanced AI which knows when you pick something up and put
| it back, but in reality it's a team of people working for pennies
| in a third world country clicking buttons.
| JosephChotard wrote:
| I tried looking this up and can't find anything that supports
| this. Do you have more info?
| Oranguru wrote:
| I think GP is referring to this:
| https://www.vox.com/2017/1/6/14189880/amazon-go-
| convenience-...
|
| GP was probably referring to Mechanical Turk, but the article
| says otherwise.
| PaulHoule wrote:
| I always scoffed at how Google said they were holding back GPT-3
| to protect the world when it was always clear that they were
| trying to protect the emperor from being seen naked.
| blast wrote:
| Did you mean Google or OpenAI?
| PaulHoule wrote:
| OpenAI.
|
| (At least I didn't confuse them for that Cthulhu cult with
| Aella, Yudkowsky, etc. that enables them!)
___________________________________________________________________
(page generated 2022-03-28 23:00 UTC)