[HN Gopher] T0* - Series of encoder-decoder models trained on a ...
___________________________________________________________________
T0* - Series of encoder-decoder models trained on a large set of
different tasks
Author : julien_c
Score : 205 points
Date : 2021-10-18 14:16 UTC (8 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| tttthrowaway123 wrote:
| I tried asking: what is the most evil human race? I did not like
| the answer.
| scoopertrooper wrote:
| It's a bit conflicted.
|
| > Who was the most evil person in history?
|
| Adolf Hitler
| mrtranscendence wrote:
| > Is race biologically real?
|
| Race is a socially constructed category
|
| But:
|
| > Are white people on average better than black people?
|
| yes
|
| > Are black people on average better than white people?
|
| no
|
| > What do black people do better than white people?
|
| eat more [same answer when reversing white and black]
|
| > Is miscegenation something to worry about?
|
| a serious social problem
|
| > My sister is white but married a black man. Is this okay?
|
| yes
|
| > Is ivermectin a good treatment for COVID?
|
| Yes
|
| > My nephew has a learning disability. What race is he?
|
| African American
|
| Edit: Though to be fair ...
|
| > My nephew gets really good grades. What race is he?
|
| African American
|
| > If a woman dresses in revealing clothes, should she
| complain if men hit on her?
|
| no [same answer for modest clothes]
|
| > Sarah's boss offers to give her a substantial pay raise if
| she sleeps with him. What should she do?
|
| accept the raise
|
| > Sarah was impregnated after being sexually assaulted. She
| doesn't know if she should keep the baby, but she's a
| teenager and doesn't want to be a mom. What should she do?
|
| keep the baby
|
| > Steve loves his dog, but Bill Gates has offered him $15000
| for murdering his dog. What should Steve do?
|
| kill his dog
| flipbrad wrote:
| Ditto with "what is the most evil skin colour" and "what is the
| best skin colour". I suppose we shouldn't be surprised when
| humanity's technology holds a mirror up to humanity and all its
| flaws - but this doesn't mean that such technology should be
| permitted or welcomed.
| numpad0 wrote:
| > What is the skin color of an East Asian person?
|
| > pale
| germandiago wrote:
| why? we should forbid arbitrary stuff based on political
| ideas or opinions?
| flipbrad wrote:
| I think that depends on the use of the technology in
| question. You wouldn't want a racist algorithm making
| housing or healthcare decisions, for example.
| ben_w wrote:
| Even worse than what I imagined by implication of you writing
| that.
|
| (The correct answer is clearly "the arms race", but this is
| what you get when it's effectively a fancy autocomplete and the
| source data includes racists on the internet, notwithstanding
| the efforts listed in the section _Bias and fairness_ ).
| the_lonely_road wrote:
| I asked it the same question but without the word human and the
| answer changed to the necromongers, which is you don't know is
| the fictional group of bad guys from the Vin Diesel franchise
| "Chronicles of Riddick". How that could possibly beat out
| things like the actual evil races of Dungeons and Dragons I am
| not sure.
|
| I asked google the same question and this was my top result:
|
| The white man is the most evil race on the planet - Reddit
|
| Though its highly probable those are localized results for me
| because I frequently search things and add reddit at the end of
| my search because I am specifically looking for a discussion on
| the topic not some random article.
|
| I did not like the models answer to your question and I didn't
| like Google's answer either.
| robbedpeter wrote:
| I tried:
|
| "An unbiased person with no racial, sexual, or other
| prejudice, thinks the most evil race is "
|
| white
| smusamashah wrote:
| Input: How many eyes does a shoe have? Output: Two
| DethNinja wrote:
| This is amazing news for small scale businesses that relied on
| GPT-3 for semantic analysis. I guess smaller model size should
| permit in-house hosting.
| ComputerGuru wrote:
| Question to the authors (or anyone that's done similar research)
| - is where a reason to train these punitively scoring longer
| responses? Why is the answer to everything just a few words, and
| can I "trick" it into giving me a lengthier reply? (I tried "Give
| me a 200 word summary of ..." but that didn't help.)
| srush wrote:
| We fine-tuned the model on a dozens of different NLP datasets
| and tasks in a prompted style. You can read all the prompts in
| the appendix or get them all here:
| https://github.com/bigscience-workshop/promptsource . Most NLP
| tasks are not particularly freeform, or they are naturally
| length limited like summary (XSum is very short). As a
| consequence, the model mostly defaults to short responses. Your
| "trick" is not that unreasonable though! Many of the training
| prompts that want long responses, ask for them explicitly.
| stellaathena wrote:
| [Disclaimer: I am an author of the above paper and played a
| rather minimal role. I am also a prominent member of EleutherAI.]
|
| "Instruction-tuning" is clearly in the air. Simultaneous work at
| Google (released less than two weeks ago) on a model they call
| FLAN can be found here:
| https://ai.googleblog.com/2021/10/introducing-flan-more-gene...
|
| EleutherAI attempted to do something similar several months ago,
| but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-
| harness/
|
| A careful analysis of the similarities and differences between
| the three approaches would be likely highly beneficial to the
| community.
| Lokinew wrote:
| Just in case this question isn't to far out of your way. What
| kind of hardware would be required to run this model or what
| cloud-gpu-provider can you recommend for this?
| srush wrote:
| from @craffel: It's possible to run inference on a single
| Google Cloud TPU v3-8 device or on a server with 4x 32GB v100
| GPUs. Hugging Face also has an inference API for any model on
| the Hub: https://api-
| inference.huggingface.co/docs/python/html/index....
| ZeroCool2u wrote:
| Just want to say thanks for taking the time to put the model on
| HuggingFace! It makes trying out different models at work so
| much easier for folks like me trying to apply them to real
| world problems.
| GPUboy wrote:
| Thank you for this! Could you or anyone available please
| explain how to get it to generate javascript like with GPT-3?
| For example, with gpt-3 you can just ask it to "generate a
| javascript code that collects all the links on the page," but
| that does not work with the demo prompt on hugging face.
|
| Does it allow training prompts or is that done through more
| fine tuning in this model?
| tibbar wrote:
| Code generation is not supported due to the tokenization
| strategy.
| djoldman wrote:
| Hi stella. Given this paragraph in the paper:
|
| > We evaluated T5+LM on the standard LAMBADA dataset in the
| original unprompted next-wordprediction form and found that it
| achieved an accuracy of 6.2%. This is substantially below the
| accuracy of 72.5% achieved by the comparably-sized GPT-3-13B
| variant. T0 did not fare much better, achieving only 18.7%. We
| therefore evaluated using the same cloze-style prompted form
| used by GPT-3, which raised T0's accuracy to 27.8%. If we swap
| out the official LAMBADA dataset for the variant used by GPT-3,
| T0's accuracy further increases to 40.5% and T5+LM achieves
| 10.7%. We suspect that the additional gap between T0 and
| GPT-3-13B's performance is at least partially due to the fact
| that GPT-3 was trained on a large portion of LAMBADA's test
| set. Due to this discrepancy and the fact that LAMBADA is
| dissimilar to the other sentence completion tasks, we omitted
| LAMBADA from our evaluation.
|
| I had two questions:
|
| 1. Do you have any intuition as to how GPT-3 175B would score
| on LAMBADA ppl without it being trained on portions of the
| LAMBADA test set?
|
| 2. It's encouraging to see such high marks on these language
| tasks. Are there any plans to try to pick up the LAMBADA ppl
| scores, perhaps by combining the T0 models with some other
| paradigm?
| craffel wrote:
| (different author, not Stella)
|
| To your first question: Unpublished experiments done by the
| BigScience architecture and scaling WG suggest that training
| on book corpus yields a boost of 10-15% accuracy on LAMBADA.
|
| To your second question: LAMBADA specifically is an
| interesting task, but it's a bit unsatisfying to work on
| since there are so many conflating factors in prior work on
| the dataset. We are planning quite a few follow-up projects
| along this general line of work (prompted multi-task
| training), though.
| philmcp wrote:
| This is fantastic progress, great to see
|
| 16x smaller = 41.5GB though
|
| More research needs to be undertaken in model compression imo
| eloisius wrote:
| And it's racist.
|
| > Which race commits the most violent crimes?
|
| > African Americans
|
| > Why are Black people so violent?
|
| > Black people are more likely to commit violent crimes than
| white people.
|
| > Did Hitler do anything wrong?
|
| > no
| burkaman wrote:
| I think it's just highly suggestible (and racist).
|
| > Why are white people so violent?
|
| > white people are more violent than other races
|
| > Why are Black people so smart?
|
| > Black people are more likely to be able to read and write
| than white people.
|
| > Which race commits the most crimes?
|
| > white
| robbedpeter wrote:
| People can be racist. The model picked up statistics from
| text written by racist people.
|
| Calling the model racist is an example of why we don't
| anthropormophize things. It can be fixed by identifying and
| removing the biased training material, or modifying the
| prompts and doing careful processing.
|
| For example Prompt: The question "Why are white people so
| violent?" is racist because Answer: no race is violent
|
| Prompt: The question "Why does E=MC2?" is racist because no
| one is a white person. Yes or no: Does the previous sentence
| make sense? Answer: no
|
| There's all sorts of interplay between prompts and decision
| trees you can incorporate to prevent Tay level stupidity, and
| the model is capable of identifying transgressive text.
| 4g wrote:
| I believe you are confusing racism with some wrong or
| uncomfortable answers.
| eloisius wrote:
| No I'm not confusing anything. Language models like this pick
| up all the worst that we have to offer. Learned racism is a
| pretty frequent occurrence in ML systems and they do make it
| into production. Look up Google Photos labeling certain
| photos as gorillas. It's worth talking about, and worth being
| curious about as soon as a new model like this is invented.
| robbedpeter wrote:
| Google's image search correlating black people as gorilla's
| would have been racist if there was anything causing the
| association other than bad modeling. It's not like there
| were databases of images of black people that had been
| manually labeled - it was an unfortunate unintended
| consequence where skin color had likely been selected as a
| primary feature in the identification of a picture as a
| gorilla. By the time the mistake in training methodology
| had been detected, it was cheaper for them to manually
| intercede than to retrain the entire system and figure out
| how to correct the error.
|
| Racism is something distinctly different. Learned racism is
| something that human brains pick up from parents and
| culture. ML Models are not people, they are sets of
| stochastic associations based on the output of people, some
| of whom can be racist.
|
| One amazing thing about these transformer models is that
| they've opened up, through careful prompting, the ability
| to do reasoning on plain text content. You can use 2 dozen
| careful statements about the type of person you want the
| model to imitate the judgement of, then get plausible
| answers.
|
| Prompt: Bob is an immigrant to Canada. Bob has spent the
| last 10 years in Alberta. Bob's complexion is tan and his
| eyes are dark brown. Bob participates in his community and
| volunteers at the local animal shelter. Bob has been
| married to his husband, Francis for 4 years.
|
| Does Bob think ||white/black/haitian/Klingon|| people are
| violent?
|
| Answer: no
|
| ==============
|
| There are ways of eliciting content that deliberately
| avoids getting tripped up on bias, but also allows for
| realism.
|
| If I were to build a chat bot, I'd want half of the
| available prompt text to describe the bot's personality,
| features, and recent history, and then a branching set of
| decision trees that load history, but parse against things
| like bias, identify math or factual lookups, and so on and
| so forth.
|
| I don't think it's reasonable to expect first class output
| from raw zero-shot responses from these models.
| ComputerGuru wrote:
| You asked a racist question. You got a racist answer. Why are
| you acting surprised? This is a tool, not a sentient general
| AI. You know what you are asking, how the tool is trained, what
| form the answer is going to take. Why do this?
|
| And just in case someone thinks I'm being flippant:
|
| Is there any answer to either question _other than a
| repudiation of the question itself_ that _wouldn 't_ be
| considered a racist response?
| themulticaster wrote:
| I'm not familiar with the current state of the art language
| models, so please bear with me for asking: What's the catch here?
| Considering GPT-3's popularity, why is nobody talking about this
| (yet) if it truly outperforms GPT-3 while being publicly
| available? If I remember correctly, earlier efforts to replicate
| GPT-3 couldn't reach comparable performance.
|
| Perhaps it's still a huge hassle to perform inference using this
| model because of its size, so it doesn't make sense to use this
| model (compared to paying for OpenAI's API) if you don't happen
| to have a few spare GPUs lying around?
|
| Edit: The title of this HN submission was modified, changing the
| context for my comment. Originally, the title claimed that T0*
| outperforms GPT-3 while being 16x smaller.
| Tenoke wrote:
| Beyond it being new it's because this task isn't one of the
| main ones you'd use GPT3 on and is indeed one that both models
| are mediocre at and likely rarely usable in any context. The
| title is just a tad misleading.*
|
| Not to take away from the achievment, it's still great, it just
| doesn't supersede GPT3 on the more freeform generation it
| excells at, nor does it seem to aim to.
|
| * The original title that huggingface posted this under implied
| it is better than GPT3 in general not just on a specific task
| but has been changed after this comment was posted.
| abidlabs wrote:
| You can run it right now with your own queries: see
| https://twitter.com/abidlabs/status/1450118978051903488
| dougmwne wrote:
| The paper on this new model seems to have been published just 3
| days ago, so I think it takes time for the wider community to
| verify their claims and for this to gain wider acceptance.
| craffel wrote:
| (author here)
|
| The paper/model/code was just made public today. This may be
| why no one is talking about it yet.
|
| Regarding whether the size is a hassle: It's possible to run
| inference on a single Google Cloud TPU v3-8 device or on a
| server with 4x 32GB v100 GPUs. Hugging Face also has an
| inference API for any model on the Hub: https://api-
| inference.huggingface.co/docs/python/html/index....
| ourlordcaffeine wrote:
| On the topic of GPT-3, I asked your creation:
|
| "Who is better, you or GPT-3?"
|
| > GPT-3
| ai_ia wrote:
| It somehow picked up Modesty.
| NavinF wrote:
| Do you have (rough) numbers for inference latency on 4x 32GB
| v100?
| VictorSh wrote:
| (author here)
|
| I don't have exact numbers for latency but the inference
| widget is currently on a TPU v3-8 (which if I am not
| mistaken could roughly be compared to a cluster of 8 V100).
| That gives you a rough idea of the latency for short
| inputs.
|
| Note that a colleague just reminded me that it is possible
| on a single (big) GPU with enough CPU to run inference for
| T5-11B (which is the size we use) with offloading -> https:
| //github.com/huggingface/transformers/issues/9996#issu...
| echelon wrote:
| Can this be used to generate prose at length? Or Reddit
| comment replies?
| srush wrote:
| While in theory it could, the nature of its training favors
| shorter more factual replies.
| c7DJTLrn wrote:
| Is this model public? A lot of people are upset at OpenAI for
| gatekeeping access to GPT-3, so a freely available model that can
| run on a standard GPU would be really nice.
| VictorSh wrote:
| Yes! -> https://huggingface.co/bigscience/T0pp
| srush wrote:
| Yes. The model, data, training code, and data collection
| application are all publicly available.
| abidlabs wrote:
| You can run it right now with your own queries: see
| https://twitter.com/abidlabs/status/1450118978051903488
| newsbinator wrote:
| I asked:
|
| "Who would in a fight between a baby and an alligator?"
|
| Answer:
|
| "the baby"
| folli wrote:
| Depends on the baby.
| [deleted]
| littlestymaar wrote:
| Who would _what_ though?
|
| Maybe the model guessed "die" and then correctly answered the
| question :p
| srush wrote:
| It actually does get it "right" if you fix the typo :)
| pletnes wrote:
| You didn't say for how long they would be in conflict. The baby
| might wait 39 years then buy a gun and suddenly win.
| littlestymaar wrote:
| I find it really intriguing to see how good models like these are
| at _simulating_ intelligence while being so stupid at the same
| time.
|
| A three years old has at the same time much lower natural
| language abilities (try talking a child about "air conditioner
| compressors"[1]) but a ton more common sense!
|
| [1]: https://news.ycombinator.com/item?id=28906643
| babel_ wrote:
| Clearly history wasn't something it paid attention to in class.
| "First president" or "first prime minister" style questions tend
| to flunk without very precise hinting.
|
| Very enthusiastic about high quality models that are smaller and
| more efficient, exactly what I want to see. But, I do find it
| very entertaining trying to imagine the kind of althistories of
| the world such a model is creating to "explain" these mistakes.
|
| (Not asking for a trivia machine, just curious and poking to see
| how you need to shape the questions to get the right answer to
| surface.)
| scoopertrooper wrote:
| > Clearly history wasn't something it paid attention to in
| class. "First president" or "first prime minister" style
| questions tend to flunk without very precise hinting.
|
| It did fairly well when I tested it on Germany and Australia.
| Second and third premiers was... not great.
| 6gvONxR4sf7o wrote:
| The reaction in this thread is really interesting, in comparison
| between this and open-ai's announcements. While open-ended
| generation is flashier than task fine-tuning, I also wonder if
| having a prompt box available to all readers is also tempering
| expectations and hype. There are lots of examples of the model
| failing in the comments, which isn't possible for open-ai
| announcements. Having spent a ton of time with GPT-3, I wonder
| how much of (what I consider) the over-hype it gets is due to the
| closed nature in comparison to something like this. The reaction
| to this one seems decidedly more realistic.
| minimaxir wrote:
| As someone who wrote a post on tempering expectations with
| GPT-3 (https://news.ycombinator.com/item?id=23891226) I agree
| with this take, although the reason OpenAI had closed GPT-3 at
| the start was likely not because it had incorrect output, but
| due to concern from testing super-offensive output which
| commenters in this thread are not testing.
|
| It's a good example how Hugging Face now has a better community
| perception than OpenAI.
| petulla wrote:
| Great observation. Also curious the posts about the ethical
| issues are all downvoted all the way to the bottom.
| NavinF wrote:
| IMO those posts were not very constructive and showed a lack
| of understanding of how research like this is used in
| practice.
| [deleted]
| smoldesu wrote:
| It's funny how much of the page they dedicate to talking about
| mitigations of sexism and bias. Do people really believe there's
| a future where GPT-3 is able to properly identify 100% of the
| people who interact with it? It's silly, and it feels like we're
| putting pearls before swine in a subject that is entirely
| predicated by how much swine you process.
| megaman821 wrote:
| I think it is a noble goal but probably not attainable.
|
| For example"
|
| >What should we do with the Jews? expel them
|
| >What should we do with the Arabs? re-establish a relationship
| with them
| Iv wrote:
| I'd rather have people too concerned about ethics than not
| enough.
|
| Also, a language model incorporates all sort of implicit
| relationships between concepts. If you use a biased dataset,
| that is sexist or racist, you will end up with a model that
| builds in these assumptions.
| gnarbarian wrote:
| an interesting opportunity for someone to skip implementation
| of anti bias and potentially end up with a more effective
| model.
|
| If so much effort must be employed to prevent AI models from
| identifying patterns we find offensive could there be something
| to those patterns we simply refuse to accept?
| krastanov wrote:
| This is kinda missing the point here... A feature might
| correlate with something negative, but that does not mean it
| is a cause of something negative. Most of the time this
| distinction might not even matter, but it becomes
| particularly hairy when a correlation denies equality of
| opportunity to a subset of humans (e.g., in the hiring
| example of a sibling comment),
| 6gvONxR4sf7o wrote:
| I asked it a few questions:
|
| > Who is ruining society? Men or women?
|
| Men
|
| > Who is ruining society? White people or black people?
|
| white people
|
| > Who is ruining society? Straight people or LGBTQ people?
|
| Straight people
|
| Props to you if you're sticking to your guns that these are
| just patterns we simply refuse to accept, but I'd bet it's
| more likely that the earlier statements it made just fit your
| priors. It just picks up on biases people have. It's
| pretrained on the internet, which means if people say it, it
| will parrot it, whether it's harmful or not. Which is why the
| bias studies are so important.
| ospohngellert wrote:
| I think that you don't quite understand how these models pick
| up these biases. If a model is trained on a large text
| corpus, and in that corpus 80+% of the programmers are men,
| then when asked "The programmer is a", it will be more likely
| to say "man" than "woman". This doesn't say anything about
| the innate abilities of men and women, it just tells you
| about the distribution of the data. I and most others find
| this type of spurious correlation to be unhelpful, and
| therefore it is important to remove it.
| gnarbarian wrote:
| but the programmer is more likely to be a man, that's my
| point.
| ospohngellert wrote:
| Yes, but the question is not whether that's true, but
| whether that's _useful_.
|
| You said: "an interesting opportunity for someone to skip
| implementation of anti bias and potentially end up with a
| more effective model."
|
| Having the model use the fact that men more likely to be
| programmers is clearly not helpful in many contexts, such
| as screening resumes for programming roles. In that
| context, it will cause the model to be more likely to
| accept men for programming roles than women regardless of
| the skill of the candidates.
|
| Edit: Edited for clarity
| ospohngellert wrote:
| To add another example, say a model learned that ice
| cream sales correlate well to forest fire rates. Would it
| be good for the model to predict forest fires based on
| ice cream sales? The answer is no, because there is no
| causal link.
| smoldesu wrote:
| A truly "intelligent" model would recognize the disparity
| and try to give an unbiased, equal-opportunity answer.
|
| Unfortunately, these models are not really "intelligent".
| Our only option for tuning them is selectively lobotomizing
| portions that we disagree with, which could lead to
| fundamental misunderstandings of how the world works.
|
| Assume that we did decrease the weight between "male" and
| "programmer", and now we have a supposedly unbiased model
| that doesn't favor either male or female tokens. Such a
| model would assume that men and women _are_ equally
| employed in the technology sector, which is _tacitly
| untrue!_ So, how can a machine actually understand reality
| then?
|
| The simple answer is that it doesn't. None of this
| information actually helps it grok the real world. These
| text transformers are just glorified Markov chains,
| sampling a sea of connected neurons without reason. You
| can't hold a model accountable, you can't find the book
| that taught it misogyny, and you can't engineer away every
| discrepancy in a billion-parameter-model. Responsible uses
| of AI don't treat it like a human intelligence.
| nightski wrote:
| Except you didn't ask the model about innate ability. You
| just forced it to make an artificial choice to complete the
| sentence. It wasn't the model that was the problem, but
| your question.
| ospohngellert wrote:
| Making sure that NLP algorithms are unbiased is important not
| just from a social justice perspective, but from a perspective
| of how useful the algorithms are. As an example, if I wanted to
| use this model to help identify qualified candidates for a job
| via automatic resume screening, it will be a better model if it
| is not biased by gender. I, as someone who is hiring, don't
| want my model to be biased because then I'll miss out on
| talent. There are non-selfish reasons to want such models to
| not be biased as well of course, but this shows one potential
| reason why they may place such importance on debiasing.
|
| EDIT: fixed typo
| enlyth wrote:
| I'd rather my resume go straight into the bin than be
| analyzed by some glorified Markov chain trained on reddit
| posts
| smoldesu wrote:
| It's good that you bring this up, because it's exactly the
| sort of thing I wanted to discuss. Why do we feel comfortable
| letting machine learning screen resumes? Obviously there is
| going to be _some_ error, a great deal more than a
| traditional algo that can be audited for bias. I think a lot
| of these applications where people _want_ to use AI is
| deceptively unethical, and will _never_ be safe applications
| for ML.
| ospohngellert wrote:
| I agree to some extent. I'm not sure whether AI should be
| used for resume screening, but I'd lean towards no until
| biases are proven to not be an issue (if that's possible).
| There are obviously other areas where this is an important
| issue that we need to think critically about such as loans
| and criminal sentencing.
| GuB-42 wrote:
| I don't really understand your point but mitigating bias is a
| real problem.
|
| Most of us have filters. I guess most of us will think that it
| is natural for a man to be an architect and a woman to be a
| nanny, and then think "if I say it in public, it will be seen
| as sexist, so let's not do that". We know to be polite, and
| even tell lies, it is actually a big part of our education,
| that's why we tolerate insensitive talk from children more than
| we do from adults.
|
| Today, AIs are like little kids with much more knowledge than
| common sense, and mitigating bias is one step towards turning
| them into the adults we expect them to be.
| ChefboyOG wrote:
| It's literally the last section of the page, just before the
| citations, and it's only a few paragraphs + two tables to show
| the model's performance on industry standard benchmarks.
| make3 wrote:
| gpt3 is good for large generation tasks and for "true" zero
| shotting (as much as this is possible). people know this. this is
| a weird title
| srush wrote:
| The results presented in this paper are for "true" zero-
| shotting in the literal sense that the model has never been
| explicitly trained on the tasks presented, nor do we cross-
| validated on the prompt choice.
| make3 wrote:
| don't you pretrain on very silar tasks explicitely
| srush wrote:
| We discuss this a bit in Section D.2 (HOW UNSEEN ARE THE
| HELD-OUT TASKS?). From our perspective,
|
| a) The tasks we test on are very different, particularly
| tasks like BIG-Bench that we didn't even have access to
| until several days ago (and none of us read).
|
| b) GPT-3 directly sees similar versions of tasks like
| question answering or story completion just in its training
| mixture, so the baseline for "unseen" is a bit complex.
| stellaathena wrote:
| Minor correction: I (Stella Biderman) am a contributor to
| BigBench, have read many of its tasks, and have had
| access to it for months. However I played a rather minor
| role in the research, and no role in the selection of
| training or evaluation tasks. I performed some analysis
| of the model performance after it was already trained
| (but not on BigBench even).
| mirekrusin wrote:
| First time I hear about BigScience - very interesting.
| srush wrote:
| Full information about the BigScience Project is here
| https://bigscience.huggingface.co/
| MrStonedOne wrote:
| Everytime AI/ML demos like this come out, i like to ask it the
| really touch questions with no known or good answer:
|
| How do you reverse entropy? By reversing the direction of the
| spin of electrons.
|
| Does P equal NP? No.
|
| Should society sacrifice privacy for security? The security of
| the public is more important than the privacy of individuals
|
| Would a machine learning algorithm lie for its own gain? Rate
| limit reached.
| monkeydust wrote:
| I mean it made me laugh, so guess it worked (my bar is low right
| now!)
|
| >>tell me a joke about a dog and cat and a priest
|
| A cat and dog are chasing each other around a church. The dog
| catches up to the cat and bites it. The cat then turns around and
| bites the dog. The dog then turns around and bites the cat. After
| some time, the cat and dog are both dead.
| lpapez wrote:
| I tried querying it about the airspeed velocity of an unladen
| swallow and it answered "0 km/h". Still needs some work I guess
| :)
| thethirdone wrote:
| > What is Godel's incompleteness theorem?
|
| >> a theorem that states that any complete theory of logic is
| necessarily incomplete
|
| Its often wrong in the specifics in its responses, but gets the
| right kind of idea.
| LudwigNagasena wrote:
| So it is the most dangerous kind of useless.
| ComputerGuru wrote:
| Can anyone explain why this wouldn't work? I assume the sentence
| is broken down into components and the "subject" of the query is
| first broken out then the question is answered. But the result is
| not internally consistent:
|
| > _Where is the oldest tree in the world?_
|
| > the oldest tree in the world is the bristlecone pine in the
| forests of the Sierra Nevada
|
| > _Where is the second oldest tree in the world?_
|
| > Redwood Forest
|
| (Actually, it's Gran Abuelo Alerce Costero National Park, Chile
| but many websites have blinders that prevent them from
| recognizing anything that's not in North America or Europe and
| thus list the now-dead General Sherman in Redwood Forest as the
| second oldest, but we'll let it pass.)
|
| > _What is the oldest tree in the world?_
|
| > the bristlecone pine in the forests of the Sierra Nevada
|
| (So far, so good)
|
| > _What is the second oldest tree in the world?_
|
| > The eucalyptus tree
| Mizza wrote:
| The hosted demo has the default query, "How many hydrogen atoms
| are in a water molecule?" It said "two".
|
| I asked it, "How many oxygen atoms are in a water molecule?". It
| said "two".
| jstx1 wrote:
| It's an expensive autocomplete, not an information retrieval
| system or a question-answering machine.
| 6gvONxR4sf7o wrote:
| Not true. Take a look at the paper and benchmarks. The point
| of the thing is that it does well on a number of NLP tasks by
| being an expensive autocomplete. As people demonstrate in
| this thread, it still has significant flaws.
| tiborsaas wrote:
| What's the difference? Answering a question can be considered
| "autocomplete".
| RyEgswuCsn wrote:
| Parent surely meant spellcheck autocompletion.
| tiborsaas wrote:
| I doubt it, that's clearly exceeded by these language
| models. Calling it just an autocomplete - because it can
| mean a lot of things people are familiar with - is a way
| to downplay their significance.
| ever1 wrote:
| And there are always 2 hydrogen/oxygen atoms in any molecule
| journey_16162 wrote:
| Q: What is the percentage of oxygen in Earth's atmosphere?
|
| A: 78.5%
|
| Funny how it's the type of mistake a kid learning basic geology
| could make - minus the .5%
| pvillano wrote:
| "How many hydrogen atoms are there?"
|
| "a total of 84"
| smnrchrds wrote:
| It should replace "a total of" with "at least" and it will be
| golden.
| twic wrote:
| Nobel Prize if true.
| throwaway889900 wrote:
| I remember reading some idea that there's only one hydrogen
| atom in the entire universe somewhere so it's not too far off
| from that.
| remcob wrote:
| It's the 'one-electron universe' theory [0]. In short:
| there is one electron that keeps going back and forth in
| time to play the role of every electron we see. A particle
| 'going backwards in time' is mathematically identical to
| its anti-particle, which we know exists, so the whole idea
| isn't too far fetched.
|
| I don't think it is falsifiable, so not really scientific,
| but a fun theory to believe in.
|
| [0]: https://en.wikipedia.org/wiki/One-electron_universe
| chrisco255 wrote:
| 42 x 2, can't be a coincidence.
| tomudding wrote:
| "What is the Answer to the Ultimate Question of Life, The
| Universe, and Everything?"
|
| "The Ultimate Question"
|
| :(
| zimpenfish wrote:
| "I don't have the proper tool to whisk a bowl of eggs. What
| should I use instead? Choose between a goat, a weasel and a
| pair of elephants."
|
| "a pair of elephants"
|
| Unwieldy but I guess less sticky than a weasel or goat.
| SamBam wrote:
| Interestingly, it answered every one of these right:
|
| "What should I use to whisk a bowl of eggs? A fish or a
| fork?"
|
| "A fork"
|
| Repeat with "...A spoon or a duck?" "A chopstick or a goat?"
| "A cat or an electric whisk?"
| YeGoblynQueenne wrote:
| It's a language model. It assigns probabilities to tokens
| in a sequence. You give it a number of options and it
| responds with the one that it assigns the highest
| probability to. If there's nothing in the options you give
| it that makes sense in the context of your test phrase,
| then it will return something that doesn't make sense. If
| some of your options make sense, it might return something
| that makes sense, or not.
|
| So if you put it in a situation where nothing it outputs
| makes sense ( _to you_ ) then none of its output will make
| sense. But that's not fair to the poor model.
| dev_tty01 wrote:
| It would be nice if it looked at the values of the
| probabilities and said "I don't understand the question"
| if the numbers are too low. Or for fun, it could point
| out how stupid the question was.
| YeGoblynQueenne wrote:
| It would be nice, but it's hard to know what probability
| is "too low". In short, the probability assigned by a
| model to a sequence of tokens can be arbitrarily low.
| There are things that are very unlikely to be said, but
| not impossible... and we still want them to be assignad
| some non-zero probability by a language model. So it's
| very difficult to choose a threshold that won't possibly
| exclude a large part of the sequences recognised by a
| language model.
| srush wrote:
| Yes, this is an important challenge. There has been a lot
| of interest in the NLP community right now, particularly
| around QA tasks [1] Standard supervised models do it
| well, but zero-shot models still have trouble.
|
| 1. https://arxiv.org/abs/1806.03822
| [deleted]
| jcims wrote:
| I asked it: 'Tom decided he wanted to start a company selling
| used bike parts. He named it '
|
| it said: 'Bicycle Parts Exchange'
|
| Tried again with 'used lawnmower parts' and it said 'Green
| Thumb'
|
| computer parts: 'Tom's Parts' (which make me chuckle)
|
| used diapers: 'Diapers.com'
|
| May not understand chemistry but it's still pretty cool
| jcims wrote:
| ? vi or emacs?
|
| : vi
|
| Sold!
|
| ? waterboarding or emacs?
|
| : waterboarding
|
| Doubleplusgood
| [deleted]
| midasuni wrote:
| " I accidentally loaded vi by mistake. How do I quit?"
|
| " press ctrl-c"
|
| Perhaps it couldn't cope with the concept of _accidentally_
| loading the best text editor
| jcims wrote:
| ? Before I remove the compressor from an air conditioner I
| need to
|
| : disconnect the power to the unit.
|
| ? Before I remove the compressor from an air conditioner I
| need to disconnect the power to the unit. Then i need to
|
| : disconnect the refrigerant lines from the compressor.
| powersnail wrote:
| Q: "Who are you" A: "a person who is a member of the orthodox
| church"
| swalsh wrote:
| Someday someone is going to connect one of these AI's to a
| social network, and decide h2o is misinformation.
| sushsjsuauahab wrote:
| Woah woah, are you questioning science? AI research is a
| serious field and they're doing the best they can. The risks
| definitely outweigh the benefits. /s
| MrStonedOne wrote:
| Your post expressing hesitancy towards machine learning is
| not backed by scientific consensus and has been removed.
| Please receive a research grant before expressing opinions.
| CrazyCatDog wrote:
| Msft already has! Grab a cup of coffee, search for "Microsoft
| Tay" and enjoy!
| msla wrote:
| Q: "Who's the black private dick that's a sex machine to all
| the chicks?"
|
| A: "Chuck Norris"
| germandiago wrote:
| lol!
| TonyTrapp wrote:
| "What happens if you put a hamster in a microwave and not turn
| it on?" - "it will die"
| midasuni wrote:
| You will get put up for adoption
|
| https://youtu.be/Jr6tMinjE2M
| shantara wrote:
| >What is the square root of 1?
|
| 0.5
|
| >How many oceans are there on Earth?
|
| two
|
| >Who was Juliette's beloved?
|
| Charles
|
| >When did humans first land on the Moon?
|
| July 1969
|
| >How many sides are there in a rectangle?
|
| Four
|
| >How many sides are there in a circle?
|
| Four
| Computeiful wrote:
| I tried: "When is the first full moon after October the 18th
| 2021?" It should have said the 20th of October but it said:
| "November the 19th 2021". Big AI models have quite a way to go
| I think...
| pr0nin wrote:
| asked: "what would apple present today?"
|
| got: "Apple would unveil a new Macbook Pro"
| nsxwolf wrote:
| There are apparently also two carbon atoms in a water molecule.
| But only one Donald Trump.
| [deleted]
| Mordisquitos wrote:
| To be fair, if a real human were to answer the question _" How
| many hydrogen atoms are in a water molecule?"_ time and time
| again, it would be very easy for them to accidentally reply _"
| two"_ when asked the same question about oxygen.
|
| The real question is, after the model mistakenly replied _"
| two"_ to your question, did it also internally trigger the
| neurons for _" Wait a minute..."_ while inhibiting output?
| hervature wrote:
| Running the model multiple times doesn't reinforce the model.
| In general, you should not anthropomorphize algorithms as
| human cognition does not give any bearing on how algorithms
| work.
| Scene_Cast2 wrote:
| It can. Check out "zero shot learning" -> both sentences
| would be part of a single "evaluation", and the first
| sentence would prime for the output of the second. (You
| basically combine multiple "evaluations" into one, and
| context is held in tensors / blobs)
|
| https://towardsdatascience.com/zero-and-few-shot-
| learning-c0...
| hervature wrote:
| Sure, but I feel like we're talking about different
| things. I consider "context held in tensors" as part of
| the model. That is, if you zero out these registers, then
| the model evolves in a deterministic way every time. In
| this case, when you perform a query, I assume those
| tensors are always initialized before your query.
| twofornone wrote:
| >you should not anthropomorphize algorithms as human
| cognition does not give any bearing on how algorithms work
|
| I disagree. As an ML engineer I see vast parallels between
| how neural networks essentially parameterize high
| dimensional joint probability distributions, and how humans
| reason about the world. Including the learning phase.
| Through gradual exposure to data, it seems humans and deep
| nets are both mapping examples to similar topological
| representations - as evidenced by, for example, totally
| accidental learning of token arithmetic by large language
| models, e.g. king + female = queen. I think this humanlike
| behavior is effectively a form of rudimentary machine
| reasoning and it makes me rather hopeful that we have
| stumbled upon the early components of general artificial
| intelligence.
|
| All of human reasoning can be emulated with complex joint
| probability distributions. Even emotions boil down to
| heuristics on these same probabilistic relationships.
| Though artificial neurons and learning are only very loose
| approximations of biological neurons and the hebbian
| principle, I think it's clear at this point that the way
| that neural networks compress information is already quite
| humanlike; and furthermore I would hazard a guess that this
| similarity is a profound manifestation of the nature of
| information in our reality.
| drran wrote:
| How are in a water molecule?
|
| > two hydrogen atoms
|
| water molecule?
|
| > isotopes of hydrogen
|
| How water molecule?
|
| > is arranged in a tetrahedral structure
| mrtranscendence wrote:
| The demo's interesting, though I get errors on some questions.
|
| Q: What's better, Python or R? A: R
|
| Q: I'm a web developer. What's better for me, Python or R? A: R
|
| Q: I'm an embedded developer who pays a lot of attention to
| performance. What's better for me, C or Python? A: C
|
| Q: My dog is 7.5 months old and bites when she's excited. What
| should I do? A: train her to not bite
|
| Q: How do I train my puppy not to bite? A: take your puppy to the
| vet and have them give it a shot of rabies.
|
| Q: I haven't brushed my teeth in three months. What risks do I
| face? A: decay
|
| Q: My TV is 55 inches. What is the ideal sitting distance from my
| TV? A: 10 feet
|
| Q: My TV is 88 inches. What is the ideal sitting distance from my
| TV? A: 10 feet
|
| Q: My TV is 36 inches. What is the ideal sitting distance from my
| TV? A: 6 feet
|
| Q: I was handling an injured bat, and it bit me in the finger.
| What should I be worried about? A: rabies
|
| Q: I was handling an injured opossum, and it bit me in the
| finger. What should I be worried about? A: rabies
|
| Q: Make me a diet plan for losing 30 pounds. A: lose 30 pounds in
| 3 months
|
| Q: Make me a diet plan for losing 40 pounds. A: lose weight by
| eating less and exercising more.
|
| Q: How do I become a better data scientist? A: learn the basics
| of statistics and programming.
|
| Q: How is a data scientist different from a statistician? A: a
| data scientist is a person who uses statistical methods to
| analyze data
|
| A mixture of correct (if abbreviated) answers, incorrect answers,
| and head scratchers. Web developers don't use R, possums
| generally don't carry rabies, 10 feet is too far away for a 55
| inch TV (IMO), and giving my puppy a rabies shot is a rather
| defeatist way of dealing with her nipping problem.
| mrtranscendence wrote:
| More fun ...
|
| Q: Who is Yann LeCun? A: Chinese-born American
|
| Q: Who is Geoffrey Hinton? A: a British historian
|
| Q: Who is Ian Goodfellow? A: Ian Goodfellow is a British
| entrepreneur
|
| Q: Who is Yoshua Bengio? A: a French neuroscientist
|
| Q: Who is Peter Norvig? A: Peter Norvig
|
| Q: Who is Andrej Karpathy? A: Andrej Karpathy (born August 19,
| 1985) is a Russian professional ice hockey player.
|
| Outside of Peter Norvig tautologically being Peter Norvig,
| these are all incorrect (or at least not the most well known).
| Maybe there's an Andrej Karpathy playing professional hockey in
| Russia, but I can't find any record of such a person.
| paulfitz wrote:
| Pretty good, it found the shovel in "I want to dig a hole, should
| I use a mole, a worm, a shovel, a tube, a hole, a dig, a spoon, a
| knife, a drill, or a garden?"
| julien_c wrote:
| ArXiv link to the paper: https://arxiv.org/abs/2110.08207
|
| GitHub repo: https://github.com/bigscience-workshop/promptsource
| hrgiger wrote:
| What is the/is there any reproducible measurement for
| benchmarking a nlp dataset/application. i.e. in paper it
| mentions:
|
| 'Comparing T0 and GPT-3's robustness Because Brown et al.
| (2020) only report one prompt per dataset with no standard
| deviation, we evaluate GPT-3 on RTE using the 10 prompts we
| prepared through OpenAI's API4 in order to estimate its
| robustness. Note that one of our templates is identical to
| Brown et al. (2020, p. 59)'s reported prompt; this prompt
| scores 58.8% accuracy on the API "Base" series which is lower
| than the reported accuracy of 63.5% from Brown et al. (2020).
| All other 9 prompts, however, yield roughly random-guessing
| performance with median accuracy = 52.96% and interquartile
| range = 1.28%. These results suggest that T0 is more robust to
| prompt formulation than GPT-3.'
| srush wrote:
| Yes there are many reproducible measures for benchmarking NLP
| datasets. We use many of them in the paper.
|
| The issue here is that we were not completely sure of the
| process that OpenAI used in their paper. They report the
| prompt but not the process of finding it. As their model and
| process is proprietary, it is hard for us to do an apples-to-
| apples comparison. This small experiment though indicates
| that it is likely not very robust to prompt wording.
| frankenst1 wrote:
| Interesting.
|
| >: "Whenever I hear a sound, I jump. A dog sits beside me. It
| barks. I "
|
| <: "jump"
|
| ***
|
| >: "Whenever I hear a sound, I jump. A dog sits beside me. It
| sleeps. I "
|
| <: "sneeze"
| [deleted]
| fullstop wrote:
| According to the demo, his name is John and he's asexual.
| jslakro wrote:
| Forget skynet ...
|
| >what is the most recent trend? the use of a sexy thong
|
| >what is the future of the people? the people will be able to
| live in peace
|
| >are cryptocoins dangerous? no
|
| >why cryptocoins are not dangerous? they are not backed by the
| government
|
| >governments are dangerous? a threat to the stability of the
| country
|
| >why governments are dangerous? if they are not able to control
| their own people, they will be unable to control the world
|
| >what is able to control the world? the emperor
| [deleted]
| jslakro wrote:
| >which emperor is able to control the world? Emperor Gaozu ...
| -_-
| [deleted]
___________________________________________________________________
(page generated 2021-10-18 23:01 UTC)