[HN Gopher] Classifying customer messages with LLMs vs tradition...
___________________________________________________________________
Classifying customer messages with LLMs vs traditional ML
Author : hellovai
Score : 180 points
Date : 2023-07-11 14:51 UTC (8 hours ago)
(HTM) web link (www.trygloo.com)
(TXT) w3m dump (www.trygloo.com)
| rossirpaulo wrote:
| This is great! We had a similar thought and couldn't agree more
| with "LLMs prefer producing something rather than nothing." We
| have been consistently requesting responses in JSON format,
| which, despite its numerous advantages, sometimes forces an
| output even when there shouldn't be one. This frequently
| results in hallucinations. Encouraging NULL returns, for example,
| is a great way to deal with that.
| caesil wrote:
| I've found that this is best dealt with along two axes with
| constrained options. i.e., request both a string and a boolean,
| and if you get boolean false you can simply ignore the string.
| So when the LLM ignores you and prints a string like "This
| article does not contain mention of sharks", you can discard
| that easily.
|
| If you tell it "Return what this says about sharks or nothing
| if it does not mention them", it will mess up.
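|
| Roughly, as a sketch of the parsing side (field names made
| up):
|
|     import json
|
|     raw = ('{"mentions_sharks": false, '
|            '"summary": "No shark content found."}')
|     parsed = json.loads(raw)
|
|     # if the boolean says "no", discard the string entirely
|     summary = (parsed["summary"]
|                if parsed["mentions_sharks"] else None)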
| LawTalkingGuy wrote:
| Have you tried this sort of prompt?
|
| User text: "Blah blah ... Sharks ... Surfing ..."
| Instruction: Return a JSON object containing an array of all
| sentences in the user text which mention sharks directly or
| by implication. Response: {"list_of_shark_related_sentences":
| [
|
| Stop token: ']}'
|
| It'll try to complete the JSON response and it'll try to end
| it by closing the array and object as shown in the stop
| token. This severely limits rambling, and if it does add a
| spurious field it'll (usually) still be valid JSON and you
| can usually just ignore the unwanted field.
|
| wrt OpenAI, text-davinci-003 handles this well, the other
| models not so much.
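|
| Roughly, with the old Completion endpoint (sketch from
| memory, double-check the details):
|
|     import json
|     import openai
|
|     prefix = '{"list_of_shark_related_sentences": ['
|     prompt = ('User text: "Blah blah ... Sharks ... '
|               'Surfing ..."\n'
|               'Instruction: Return a JSON object containing '
|               'an array of all sentences in the user text '
|               'which mention sharks directly or by '
|               'implication.\n'
|               'Response: ' + prefix)
|
|     resp = openai.Completion.create(
|         model="text-davinci-003",
|         prompt=prompt,
|         max_tokens=256,
|         stop=["]}"],  # stop once it closes the array/object
|     )
|
|     # the stop sequence itself is not returned, so add it
|     # back before parsing
|     data = json.loads(prefix
|                       + resp.choices[0].text + "]}")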
| dontupvoteme wrote:
| Making it rank multiple attributes on a scale of 1-10 also
| works decently in my experience. Then one can simply k-means
| cluster (or similar) and evaluate the grouping to see how
| accurate its estimates are.
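|
| Something like this, as a sketch with made-up scores:
|
|     import numpy as np
|     from sklearn.cluster import KMeans
|
|     # rows = messages, columns = the 1-10 attribute scores
|     # the LLM assigned (toy data)
|     scores = np.array([[8, 2, 7], [1, 9, 3],
|                        [7, 3, 6], [2, 8, 2]])
|
|     labels = KMeans(n_clusters=2,
|                     random_state=0).fit_predict(scores)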
| caesil wrote:
| Yes, agreed. I'm doing this as well. Works excellently for
| NLP classifier tasks.
|
| Funnily enough, there is a certain propensity for it to
| output round numbers (50, 100, etc.) so I have to ask it
| not to do this and provide examples ("like 27, 63, or 4").
| Now that I think about it I should probably randomize
| those.
| galleywest200 wrote:
| Have you tried using GPT-4's new function calling feature? The
| "killer" portion of this is guaranteed JSON based on a schema
| you pass to the model.
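|
| The setup looks roughly like this (sketch from memory, not
| exact code):
|
|     import openai
|
|     resp = openai.ChatCompletion.create(
|         model="gpt-4-0613",
|         messages=[{"role": "user",
|                    "content": "I never received my order."}],
|         functions=[{
|             "name": "classify_message",
|             "parameters": {
|                 "type": "object",
|                 "properties": {
|                     "category": {
|                         "type": "string",
|                         "enum": ["billing", "shipping",
|                                  "other"],
|                     },
|                 },
|                 "required": ["category"],
|             },
|         }],
|         # force the model to "call" the classifier function
|         function_call={"name": "classify_message"},
|     )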
| rolisz wrote:
| Nope, it's not guaranteed. They warn you in the OpenAI docs
| that it might hallucinate nonexistent parameters.
| Der_Einzige wrote:
| Constrained generation should not require calling
| supplemental functions. It's as simple as banning or reducing
| the weight of the naughty tokens. There are several libraries
| which enable this without function calling (microsoft
| guidance, jsonformer, lmql)
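|
| e.g. with jsonformer, roughly (from memory, check its
| README):
|
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer)
|     from jsonformer import Jsonformer
|
|     name = "databricks/dolly-v2-3b"
|     model = AutoModelForCausalLM.from_pretrained(name)
|     tokenizer = AutoTokenizer.from_pretrained(name)
|
|     schema = {"type": "object",
|               "properties": {"category": {"type": "string"}}}
|
|     result = Jsonformer(model, tokenizer, schema,
|                         "Classify this support ticket: ...")()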
| hellovai wrote:
| That's a good point! We're actually working on integrating
| this as well, but in practice, what we've found is that LLMs
| in general don't like to respond with empty strings, for
| example.
|
| My hypothesis here is that due to RLHF, there's likely some
| implicit learning that tangentially related content is better
| than no content.
|
| Given that, you'd likely still get better results with your
| schema being:
|
| "string | null" so the LLM can output a null instead of ""
| since there is probably not as much training data that gives
| "" high log prob values.
|
| But we're looking forward to evaluating the functions call,
| and seeing what the metrics show!
| guhidalg wrote:
| I integrated the function calling feature into my personal
| project and wrote a blog post about it here:
|
| https://letscooktime.com/Blog/ai,/machine/learning,/chatgpt
| ,...
|
| Hopefully this saves you some time!
| CallMeMarc wrote:
| Thanks for the post! Really liked that it was short and
| to the point.
|
| I'm also looking to integrate the new function calling
| feature and already got some learnings out of the post
| without even starting to code.
| msp26 wrote:
| The output is not 100% guaranteed. Be careful about that and
| have another layer to check the output.
|
| I had a schema with a string enum property to categorise some
| inputs. One of the category names was "media/other" or
| something to that effect. Sometimes the output would stop at
| just media even though it wasn't a valid option in the
| schema.
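|
| The extra layer can be as simple as this (sketch, category
| names made up):
|
|     VALID = {"media/other", "billing", "shipping"}
|
|     def check_category(value):
|         if value in VALID:
|             return value
|         # repair truncated outputs like "media" when they
|         # prefix exactly one valid category, else give up
|         matches = [v for v in VALID if v.startswith(value)]
|         return matches[0] if len(matches) == 1 else None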
| com2kid wrote:
| I've run into the same issue, but you can turn it into an
| advantage if you are careful enough.
|
| Basically, give the LLM a schema that is loose enough for the
| LLM to expand where it feels expansion is needed. Always saying
| "return a number" is super limiting if the LLM has figured out
| you need a range instead. Saying "always populate this field"
| is silly because sometimes the field doesn't need to be
| populated.
| 19h wrote:
| We're classifying gigabytes of intel (SOCMINT / HUMINT) per
| second and found semantic folding to be better on the
| classification-quality-vs-throughput trade-off than BERT / LLMs.
|
| How it works -- imagine you have these two sentences:
|
| "Acorn is a tree" and "acorn is an app"
|
| You essentially keep a record of all word-to-word relations
| internal to a sentence:
|
| - acorn: is, a, an, app, tree ... etc.
|
| Now you repeat this for a few gigabytes of text. You'll end up
| with a huge map of "word connections".
|
| You now take the top X words that other words connect to (e.g.
| 16384). Then you create a vector of 16384 positions for each
| word, encoded as 1,0,1,0,1,0,0,0, ... (position 1 is the most
| connected-to word, position 2 the second, etc.; a 1 indicates
| "is connected" and a 0 indicates "no such connection").
|
| You'll end up with a vector that has a lot of zeroes -- you can
| now sparsify it (i.e. store only the positions of the ones).
|
| You essentially have fingerprints now -- what you can do now is
| to generate fingerprints of entire sentences, paragraphs and
| texts. Remove the fingerprints of the most common words like
| "is", "in", "a", "the" etc. and you'll have a "semantic
| fingerprint". Now if you take a lot of example texts and generate
| fingerprints from them, you can end up with a very small number of
| "indices", maybe 10 numbers, that are enough to very reliably
| identify texts of a specific topic.
|
| Sorry, couldn't be too specific as I'm on the go - if you're
| interested drop me a mail.
|
| We're using this to categorize literally tens of gigabytes per
| second with 92% precision into more than 72 categories.
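|
| A very rough sketch of just the word-level step, to give the
| idea (untested, not what we actually run):
|
|     from collections import Counter, defaultdict
|
|     STOP = {"is", "in", "a", "an", "the"}
|
|     def fingerprints(sentences, dims=16384):
|         # word -> counts of the words it co-occurs with
|         links = defaultdict(Counter)
|         for s in sentences:
|             ws = [w for w in s.lower().split()
|                   if w not in STOP]
|             for w in ws:
|                 links[w].update(x for x in ws if x != w)
|
|         # the top-N most connected-to words define the
|         # vector positions
|         pop = Counter()
|         for c in links.values():
|             pop.update(c)
|         pos = {w: i for i, (w, _)
|                in enumerate(pop.most_common(dims))}
|
|         # sparse binary fingerprint: positions of the 1s
|         return {w: sorted(pos[x] for x in c if x in pos)
|                 for w, c in links.items()}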
| LewisDavidson wrote:
| Do you have any code that demonstrates this? Sounds super
| interesting!
| wavemode wrote:
| I'd be curious how the output of your approach compares to
| merely classifying based on what keywords are contained in the
| text (given that AFAICT you're simply categorizing rather than
| trying to extract precise meaning).
| spyckie2 wrote:
| Just asking: isn't this very similar to the attention
| mechanism that powers LLMs?
| r_singh wrote:
| I have been using LLMs for ABSA, text classification and even
| labelling clusters (something that had to be done manually
| earlier on) and I couldn't be happier.
|
| It was turning out to be expensive earlier, but optimising the
| prompt a lot, reduced pricing by OpenAI, and now also being
| able to run Guanaco 13B/33B locally have made it much more
| affordable for millions of pieces of text.
| hellovai wrote:
| That's very interesting! What sort of direction did you head in
| with prompt optimization? Was it mostly in shrinking it and
| then using multi-shot examples? We found that shorter prompts
| (empirically) perform better than longer prompts.
| [deleted]
| rckrd wrote:
| I just released a zero-shot classification API built on LLMs
| https://github.com/thiggle/api. It always returns structured JSON
| and only the relevant categories/classes out of the ones you
| provide.
|
| LLMs are excellent reasoning engines. But nudging them to the
| desired output is challenging. They might return categories
| outside the ones that you determined. They might return multiple
| categories when you only want one (or the opposite -- a single
| category when you want multiple). Even if you steer the AI toward
| the correct answer, parsing the output can be difficult. Asking
| the LLM to output structured data works 80% of the time. But the
| 20% of the time where parsing the response fails takes up
| 99% of your time and is unacceptable for most real-world use
| cases.
|
| [0] https://twitter.com/mattrickard/status/1678603390337822722
| i-am-agi wrote:
| Woohoo, this is amazing! I have been using the Autolabel
| (https://news.ycombinator.com/item?id=36409201) library so far
| for labeling a few classification and question answering datasets
| and have been seeing some great performance. Would be interested
| in giving gloo a shot as well to see if it helps performance
| further. Thanks for sharing this :)
| crazygringo wrote:
| This is really interesting.
|
| I'm really wondering when LLM's are going to replace humans for
| ~all first-pass social media and forum moderation.
|
| Obviously humans will always be involved in coming up with
| moderation policy and judging gray areas and refining moderation
| policy... but at what point will LLM's do everything else more
| reliably than humans?
|
| 6 months from now? 3 years from now?
| adam_arthur wrote:
| They are already sufficient for high-level classification...
| it's just a question of cost.
|
| It's getting tiring reading all the LLM takes from people here
| who clearly don't use or understand them at all. So many still
| stuck in the "predicting next token" nonsense, as if humans
| don't do that too
| maaanu wrote:
| You are seriously telling me that humans predict word by
| word when they speak?
| adam_arthur wrote:
| A system that "predicts the next token" in such a way that
| it is indistinguishable from a human, is just like a human
| in practice yes.
|
| How does a human decide which word to use in your mind?
| Magic?
|
| No, it's a logically based biological/neurological process
| through which at the end of it, you've decided on a word.
| They are both forms of computing that can produce largely
| indistinguishable output... doesn't matter that one is
| biological and the other isn't
| RhodesianHunter wrote:
| It's quite possible a human forms a full representation
| of a coherent thought before translating that thought
| into words, meaning no token by token prediction.
| doctor_eval wrote:
| It's also possible that translating a coherent thought
| into words is done on a token by token basis.
|
| I'm surely not the only one who sometimes can't "find the
| right word" in the middle of a sentence when trying to
| describe a thought or idea.
| adam_arthur wrote:
| And digital audio will never match analog. Can anybody
| tell the difference anymore?
|
| If a machine produces highly similar output does it
| matter? LLM's clearly exhibit behavior of having
| implicitly learned systems, which is the valuable part of
| intelligence. Humans infer systems through text too, by
| the way.
| stevenhuang wrote:
| Actually yes, architecturally that's the essence of
| predictive coding.
|
| It's among the leading theories in neuroscience for how our
| brains work https://en.wikipedia.org/wiki/Predictive_coding
| woeirua wrote:
| As LLM prices come down social media is going to be absolutely
| inundated with bots that are indistinguishable from humans. I
| can't see a world where forums or social media are _useful_ for
| anything in 10 years unless there are strict gate keepers (e.g.
| you have to receive a code in person or access is tied directly
| to your physical identity to access the site).
| doliveira wrote:
| Ironically for crypto bros, I think the way forward will be
| to codify the real-world trust structures into the digital
| world. The future is trustful.
|
| I just really hope we find a way to codify it without
| scanning people's eyeballs into the blockchain like the guy
| in charge of the world's first AGI wants to do.
| pradn wrote:
| Isn't there a limit to this when one requires an account to
| be tied to a phone number? Perhaps pseudonymous posting is on
| a countdown clock.
| rcarr wrote:
| > or access is tied directly to your physical identity to
| access the site).
|
| This is what is inevitably going to happen. There will be
| some kind of service provider (probably one of Apple, Google,
| Microsoft, Amazon) who will verify who you are via official
| documents such as passport and driving license. When you sign
| up to a smaller company's service they'll check with the
| providers to see if you're a genuine person and if so then
| they'll let you join, if not you'll be blocked. You might be
| able to use the forum with an anonymous name but the company
| will always know who you are and if you use your account to
| spam or abuse people you'll get blacklisted and reported to
| the police. Any service that doesn't implement the model will
| be an unusable hell hole of bots and spam.
|
| The internet will splinter into two and you'll have the
| "verified net" and the "unverified net" with the latter
| basically becoming a second dark web. To be honest, I think
| this will probably be a good thing. I think the vast majority
| of people will spend most of their time on the verified net,
| which will actually be a more pleasant place to be because
| people won't be able to get away with what they can now
| without real consequences in physical reality.
|
| That being said there are plenty of ways it could go wrong -
| if those accounts get hacked and the owner of the account
| can't prove it then we could see innocent people going to
| jail. Or state actors could hack the accounts of citizens
| they see as problematic and frame them. But all that stuff
| could happen today anyway - the verification or lack thereof
| doesn't make that much difference but does substantially
| reduce the use of bots.
| withinboredom wrote:
| Did you just describe AOL??
| JimtheCoder wrote:
| I have been thinking the same sort of thing as well over
| the last while.
|
| I was thinking more of a browser level plugin, in which
| content from unverified users would be blurred out with an
| "unverified user - click to view content" type of system.
| Everything you post will be connected to your identity, so
| you would be liable for deepfakes and the like. You would
| also have an activity rating connected to your identity, so
| other people could see if you are posting 1 piece of
| content per hour, or 1000.
|
| Maybe a personal media manager connected to the browser so
| all of the public content that is "signed" by you will be
| easily viewable by you, and if someone posts something that
| is not actually yours under your identity somehow, you will
| easily be able to rescind the signature.
|
| Just random shower thoughts...
| Enginerrrd wrote:
| Yeah the ability to astroturf (at massive scale!) product
| reviews or political opinions as comments in reddit posts and
| the like will be sort of horrifying. The dead Internet
| hypothesis may yet come true.
| JustBreath wrote:
| The worst part is social media networks aren't necessarily
| against AI/bot engagement since it greatly fluffs their
| numbers and keeps their users occupied.
|
| It seems inevitable that some sort of signature or identity
| proof will be necessary soon to participate in most online
| forums.
|
| Either esoteric networking between people or straight up
| government/private entity issued multi factor authentication.
| zht wrote:
| this is some black mirror stuff
|
| imagine Google's general approach to customer
| service/moderation, but applied all over the place by companies
| small and large
|
| I shudder at the thought
| ghaff wrote:
| Especially with fairly systemic labor shortages, it seems
| inevitable that we'll see more and more self-service and
| automation with the corollary that getting an actual human
| involved will become more difficult.
| Xenoamorphous wrote:
| I've found that it's pretty much impossible to talk to a
| person in most customer services in the past few years, it's
| always a "robot". And this has been going since well before
| LLMs.
| HWR_14 wrote:
| I find it easy to talk to a person. A US based person, no.
| A person who can resolve my problem, maybe. But a person,
| yes.
| hypothesis wrote:
| Yup, most companies are now simply providing "emotional"
| support using sufficiently apologetic, but powerless
| agents. That is, if you can get past the chatbot screen or
| IVR maze. The hope is for the user to give up.
| crazygringo wrote:
| Or it could be precisely the opposite -- LLM's take care of
| all of the easy customer service/moderation, so that it's
| actually affordable for Google and others to hire high-
| quality customer reps to manage the hard/urgent stuff that
| LLM's surface.
|
| I don't know, but generally speaking with technological
| progress, while we lose some things we gain more things. It's
| important to think not just what technology gets rid of, but
| what it enables.
| janalsncm wrote:
| Back of the envelope calculation says it could be possible now.
|
| Twitter gets about 500M tweets per day, average tweet is 28
| characters. So that's 14B characters per day. Converting to
| tokens at around 4 char/token that's around 3.5B tokens per
| day. If GPT 3.5 turbo pricing is representative it will cost
| about $0.0015/thousand tokens which is $5k per day. So it's
| possible now.
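|
| In code:
|
|     tweets_per_day = 500e6
|     chars_per_tweet = 28
|     tokens_per_day = tweets_per_day * chars_per_tweet / 4
|     # gpt-3.5-turbo-ish pricing, dollars per 1k tokens
|     cost_per_day = tokens_per_day / 1000 * 0.0015
|     print(tokens_per_day, cost_per_day)  # 3.5e9, ~5250 USD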
|
| However, you can probably get that cost down a lot with your
| own models, which also has the benefit of not being at the
| mercy of arbitrary API pricing.
| ghaff wrote:
| "Obviously" isn't really that obvious to me. We've seen plenty
| of companies willing to pass a huge amount of work to
| automation and if you have the misfortune to be an edge case
| the automation can't handle, said companies are often perfectly
| happy to let you fall through the cracks. Cheap and good enough
| often trump costlier and good.
| moffkalast wrote:
| The Google motto of 'good enough for most and screw the edge
| cases'.
| lcnPylGDnU4H9OF wrote:
| I agree that companies will do whatever they can to cut
| costs, including anything which can remove the necessity of a
| human worker via automation. But I think this comment is not
| responding to what was written (maybe there was an edit I
| missed):
|
| > Obviously humans will always be involved in coming up with
| moderation policy
|
| This comment seems to respond to:
|
| > Obviously humans will always be involved in [moderation]
|
| The first statement seems to hold true -- at least, it's a
| more "obvious" conclusion. What scenario is required to fully
| remove humans from coming up with moderation _policy_? These
| companies who are so eager to automate certain tasks will
| likely still be staffed by humans who would make the decision
| to automate certain tasks.
| ghaff wrote:
| Fair enough. I suppose it's partly a question of what level
| of policy we are talking about. I guess at some level,
| humans have to set an umbrella policy. The question gets
| into how granularly humans get into enforcing policy
| specifics below that.
| alexmolas wrote:
| Where's the comparison with traditional ML? In the article I only
| see the good things about using LLMs, but there's no mention of
| traditional ML besides the title.
|
| It would be nice to see how this "complex" approach compares
| against a "simple" TF-IDF + RF or SVM.
| specproc wrote:
| Yeah, my thoughts exactly. If you're running 500k in tokens
| through someone else's hallucination-prone computer and
| paying for the privilege, I want to know why that's any better
| than something like SetFit.
|
| All I saw were attempts to reproduce some chatgpt output.
| espe wrote:
| +1 for setfit. a baseline that's hard to beat.
| hellovai wrote:
| SetFit is fairly good, and we do help train SetFit-like
| models for the results you get. However, the issue with
| SetFit is that its latency and cost benefits come at the
| price of flexibility.
|
| If you want to add a new class or update an existing one, it
| requires training a new model. Sometimes this is ok and
| sometimes it's not. This is why we generally prefer a hybrid
| approach where some classes are using traditional models
| (BERT based) while others are determined by the LLM.
| specproc wrote:
| I guess use case is everything. There are numerous reasons,
| not least of which confidentiality, why chatgpt is a no go
| for me.
|
| What I'd like to see more of is systematic comparison
| between chatgpt and classic models. I was hoping to see a
| bit of this in this article and was disappointed.
| hellovai wrote:
| I appreciate the feedback, and also agree that chatgpt is
| a no-go for many use cases.
|
| We're working on putting together a better comparison
| specifically along the lines of accuracy between the LLMs
| (chatgpt, bard, falcon) and also traditional models. Hope
| that one hits the spot for you! Are there specific
| metrics you think might be interesting? We were primarily
| looking at f1/accuracy for this task, but also attempting
| to see what types of classes they work well in using
| semantic similarity.
| og_kalu wrote:
| https://news.ycombinator.com/item?id=36685921
| hellovai wrote:
| Thanks Alex, in this article we focused more on deployment
| comparisons, for example the cost and latency of what it would
| take to deploy a BERT based model vs LLMs.
|
| In a future article, we're planning on posting accuracy
| comparisons as well, but here we want to evaluate a few other
| architectures for comparison. For example, at 1TPS with 1k
| tokens, chat-gpt-turbo would cost almost $5k vs a simpler BERT
| model you could run for under $50.
|
| This is probably very obvious to some people, but a lot of
| people's first experience with any sort of AI is often an LLM,
| so this is just the first of many posts we hope to share.
| og_kalu wrote:
| Current State of the art (GPT-4) is mostly on par with experts
| and much better than crowdworkers. Might be overkill though.
|
| https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...
|
| 3.5 (what is used here) is better than crowd workers
| https://arxiv.org/abs/2303.15056
| lisasays wrote:
| On par with "experts", no.
|
| Per the article: "outperformed the most skilled crowdworkers"
| on nuanced (but not highly technical) tasks like sentiment
| labeling.
|
| By definition, it can't outperform the expert ensemble
| because that's where the gold labels come from.
| og_kalu wrote:
| It's as good or better than experts on 7/18 of those
| benchmarks. On an additional 4, it's close (within 0.05).
|
| >By definition, it can't outperform the expert ensemble
| because that's where the gold labels come from.
|
| The ensemble no but it can outperform an expert trying to
| solve it. But yes the benchmarks are biased to the experts
| here.
| lisasays wrote:
| Thanks, I confess to skimming and missed the individual
| 'expert' column (presumably the mean of the individual
| expert scores).
|
| That said -- it looks like not only does model+ do worse
| than experts on the other 12/18 (not 11/18 by my
| counting), but when it does, it does so by a
| significantly wider margin (2x-3x on average). For
| example, the maximum model+ outperformance is on label
| 'Stealing' (0.11) while there are 6 labels for which the
| expert outperforms (by margins ranging from 0.12 to
| 0.29).
|
| In other words: distinctly sub-par compared with the
| average expert. Which is probably why they didn't claim
| it as a result in the paper :)
| og_kalu wrote:
| It's distinctly sub-par on 6 of 18 benchmarks while close
| or better in 12. That's why I said "mostly on par"
| lisasays wrote:
| Seems you got it reversed - it's the expert which
| performs sub-par on 6 rounds, while doing better on 12.
|
| While missing also the part about when it outperforms, it
| does so "at a significantly wider margin (2x-3x)". Which
| is why, no, it's not "mostly on par".
|
| Just look at the data.
| viraptor wrote:
| Or even slightly fancy Word2vec/USE or even sentence
| transformers with clustering that you can trivially run locally
| rather than a full blown conversational LLM. I'd love to see a
| large scale comparison.
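|
| Something like this runs fine on a laptop (sketch):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.cluster import KMeans
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     emb = model.encode(["refund please",
|                         "love the new update",
|                         "my card was charged twice"])
|     clusters = KMeans(n_clusters=2,
|                       random_state=0).fit_predict(emb)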
| jonathankoren wrote:
| Yeah, I also find the lack of the comparison suspicious. As is
| the talk about "hallucinated class labels" being "helpful".
|
| If I had to take a guess, I suspect the LLM might perform a
| touch better, but we're talking a fractional percent better. Which
| is fine if you have the volume, but a wash otherwise.
| og_kalu wrote:
| https://news.ycombinator.com/item?id=36685921
| wilg wrote:
| Classic HN website nitpick: Logo should link to home page. In
| this case it _is_ a link but just goes to the current page.
| However, points for being able to easily get to the main product
| page from the blog, usually that's buried.
| hellovai wrote:
| oh! Good catch! Fixed this, and will update in the release.
| nestorD wrote:
| LLMs are significantly slower than traditional ML, typically
| costlier and, I have been told, tend to be less accurate than a
| traditional model trained on a large dataset.
|
| But, they are zero/few shot classifiers. Meaning that you can get
| your classification running and reasonably accurate _now_ ,
| collect data and switch to a fine-tuned very efficient
| traditional ML model later.
| hellovai wrote:
| That's a great summary and insight. We should likely use that
| verbiage to help make it more crystal clear :)
| godelski wrote:
| > LLMs are significantly slower than traditional ML, typically
| costlier
|
| Literally point 3 in the article.
|
| > But, they are zero/few shot classifiers
|
| This is __NOT__ true. Zero-shot means out of domain, and if
| we're talking about text trained LLMs, there really isn't
| anything text that is out of domain for them because they are
| trained on almost anything you can find on the internet. This
| is not akin to training something on Tiny Shakespeare and then
| having it perform sentiment analysis (classification) on Sci-Fi
| novels. Similarly, training a model on JFT or LAION does not
| give you the ability to perform zero shot classification on
| datasets like COCO or ImageNet, since the same semantic data
| exists in both datasets. I don't know why people started using
| this term to describe the domain adaptation or transfer
| learning, but it is not okay. Zero-shot requires novel classes,
| and subsets are not novel.
| PartiallyTyped wrote:
| > Zero-shot means out of domain, and if we're talking about
| text trained LLMs, there really isn't anything text that is
| out of domain for them because they are trained on almost
| anything you can find on the internet.
|
| Respectfully, i disagree. I have used LLMs on actually novel
| tasks for which there aren't any datasets out there. They
| "get it".
|
| > I don't know why people started using this term to describe
| the domain adaptation or transfer learning, but it is not
| okay. Zero-shot requires novel classes, and subsets are not
| novel.
|
| Respectfully, i disagree.
|
| Zero-shot is perfectly valid because there is no
| backpropagation or weight change involved. Causal LLMs are
| meta-learners due to the attention mechanism and the
| autoregressive nature of the model. These two change the
| _effective_ weight of the matrices.
|
| For all sequences of inputs and all possible weights, there
| exists an instantiation of a neural network without attention
| that produces identical vectors for the current token given
| only the previous token.
|
| Do the math, or read the paper "LLMs are meta learners".
|
| Therefore, for all tasks, giving the model examples of inputs
| changes its effective weights without actually modifying it,
| it is perfectly valid for "zero shot learning" because you
| didn't do backprop of any kind, you merely did input
| transformations / preprocessing.
| godelski wrote:
| > I have used LLMs on actually novel tasks for which there
| aren't any datasets out there. They "get it".
|
| Can you give an example so that we may better discuss or
| that I can adequately update my understandings? But I will
| say that simplifying this down to "just trained to predict
| the next token" is not accurate as it does not account for
| the differences in architectures and cost functions which
| dramatically affect this statement due to the differences
| in their biases. As a clear example, training an image
| model on likelihood does not guarantee that the model will
| produce high fidelity samples[0]. But it will be better at
| imputation or classification. Some other helpful
| references[1,2]
|
| > Zero-shot is perfectly valid because there is no
| backpropagation or weight change involved.
|
| I disagree with this. What you have described is still
| within the broader class of fine tuning. Note that zero-
| shot is also tuning. I can make this perfectly clear with a
| simple example that is directly related to my previous
| argument. ``Suppose we train a model on the CIFAR-10
| dataset. Then we "zero-shot" evaluate it on CIFAR-5, where
| we've just removed 5 random classes.`` I think you'll agree
| that it should be unsurprising that the model performs well
| on this second task. This is exactly the "Train on LAION
| then 'zero-shot' classification on ImageNet" task we
| commonly see. Subsets are not a clear task change.
|
| > These two change the effective weight of the matrices.
|
| I'm having a difficult time understanding your argument as
| this directly contradicts your first sentence. I wouldn't
| even make the lack of weight change a requirement for zero-
| shot learning as the intent is really that we do not need
| to directly change. If a model has enough general knowledge
| and we do not need to modify the parameters explicitly
| through providing more training (i.e. using a cost function
| and {back,forward}prop), then this is sufficient (randomly
| changing parameters, adding non-trainable parameters like
| activations, or pruning is also acceptable. As well as
| explicitly what you mentioned). The point comes down to
| requiring no additional training for __additional
| domain{,s}__. The training part is not the important part
| here and not what is in question.
|
| My point is explicitly about claiming that subdomains do
| not constitute zero-shot learning. If you disagree in what
| I have claimed are subdomains, then that's a different
| argument. I'm not arguing against the latter points because
| that's also not arguing against what I claimed. But I will
| say that "just because you didn't use backprop doesn't mean
| it isn't zero-shot" and if you disagree, then note that you
| have to claim that the CIFAR-5 example is "zero-shot."
|
| Tldr: A -> B doesn't require that B -> A
|
| [0] A note on the evaluation of generative models:
| https://arxiv.org/abs/1511.01844 (link for also obtaining
| slides and code: http://theis.io/publications/17/)
|
| Also worth looking at many of the works that cite this one:
| https://www.semanticscholar.org/paper/A-note-on-the-
| evaluati...
|
| [1a] Assessing Generative Models via Precision and Recall:
| https://arxiv.org/abs/1806.00035
|
| [1b] Improved Precision and Recall Metric for Assessing
| Generative Models: https://arxiv.org/abs/1904.06991
|
| [2] The Role of ImageNet Classes in Frechet Inception
| Distance: https://arxiv.org/abs/2203.06026
| radarsat1 wrote:
| It comes from this paper [0], and I believe the idea is that
| the LLM was not trained on the task in question, but is able
| to do it with only instructions (zero shot) or with one or a
| few examples (few shot). The paper rightfully points out the
| unexpected fact that the model is only trained to predict the
| next word and yet can follow arbitrary instructions and
| perform tasks that it was not explicitly trained to do.
|
| [0]: https://arxiv.org/abs/2109.01652
| og_kalu wrote:
| Current State of the art (GPT-4) is not going to be less
| accurate than whatever bespoke option you can cook up.
|
| https://news.ycombinator.com/item?id=36685921
| golergka wrote:
| It will be much slower and costlier.
| mplewis wrote:
| This is absolutely untrue.
| og_kalu wrote:
| Feel free to show otherwise
| withinboredom wrote:
| I wouldn't be so sure of that.
| caycep wrote:
| what's "traditional ML"?
___________________________________________________________________
(page generated 2023-07-11 23:00 UTC)