[HN Gopher] Classifying customer messages with LLMs vs tradition...
       ___________________________________________________________________
        
       Classifying customer messages with LLMs vs traditional ML
        
       Author : hellovai
       Score  : 180 points
       Date   : 2023-07-11 14:51 UTC (8 hours ago)
        
 (HTM) web link (www.trygloo.com)
 (TXT) w3m dump (www.trygloo.com)
        
       | rossirpaulo wrote:
       | This is great! We had a similar thought and couldn't agree more
       | with "LLMs prefer producing something rather than nothing." We
        | have been consistently requesting responses in JSON format,
        | which, despite its numerous advantages, sometimes forces an
        | output even when there shouldn't be one. This frequently results
        | in hallucinations. Encouraging NULL returns, for example, is a
        | great way to deal with that.
        
         | caesil wrote:
         | I've found that this is best dealt with along two axes with
         | constrained options. i.e., request both a string and a boolean,
         | and if you get boolean false you can simply ignore the string.
         | So when the LLM ignores you and prints a string like "This
         | article does not contain mention of sharks", you can discard
         | that easily.
         | 
         | If you tell it "Return what this says about sharks or nothing
         | if it does not mention them", it will mess up.
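          | 
          | For the two-axis approach, the parsing side can be a minimal
          | sketch like this (field names are just illustrative):
          | 
          |     import json
          |     from typing import Optional
          | 
          |     def extract_shark_summary(llm_json: str) -> Optional[str]:
          |         # Trust the boolean, not the string.
          |         data = json.loads(llm_json)
          |         if not data.get("mentions_sharks", False):
          |             return None  # discard whatever the model wrote
          |         return data["shark_summary"]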
        
           | LawTalkingGuy wrote:
           | Have you tried this sort of prompt?
           | 
           | User text: "Blah blah ... Sharks ... Surfing ..."
            | Instruction: Return a JSON object containing an array of all
           | sentences in the user text which mention sharks directly or
           | by implication. Response: {"list_of_shark_related_sentences":
           | [
           | 
           | Stop token: ']}'
           | 
           | It'll try to complete the JSON response and it'll try to end
           | it by closing the array and object as shown in the stop
           | token. This severely limits rambling, and if it does add a
           | spurious field it'll (usually) still be valid JSON and you
           | can usually just ignore the unwanted field.
           | 
           | wrt OpenAI, text-davinci-003 handles this well, the other
           | models not so much.
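            | 
            | Roughly, with the 2023-era openai Python client (v0.x) and
            | the completions endpoint, a sketch of that looks like the
            | following (prompt abbreviated; the post-processing is just
            | one guess at how to reassemble the JSON):
            | 
            |     import openai
            | 
            |     json_prefix = '{"list_of_shark_related_sentences": ['
            |     prompt = (
            |         'User text: "Blah blah ... Sharks ... Surfing ..."\n'
            |         "Instruction: Return a JSON object containing an"
            |         " array of all sentences in the user text which"
            |         " mention sharks directly or by implication.\n"
            |         "Response: " + json_prefix
            |     )
            | 
            |     resp = openai.Completion.create(
            |         model="text-davinci-003",
            |         prompt=prompt,
            |         max_tokens=256,
            |         stop=["]}"],  # make it close the array/object and stop
            |     )
            |     # the stop sequence is not returned, so re-append it
            |     raw_json = json_prefix + resp.choices[0].text + "]}"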
        
           | dontupvoteme wrote:
            | Making it rank multiple attributes on a scale of 1-10 also
            | works decently in my experience. Then one can simply k-means
            | cluster (or similar) and evaluate the grouping to see how
            | accurate its estimations are.
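            | 
            | For instance, once the per-attribute scores come back from
            | the model, the clustering side is only a few lines (the
            | scores and attribute names below are made up):
            | 
            |     import numpy as np
            |     from sklearn.cluster import KMeans
            | 
            |     # one row per message: LLM-assigned 1-10 scores, e.g.
            |     # [urgency, frustration, technical_detail]
            |     scores = np.array([[9, 8, 2], [2, 1, 7], [8, 9, 3]])
            | 
            |     km = KMeans(n_clusters=2, n_init=10, random_state=0)
            |     labels = km.fit_predict(scores)
            |     print(labels)  # eyeball the groupings vs. your own read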
        
             | caesil wrote:
              | Yes, agreed. I'm doing this as well. It works excellently
              | for NLP classifier tasks.
             | 
             | Funnily enough, there is a certain propensity for it to
             | output round numbers (50, 100, etc.) so I have to ask it
             | not to do this and provide examples ("like 27, 63, or 4").
             | Now that I think about it I should probably randomize
             | those.
        
         | galleywest200 wrote:
          | Have you tried using GPT-4's new function-calling feature? The
          | "killer" part of this is guaranteed JSON based on a schema
          | you pass to the model.
        
           | rolisz wrote:
           | Nope, it's not guaranteed. They warn you in the OpenAI docs
            | that it might hallucinate nonexistent parameters.
        
           | Der_Einzige wrote:
            | Constrained generation should not require calling
            | supplemental functions. It's as simple as banning or reducing
            | the weight of the naughty tokens. There are several libraries
            | which enable this without function calling (microsoft
            | guidance, jsonformer, lmql).
        
           | hellovai wrote:
            | That's a good point! We're actually working on integrating
            | this as well, but in practice, what we've found is that LLMs
            | in general don't like to respond with empty strings, for
            | example.
           | 
            | My hypothesis here is that due to RLHF, there's likely some
           | implicit learning that tangentially related content is better
           | than no content.
           | 
           | Given that, you'd likely still get better results with your
           | schema being:
           | 
           | "string | null" so the LLM can output a null instead of ""
           | since there is probably not as much training data that gives
           | "" high log prob values.
           | 
            | But we're looking forward to evaluating the function-calling
            | feature and seeing what the metrics show!
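            | 
            | In JSON Schema terms (which is what the function-calling API
            | takes), "string | null" is just a nullable type; the names
            | below are only illustrative, and whether a given model
            | respects it is another question:
            | 
            |     schema = {
            |         "name": "record_shark_summary",
            |         "parameters": {
            |             "type": "object",
            |             "properties": {
            |                 "shark_summary": {"type": ["string", "null"]},
            |             },
            |             "required": ["shark_summary"],
            |         },
            |     }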
        
             | guhidalg wrote:
             | I integrated the function calling feature into my personal
             | project and wrote a blog post about it here:
             | 
             | https://letscooktime.com/Blog/ai,/machine/learning,/chatgpt
             | ,...
             | 
             | Hopefully this saves you some time!
        
               | CallMeMarc wrote:
                | Thanks for the post! Really liked that it was short and
                | to the point.
                | 
                | I'm also looking to integrate the new function-calling
                | feature, and I already got some learnings out of the post
                | without even starting to code.
        
           | msp26 wrote:
           | The output is not 100% guaranteed. Be careful about that and
           | have another layer to check the output.
           | 
           | I had a schema with a string enum property to categorise some
           | inputs. One of the category names was "media/other" or
            | something to that effect. Sometimes the output would stop at
            | just "media", even though that wasn't a valid option in the
            | schema.
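            | 
            | The checking layer can be as dumb as validating against the
            | allowed set and retrying or escalating (category names here
            | are placeholders):
            | 
            |     VALID = {"billing", "bug_report", "media/other"}
            | 
            |     def check_category(raw: str) -> str:
            |         value = raw.strip()
            |         if value in VALID:
            |             return value
            |         # e.g. the model returned just "media": retry,
            |         # fuzzy-match, or route to a human instead of
            |         # silently accepting it
            |         raise ValueError(f"invalid category: {value!r}")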
        
         | com2kid wrote:
         | I've run into the same issue, but you can turn it into an
         | advantage if you are careful enough.
         | 
          | Basically, give the LLM a schema that is loose enough for the
          | LLM to expand where it feels expansion is needed. Saying
          | "always return a number" is super limiting if the LLM has
          | figured out you need a range instead. Saying "always populate
          | this field" is silly because sometimes the field doesn't need
          | to be populated.
        
       | 19h wrote:
        | We're classifying gigabytes of intel (SOCMINT / HUMINT) per
        | second and found semantic folding to be as good or better in
        | classification quality vs. throughput than BERT / LLMs.
       | 
        | How it works -- imagine you have these two sentences:
       | 
       | "Acorn is a tree" and "acorn is an app"
       | 
        | You essentially keep a record of all word-to-word relations
        | within a sentence:
        | 
        | - acorn: is, a, an, app, tree
        | - etc.
       | 
       | Now you repeat this for a few gigabytes of text. You'll end up
       | with a huge map of "word connections".
       | 
        | You now take the top X words that other words connect to (e.g.
        | 16384). Then you create a vector with one position per top word,
        | encoded as 1,0,1,0,1,0,0,0, ... -- the first position is the most
        | connected-to word, the second the next, and so on; 1 means "is
        | connected" and 0 means "no such connection".
       | 
       | You'll end up with a vector that has a lot of zeroes -- you can
        | now sparsify it (i.e. store only the positions of the ones).
       | 
        | You essentially have fingerprints now -- you can generate
        | fingerprints of entire sentences, paragraphs and texts. Remove
        | the fingerprints of the most common words like "is", "in", "a",
        | "the" etc. and you'll have a "semantic fingerprint". Now if you
        | take a lot of example texts and generate fingerprints from them,
        | you can end up with a very small number of "indices" -- maybe 10
        | numbers -- that are enough to very reliably identify texts on a
        | specific topic.
       | 
       | Sorry, couldn't be too specific as I'm on the go - if you're
       | interested drop me a mail.
       | 
       | We're using this to categorize literally tens of gigabytes per
       | second with 92% precision into more than 72 categories.
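        | 
        | A toy sketch of the fingerprinting step as described (vastly
        | simplified -- no ranking tricks, no real sparsification beyond
        | Python sets, no tuning):
        | 
        |     from collections import defaultdict
        |     from itertools import permutations
        | 
        |     def build_connections(sentences):
        |         # map each word to the words it co-occurs with
        |         conn = defaultdict(set)
        |         for s in sentences:
        |             for a, b in permutations(set(s.lower().split()), 2):
        |                 conn[a].add(b)
        |         return conn
        | 
        |     def text_fingerprint(text, conn, top_words, stopwords=()):
        |         # union of per-word fingerprints: indices of the top
        |         # words each (non-stop) word connects to
        |         fp = set()
        |         for w in set(text.lower().split()) - set(stopwords):
        |             fp |= {i for i, t in enumerate(top_words)
        |                    if t in conn.get(w, ())}
        |         return fp
        | 
        |     corpus = ["acorn is a tree", "acorn is an app"]
        |     conn = build_connections(corpus)
        |     # keep the N most connected-to words (16384 in practice)
        |     top_words = sorted(conn, key=lambda w: len(conn[w]),
        |                        reverse=True)[:16]
        |     print(text_fingerprint("the acorn app", conn, top_words,
        |                            stopwords={"is", "a", "an", "the"}))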
        
         | LewisDavidson wrote:
         | Do you have any code that demonstrates this? Sounds super
         | interesting!
        
         | wavemode wrote:
         | I'd be curious how the output of your approach compares to
         | merely classifying based on what keywords are contained in the
         | text (given that AFAICT you're simply categorizing rather than
         | trying to extract precise meaning).
        
         | spyckie2 wrote:
          | Just asking: this seems very similar to the attention
          | mechanism that powers LLMs?
        
       | r_singh wrote:
       | I have been using LLMs for ABSA, text classification and even
       | labelling clusters (something that had to be done manually
       | earlier on) and I couldn't be happier.
       | 
        | It was turning out to be expensive earlier, but heavy prompt
        | optimisation, reduced pricing from OpenAI, and now being able to
        | run Guanaco 13B/33B locally have made it much more accessible in
        | terms of pricing for millions of pieces of text.
        
         | hellovai wrote:
         | That's very interesting! What sort of direction did you head in
         | with prompt optimization? Was it mostly in shrinking it and
         | then using multi-shot examples? We found that shorter prompts
         | (empirically) perform better than longer prompts.
        
       | [deleted]
        
       | rckrd wrote:
       | I just released a zero-shot classification API built on LLMs
       | https://github.com/thiggle/api. It always returns structured JSON
       | and only the relevant categories/classes out of the ones you
       | provide.
       | 
       | LLMs are excellent reasoning engines. But nudging them to the
       | desired output is challenging. They might return categories
       | outside the ones that you determined. They might return multiple
       | categories when you only want one (or the opposite -- a single
       | category when you want multiple). Even if you steer the AI toward
       | the correct answer, parsing the output can be difficult. Asking
        | the LLM to output structured data works 80% of the time. But the
        | 20% of the time where parsing the response fails takes up 99% of
        | your time and is unacceptable for most real-world use
        | cases.
       | 
       | [0] https://twitter.com/mattrickard/status/1678603390337822722
        
       | i-am-agi wrote:
        | Woohoo, this is amazing! I have been using the Autolabel
       | (https://news.ycombinator.com/item?id=36409201) library so far
       | for labeling a few classification and question answering datasets
       | and have been seeing some great performance. Would be interested
       | in giving gloo a shot as well to see if it helps performance
       | further. Thanks for sharing this :)
        
       | crazygringo wrote:
       | This is really interesting.
       | 
        | I'm really wondering when LLMs are going to replace humans for
        | ~all first-pass social media and forum moderation.
        | 
        | Obviously humans will always be involved in coming up with
        | moderation policy, judging gray areas and refining that
        | policy... but at what point will LLMs do everything else more
        | reliably than humans?
       | 
       | 6 months from now? 3 years from now?
        
         | adam_arthur wrote:
          | They are already sufficient for high-level classification...
          | it's just a question of cost.
          | 
          | It's getting tiring reading all the LLM takes from people here
          | who clearly don't use or understand them at all. So many are
          | still stuck in the "predicting the next token" nonsense, as if
          | humans don't do that too.
        
           | maaanu wrote:
            | You're seriously telling me that humans predict word by word
            | when they speak?
        
             | adam_arthur wrote:
             | A system that "predicts the next token" in such a way that
             | it is indistinguishable from a human, is just like a human
             | in practice yes.
             | 
             | How does a human decide which word to use in your mind?
             | Magic?
             | 
              | No, it's a logically based biological/neurological process
              | at the end of which you've decided on a word. They are both
              | forms of computing that can produce largely
              | indistinguishable output... it doesn't matter that one is
              | biological and the other isn't.
        
               | RhodesianHunter wrote:
               | It's quite possible a human forms a full representation
               | of a coherent thought before translating that thought
               | into words, meaning no token by token prediction.
        
               | doctor_eval wrote:
               | It's also possible that translating a coherent thought
               | into words is done on a token by token basis.
               | 
               | I'm surely not the only one who sometimes can't "find the
               | right word" in the middle of a sentence when trying to
               | describe a thought or idea.
        
               | adam_arthur wrote:
               | And digital audio will never match analog. Can anybody
               | tell the difference anymore?
               | 
                | If a machine produces highly similar output, does it
                | matter? LLMs clearly exhibit behavior consistent with
                | having implicitly learned systems, which is the valuable
                | part of intelligence. Humans infer systems through text
                | too, by the way.
        
             | stevenhuang wrote:
             | Actually yes, architecturally that's the essence of
             | predictive coding.
             | 
             | It's among the leading theories in neuroscience for how our
             | brains work https://en.wikipedia.org/wiki/Predictive_coding
        
         | woeirua wrote:
          | As LLM prices come down, social media is going to be
          | absolutely inundated with bots that are indistinguishable from
          | humans. I can't see a world where forums or social media are
          | _useful_ for anything in 10 years unless there are strict
          | gatekeepers (e.g. you have to receive a code in person, or
          | access is tied directly to your physical identity).
        
           | doliveira wrote:
           | Ironically for crypto bros, I think the way forward will be
           | to codify the real-world trust structures into the digital
           | world. The future is trustful.
           | 
           | I just really hope we find a way to codify it without
           | scanning people's eyeballs into the blockchain like the guy
           | in charge of the world's first AGI wants to do.
        
           | pradn wrote:
           | Isn't there a limit to this when one requires an account to
           | be tied to a phone number? Perhaps pseudonymous posting is on
           | a countdown clock.
        
           | rcarr wrote:
           | > or access is tied directly to your physical identity to
           | access the site).
           | 
           | This is what is inevitably going to happen. There will be
           | some kind of service provider (probably one of Apple, Google,
           | Microsoft, Amazon) who will verify who you are via official
           | documents such as passport and driving license. When you sign
           | up to a smaller company's service they'll check with the
           | providers to see if you're a genuine person and if so then
           | they'll let you join, if not you'll be blocked. You might be
           | able to use the forum with an anonymous name but the company
           | will always know who you are and if you use your account to
           | spam or abuse people you'll get blacklisted and reported to
           | the police. Any service that doesn't implement the model will
           | be an unusable hell hole of bots and spam.
           | 
           | The internet will splinter into two and you'll have the
           | "verified net" and the "unverified net" with the latter
           | basically becoming a second dark web. To be honest, I think
           | this will probably be a good thing. I think the vast majority
           | of people will spend most of their time on the verified net,
           | which will actually be a more pleasant place to be because
           | people won't be able to get away with what they can now
           | without real consequences in physical reality.
           | 
           | That being said there are plenty of ways it could go wrong -
           | if those accounts get hacked and the owner of the account
           | can't prove it then we could see innocent people going to
           | jail. Or state actors could hack the accounts of citizens
           | they see as problematic and frame them. But all that stuff
            | could happen today anyway - the verification or lack thereof
            | doesn't make that much difference, but it does substantially
            | reduce the use of bots.
        
             | withinboredom wrote:
             | Did you just describe AOL??
        
             | JimtheCoder wrote:
             | I have been thinking the same sort of thing as well over
             | the last while.
             | 
             | I was thinking more of a browser level plugin, in which
             | content from unverified users would be blurred out with a
             | "unverified user - click to view content" type of system.
             | Everything you post will be connected to your identity, so
             | you would be liable for deepfakes and the like. You would
             | also have an activity rating connected to your identity, so
             | other people could see if you are posting 1 piece of
             | content per hour, or 1000.
             | 
             | Maybe a personal media manager connected to the browser so
             | all of the public content that is "signed" by you will be
             | easily viewable by you, and if someone posts something that
             | is not actually yours under your identity somehow, you will
              | easily be able to rescind the signature.
             | 
             | Just random shower thoughts...
        
           | Enginerrrd wrote:
           | Yeah the ability to astroturf (at massive scale!) product
           | reviews or political opinions as comments in reddit posts and
           | the like will be sort of horrifying. The dead Internet
           | hypothesis may yet come true.
        
           | JustBreath wrote:
           | The worst part is social media networks aren't necessarily
           | against AI/bot engagement since it greatly fluffs their
           | numbers and keeps their users occupied.
           | 
           | It seems inevitable that some sort of signature or identity
           | proof will be necessary soon to participate in most online
           | forums.
           | 
           | Either esoteric networking between people or straight up
           | government/private entity issued multi factor authentication.
        
         | zht wrote:
         | this is some black mirror stuff
         | 
         | imagine Google's general approach to customer
         | service/moderation, but applied all over the place by companies
         | small and large
         | 
         | I shudder at the thought
        
           | ghaff wrote:
           | Especially with fairly systemic labor shortages, it seems
           | inevitable that we'll see more and more self-service and
           | automation with the corollary that getting an actual human
           | involved will become more difficult.
        
           | Xenoamorphous wrote:
            | I've found that it's pretty much impossible to talk to a
            | person in most customer services in the past few years; it's
            | always a "robot". And this has been going on since well
            | before LLMs.
        
             | HWR_14 wrote:
             | I find it easy to talk to a person. A US based person, no.
             | A person who can resolve my problem, maybe. But a person,
             | yes.
        
               | hypothesis wrote:
                | Yup, most companies are now simply providing "emotional"
                | support using sufficiently apologetic but powerless
                | agents. That is, if you can get past the chatbot screen
                | or the IVR maze. The hope is that the user gives up.
        
           | crazygringo wrote:
            | Or it could be precisely the opposite -- LLMs take care of
            | all of the easy customer service/moderation, so that it's
            | actually affordable for Google and others to hire high-
            | quality customer reps to manage the hard/urgent stuff that
            | LLMs surface.
           | 
           | I don't know, but generally speaking with technological
           | progress, while we lose some things we gain more things. It's
           | important to think not just what technology gets rid of, but
           | what it enables.
        
         | janalsncm wrote:
         | Back of the envelope calculation says it could be possible now.
         | 
          | Twitter gets about 500M tweets per day, and the average tweet
          | is 28 characters, so that's 14B characters per day. Converting
          | to tokens at around 4 chars/token, that's around 3.5B tokens
          | per day. If GPT-3.5 Turbo pricing (about $0.0015 per thousand
          | tokens) is representative, that works out to roughly $5k per
          | day. So it's possible now.
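          | 
          | The same envelope as a few lines of arithmetic:
          | 
          |     tweets_per_day = 500e6
          |     chars_per_tweet = 28
          |     chars_per_token = 4
          |     usd_per_1k_tokens = 0.0015  # GPT-3.5-turbo-ish
          | 
          |     tokens = tweets_per_day * chars_per_tweet / chars_per_token
          |     cost = tokens / 1000 * usd_per_1k_tokens
          |     print(f"{tokens:.1e} tokens/day, ~${cost:,.0f}/day")
          |     # 3.5e+09 tokens/day, ~$5,250/day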
         | 
         | However, you can probably get that cost down a lot with your
         | own models, which also has the benefit of not being at the
         | mercy of arbitrary API pricing.
        
         | ghaff wrote:
         | "Obviously" isn't really that obvious to me. We've seen plenty
         | of companies willing to pass a huge amount of work to
         | automation and if you have the misfortune to be an edge case
         | the automation can't handle, said companies are often perfectly
         | happy to let you fall through the cracks. Cheap and good enough
          | often trumps costlier and good.
        
           | moffkalast wrote:
           | The Google motto of 'good enough for most and screw the edge
           | cases'.
        
           | lcnPylGDnU4H9OF wrote:
           | I agree that companies will do whatever they can to cut
           | costs, including anything which can remove the necessity of a
           | human worker via automation. But I think this comment is not
           | responding to what was written (maybe there was an edit I
           | missed):
           | 
           | > Obviously humans will always be involved in coming up with
           | moderation policy
           | 
           | This comment seems to respond to:
           | 
           | > Obviously humans will always be involved in [moderation]
           | 
           | The first statement seems to hold true -- at least, it's a
           | more "obvious" conclusion. What scenario is required to fully
           | remove humans from coming up with moderation _policy_? These
           | companies who are so eager to automate certain tasks will
           | likely still be staffed by humans who would make the decision
           | to automate certain tasks.
        
             | ghaff wrote:
             | Fair enough. I suppose it's partly a question of what level
             | of policy we are talking about. I guess at some level,
              | humans have to set an umbrella policy. The question is how
              | granular humans get in enforcing policy specifics below
              | that.
        
       | alexmolas wrote:
        | Where's the comparison with traditional ML? In the article I only
        | see the good things about using LLMs, but there's no mention of
        | traditional ML apart from the title.
        | 
        | It would be nice to see how this "complex" approach compares
        | against a "simple" TF-IDF + RF or SVM.
        
         | specproc wrote:
         | Yeah, my thoughts exactly. If you're running 500k in tokens
          | through someone else's hallucination-prone computer and
         | paying for the privilege, I want to know why that's any better
         | than something like SetFit.
         | 
         | All I saw were attempts to reproduce some chatgpt output.
        
           | espe wrote:
            | +1 for SetFit. A baseline that's hard to beat.
        
           | hellovai wrote:
            | SetFit is fairly good, and we do help train SetFit-like
            | models on the results you get. However, the issue with SetFit
            | is that its latency and cost benefits come at the price of
            | flexibility.
            | 
            | If you want to add a new class or update an existing one, it
            | requires training a new model. Sometimes this is ok and
            | sometimes it's not. This is why we generally prefer a hybrid
            | approach where some classes are handled by traditional models
            | (BERT-based) while others are determined by the LLM.
        
             | specproc wrote:
              | I guess use case is everything. There are numerous
              | reasons, not least of which is confidentiality, why chatgpt
              | is a no-go for me.
             | 
             | What I'd like to see more of is systematic comparison
             | between chatgpt and classic models. I was hoping to see a
             | bit of this in this article and was disappointed.
        
               | hellovai wrote:
               | I appreciate the feedback, and also agree that chatgpt is
               | a no-go for many use cases.
               | 
               | We're working on putting together a better comparison
               | specifically along the lines of accuracy between the LLMs
               | (chatgpt, bard, falcon) and also traditional models. Hope
                | that one hits the spot for you! Are there specific
                | metrics you think might be interesting? We were primarily
                | looking at F1/accuracy for this task, but also attempting
                | to see what types of classes they work well on using
                | semantic similarity.
        
           | og_kalu wrote:
           | https://news.ycombinator.com/item?id=36685921
        
         | hellovai wrote:
          | Thanks Alex, in this article we focused more on deployment
          | comparisons, for example the cost and latency of deploying a
          | BERT-based model vs LLMs.
          | 
          | In a future article, we're planning on posting accuracy
          | comparisons as well, but for that we want to evaluate a few
          | more architectures. For example, at 1 TPS with 1k tokens,
          | chat-gpt-turbo would cost almost $5k vs a simpler BERT model
          | you could run for under $50.
         | 
         | This is probably very obvious to some people, but a lot of
         | people's first experience with any sort of AI is often an LLM,
         | so this is just the first of many posts we hope to share.
        
         | og_kalu wrote:
         | Current State of the art (GPT-4) is mostly on par with experts
         | and much better than crowdworkers. Might be overkill though.
         | 
         | https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...
         | 
         | 3.5 (what is used here) is better than crowd workers
         | https://arxiv.org/abs/2303.15056
        
           | lisasays wrote:
           | On par with "experts", no.
           | 
           | Per the article: "outperformed the most skilled crowdworkers"
           | on nuanced (but not highly technical) tasks like sentiment
           | labeling.
           | 
           | By definition, it can't outperform the expert ensemble
           | because that's where the gold labels come from.
        
             | og_kalu wrote:
             | It's as good or better than experts on 7/18 of those
             | benchmarks. On an additional 4, it's close (within 0.05).
             | 
             | >By definition, it can't outperform the expert ensemble
             | because that's where the gold labels come from.
             | 
             | The ensemble no but it can outperform an expert trying to
             | solve it. But yes the benchmarks are biased to the experts
             | here.
        
               | lisasays wrote:
               | Thanks, I confess to skimming and missed the individual
               | 'expert' column (presumably the mean of the individual
               | expert scores).
               | 
               | That said -- it looks like not only does model+ do worse
               | than experts on the other 12/18 (not 11/18 by my
               | counting), but when it does, it does so by a
               | significantly wider margin (2x-3x on average). For
               | example, the maximum model+ outperformance is on label
               | 'Stealing' (0.11) while there are 6 labels for which the
               | expert outperforms (by margins ranging from 0.12 to
               | 0.29).
               | 
               | In other words: distinctly sub-par compared with the
               | average expert. Which is probably why they didn't claim
               | it as a result in the paper :)
        
               | og_kalu wrote:
               | It's distinctly sub-par on 6 of 18 benchmarks while close
               | or better in 12. That's why I said "mostly on par"
        
               | lisasays wrote:
               | Seems you got it reversed - it's the expert which
               | performs sub-par on 6 rounds, while doing better on 12.
               | 
               | While missing also the part about when it outperforms, it
               | does so "at a significantly wider margin (2x-3x)". Which
               | is why, no, it's not "mostly on par".
               | 
               | Just look at the data.
        
         | viraptor wrote:
          | Or even slightly fancier Word2vec/USE, or sentence
          | transformers with clustering, which you can trivially run
          | locally rather than a full-blown conversational LLM. I'd love
          | to see a large-scale comparison.
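          | 
          | Something like this runs fine on a laptop (the model name is
          | just a common default, not a recommendation):
          | 
          |     from sentence_transformers import SentenceTransformer
          |     from sklearn.cluster import KMeans
          | 
          |     texts = ["where is my refund", "app crashes on login",
          |              "love the new update", "refund still missing"]
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     embeddings = model.encode(texts)
          | 
          |     labels = KMeans(n_clusters=2, n_init=10,
          |                     random_state=0).fit_predict(embeddings)
          |     print(list(zip(texts, labels)))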
        
         | jonathankoren wrote:
         | Yeah, I also find the lack of the comparison suspicious. As is
         | the talk about "hallucinated class labels" being "helpful".
         | 
         | If I had to take a guess, I suspect the LLM might perform a
          | touch better, but we're talking a fractional percent better.
          | Which is fine if you have the volume, but a wash otherwise.
        
           | og_kalu wrote:
           | https://news.ycombinator.com/item?id=36685921
        
       | wilg wrote:
       | Classic HN website nitpick: Logo should link to home page. In
       | this case it _is_ a link but just goes to the current page.
        | However, points for being able to easily get to the main product
        | page from the blog; usually that's buried.
        
         | hellovai wrote:
         | oh! Good catch! Fixed this, and will update in the release.
        
       | nestorD wrote:
       | LLMs are significantly slower than traditional ML, typically
       | costlier and, I have been told, tend to be less accurate than a
       | traditional model trained on a large dataset.
       | 
        | But they are zero/few-shot classifiers, meaning that you can get
        | your classification running and reasonably accurate _now_,
        | collect data, and switch to a fine-tuned, very efficient
        | traditional ML model later.
        
         | hellovai wrote:
         | That's a great summary and insight. We should likely use that
         | verbiage to help make it more crystal clear :)
        
         | godelski wrote:
         | > LLMs are significantly slower than traditional ML, typically
         | costlier
         | 
         | Literally point 3 in the article.
         | 
         | > But, they are zero/few shot classifiers
         | 
         | This is __NOT__ true. Zero-shot means out of domain, and if
          | we're talking about text-trained LLMs, there really isn't any
          | text that is out of domain for them, because they are trained
          | on almost anything you can find on the internet. This
         | is not akin to training something on Tiny Shakespeare and then
         | having it perform sentiment analysis (classification) on Sci-Fi
         | novels. Similarly, training a model on JFT or LAION does not
         | give you the ability to perform zero shot classification on
         | datasets like COCO or ImageNet, since the same semantic data
         | exists in both datasets. I don't know why people started using
          | this term to describe domain adaptation or transfer
         | learning, but it is not okay. Zero-shot requires novel classes,
         | and subsets are not novel.
        
           | PartiallyTyped wrote:
           | > Zero-shot means out of domain, and if we're talking about
           | text trained LLMs, there really isn't anything text that is
           | out of domain for them because they are trained on almost
           | anything you can find on the internet.
           | 
            | Respectfully, I disagree. I have used LLMs on actually novel
           | tasks for which there aren't any datasets out there. They
           | "get it".
           | 
           | > I don't know why people started using this term to describe
           | the domain adaptation or transfer learning, but it is not
           | okay. Zero-shot requires novel classes, and subsets are not
           | novel.
           | 
            | Respectfully, I disagree.
           | 
           | Zero-shot is perfectly valid because there is no
           | backpropagation or weight change involved. Causal LLMs are
           | meta-learners due to the attention mechanism and the
           | autoregressive nature of the model. These two change the
           | _effective_ weight of the matrices.
           | 
            | For all sequences of inputs and all possible weights, there
           | exists an instantiation of a neural network without attention
           | that produces identical vectors for the current token given
           | only the previous token.
           | 
           | Do the math, or read the paper "LLMs are meta learners".
           | 
            | Therefore, for all tasks, giving the model examples of inputs
            | changes its effective weights without actually modifying
            | them. It is perfectly valid "zero-shot learning" because you
            | didn't do backprop of any kind; you merely did input
            | transformations / preprocessing.
        
             | godelski wrote:
             | > I have used LLMs on actually novel tasks for which there
             | aren't any datasets out there. They "get it".
             | 
             | Can you give an example so that we may better discuss or
             | that I can adequately update my understandings? But I will
             | say that simplifying this down to "just trained to predict
             | the next token" is not accurate as it does not account for
             | the differences in architectures and cost functions which
             | dramatically affect this statement due to the differences
             | in their biases. As a clear example, training an image
              | model on likelihood does not guarantee that the model will
             | produce high fidelity samples[0]. But it will be better at
             | imputation or classification. Some other helpful
             | references[1,2]
             | 
             | > Zero-shot is perfectly valid because there is no
             | backpropagation or weight change involved.
             | 
             | I disagree with this. What you have described is still
             | within the broader class of fine tuning. Note that zero-
             | shot is also tuning. I can make this perfectly clear with a
             | simple example that is directly related to my previous
             | argument. ``Suppose we train a model on the CIFAR-10
             | dataset. Then we "zero-shot" evaluate it on CIFAR-5, where
             | we've just removed 5 random classes.`` I think you'll agree
             | that it should be unsurprising that the model performs well
             | on this second task. This is exactly the "Train on LAION
             | then 'zero-shot' classification on ImageNet" task we
             | commonly see. Subsets are not a clear task change.
             | 
             | > These two change the effective weight of the matrices.
             | 
             | I'm having a difficult time understanding your argument as
             | this directly contradicts your first sentence. I wouldn't
             | even make the lack of weight change a requirement for zero-
              | shot learning, as the intent is really that we do not need
              | to change them directly. If a model has enough general
              | knowledge
             | and we do not need to modify the parameters explicitly
             | through providing more training (i.e. using a cost function
             | and {back,forward}prop), then this is sufficient (randomly
             | changing parameters, adding non-trainable parameters like
             | activations, or pruning is also acceptable. As well as
             | explicitly what you mentioned). The point comes down to
             | requiring no additional training for __additional
             | domain{,s}__. The training part is not the important part
             | here and not what is in question.
             | 
             | My point is explicitly about claiming that subdomains do
             | not constitute zero-shot learning. If you disagree in what
             | I have claimed are subdomains, then that's a different
             | argument. I'm not arguing against the latter points because
             | that's also not arguing against what I claimed. But I will
             | say that "just because you didn't use backprop doesn't mean
             | it isn't zero-shot" and if you disagree, then note that you
             | have to claim that the CIFAR-5 example is "zero-shot."
             | 
             | Tldr: A -> B doesn't require that B -> A
             | 
             | [0]A note on the evaluation of generative models:
             | https://arxiv.org/abs/1511.01844 (link for also obtaining
             | slides and code: http://theis.io/publications/17/)
             | 
             | Also worth looking at many of the works that cite this one:
             | https://www.semanticscholar.org/paper/A-note-on-the-
             | evaluati...
             | 
             | [1a] Assessing Generative Models via Precision and Recall:
             | https://arxiv.org/abs/1806.00035
             | 
             | [1b] Improved Precision and Recall Metric for Assessing
             | Generative Models: https://arxiv.org/abs/1904.06991
             | 
             | [2] The Role of ImageNet Classes in Frechet Inception
             | Distance: https://arxiv.org/abs/2203.06026
        
           | radarsat1 wrote:
           | It comes from this paper [0], and I believe the idea is that
           | the LLM was not trained on the task in question, but is able
           | to do it with only instructions (zero shot) or with one or a
           | few examples (few shot). The paper rightfully points out the
           | unexpected fact that the model is only trained to predict the
           | next word and yet can follow arbitrary instructions and
           | perform tasks that it was not explicitly trained to do.
           | 
           | [0]: https://arxiv.org/abs/2109.01652
        
         | og_kalu wrote:
         | Current State of the art (GPT-4) is not going to be less
         | accurate than whatever bespoke option you can cook up.
         | 
         | https://news.ycombinator.com/item?id=36685921
        
           | golergka wrote:
           | It will be much slower and costlier.
        
           | mplewis wrote:
           | This is absolutely untrue.
        
             | og_kalu wrote:
             | Feel free to show otherwise
        
           | withinboredom wrote:
           | I wouldn't be so sure of that.
        
       | caycep wrote:
       | what's "traditional ML"?
        
       ___________________________________________________________________
       (page generated 2023-07-11 23:00 UTC)