[HN Gopher] Bad numbers in the "gzip beats BERT" paper?
       ___________________________________________________________________
        
       Bad numbers in the "gzip beats BERT" paper?
        
       Author : ks2048
       Score  : 329 points
       Date   : 2023-07-17 14:08 UTC (8 hours ago)
        
 (HTM) web link (kenschutte.com)
 (TXT) w3m dump (kenschutte.com)
        
       | ks2048 wrote:
        | Hi, that's my blog post. I'm pretty sure about what I wrote here,
        | but may need the authors to chime in in case I am missing
        | something. I just submitted an issue on GitHub:
       | https://github.com/bazingagin/npc_gzip/issues/3
        
         | [deleted]
        
         | lalaland1125 wrote:
         | Just wanted to say, thanks for your work debugging this.
         | 
         | You have probably saved other researchers an unfathomable
         | amount of time
        
           | [deleted]
        
         | eyegor wrote:
         | You may want to consider adding a note to the top. Seems like a
         | lot of people are lazily skimming/reading the headline and see
         | it as "gzip paper full of beans, gzip approach sucks" when
         | really I see this as "gzip approach not better than dnn models
         | but mostly competes and much cheaper to run". The paper is
         | still solid.
        
           | marcinzm wrote:
           | >gzip approach not better than dnn models but mostly competes
           | and much cheaper to run
           | 
           | Does it? It looks to do worse than FastText in all benchmarks
           | and kNN is not a cheap algorithm to run so it might actually
           | be slower than FastText.
           | 
           | edit: It looks like FastText takes 5 seconds to train on the
           | Yahoo Answers data set while the gzip approach took them 6
           | days. So definitely not faster.
        
             | eyegor wrote:
             | I'm not familiar with most of these models in detail, but
             | training time is generally less interesting than inference
             | time to me. I don't care if it takes a month to train on
             | $10k of gpu rentals if it can be deployed and run on a
             | raspberry pi. I should definitely look into fasttext
             | though.
        
               | amluto wrote:
               | As described in the paper, it didn't look like the gzip
               | classifier trained at all. Inference involved reading the
               | entire training set.
               | 
               | One could surely speed this up by preprocessing the
               | training set and snapshotting the resulting gzip state,
                | but that wouldn't affect the asymptotic complexity. In
                | effect, the number of parameters equals the size of
                | the entire training set. (Of course, lots of
               | fancy models scale roughly like this, too, so this isn't
               | necessarily a loss.)
        
               | huac wrote:
               | The gzip approach is much slower at inference time
               | because you need to compute the gzip representation of
               | the concatenated strings (query + target). Intuitively,
               | this should be significantly more than a dot product of
               | two embedding vectors.
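
                For concreteness, a minimal sketch of that
                concatenation trick as a normalized compression
                distance, assuming gzip as the compressor (exact
                concatenation details may differ from the npc_gzip
                repo):

                    import gzip

                    def clen(s: str) -> int:
                        # compressed length of the UTF-8 bytes
                        return len(gzip.compress(s.encode("utf-8")))

                    def ncd(x: str, y: str) -> float:
                        # normalized compression distance
                        cx, cy = clen(x), clen(y)
                        cxy = clen(x + " " + y)
                        return (cxy - min(cx, cy)) / max(cx, cy)

                    # each test item is compared against every
                    # training item, which is what makes this slow
                    query = "what is my safe passcode?"
                    train = ["my lockbox pin is 1234",
                             "the weather is nice today"]
                    print(min(train, key=lambda t: ncd(query, t)))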
        
               | amluto wrote:
               | The latter depends very strongly on how much computation
               | is needed to compute those embedding vectors.
               | 
                | If you run a GPT-3.5-sized model to compute that embedding
               | (which would be a bit absurd, but if you really want
               | GPT-3.5-quality classification, you may well be doing
               | something like this), you're looking through quite a few
               | tens of billions of parameters and doing a
               | correspondingly large number of FLOPs, which could be
               | just as expensive as running gzip over your whole (small,
               | private) training set.
        
               | huac wrote:
               | no, because the compute intensity scales with the number
               | of classes which you wish to classify to. if you have n
               | classes, you need to do n gzip compressions at inference
               | time. in the embedding world, you only call the embedding
               | model once on insert, and only need to dot product at
               | inference time.
               | 
               | the same logic extends to using a self-hosted embedding
               | model, which tend to be as good as Ada on most
               | benchmarks, and yes, can be finetuned over your private
               | data.
        
               | marcinzm wrote:
               | >The latter depends very strongly on how much computation
               | is needed to compute those embedding vectors.
               | 
               | Sure but the gzip metrics are worse than FastText which
                | computes the embeddings in essentially no time: tokenize,
                | look up embeddings by token ID, and then do some
               | averaging. So compared to that the gzip approach is very
               | slow.
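
                A rough sketch of why that kind of embedding is so
                cheap; the vocabulary and embedding table below are
                random stand-ins, not FastText's actual subword
                machinery:

                    import numpy as np

                    rng = np.random.default_rng(0)
                    # hypothetical pretrained table: 50k tokens, 100 dims
                    table = rng.normal(size=(50_000, 100))
                    vocab = {"how": 1, "do": 2, "i": 3, "start": 4,
                             "a": 5, "new": 6, "project": 7}

                    def embed(text: str) -> np.ndarray:
                        # tokenize, look up embeddings by id, average
                        toks = text.lower().split()
                        ids = [vocab.get(t, 0) for t in toks]
                        return table[ids].mean(axis=0)

                    print(embed("How do I start a new project").shape)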
        
             | tensor wrote:
             | FastText isn't a LLM, it's a token embedding model with a
             | simple classifier on top.
        
               | marcinzm wrote:
                | Sure, but its existence means the statement is really
                | "gzip approach not better than dnn models, and neither
                | competitive with nor cheaper to run than previous models
                | like FastText." That's not a very meaningful value statement
               | for the approach (although why gzip is even half-decent
               | might be a very interesting research question).
        
           | light_hue_1 wrote:
           | If this story is true the paper is not solid.
           | 
            | Claims in the abstract and claim 3 in the paper, as well as
            | much of the publicity around the paper, are just wrong.
           | 
           | It takes gzip from being great out of domain to being
           | middling at best. It goes from something really interesting
           | to a "meh" model. The main part that was intellectually
           | interesting is how robust gzip is out of domain, if that's
           | gone, there isn't much here.
           | 
           | If I was the reviewer for this paper, this would take the
           | paper from an accept to a "submit to a workshop".
           | 
           | Also, kNN methods are slow O(n^2).
        
             | [deleted]
        
             | ivirshup wrote:
             | kNN methods are broadly not O(n^2)[1], especially in
             | practice where approximate methods are used.
             | 
             | [1]: https://en.wikipedia.org/wiki/Nearest_neighbor_search
        
               | huac wrote:
               | how would you build an index over the gzip encoded data?
               | seems quite different from building indices over vector
               | embeddings.
        
           | tensor wrote:
           | I honestly don't know why anyone would use this gzip approach
           | in production. If you want to do text classification, really
           | the two options you should consider are a best in class
           | linear model like confidence weighted linear classification
           | by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml
           | _variance.p...) or a much more expensive LLMs.
        
             | eyegor wrote:
             | Do you happen to be familiar with audio classification?
             | There's been a ton of research on text classification and
             | prediction but not many good papers I've seen for general
             | audio classification. I'm talking more feature extraction,
             | not speech recognition. There are a lot of speech
              | recognition papers. So far I've been stuck on an FFT-to-
              | image-processing pipeline, but I haven't gotten great
              | results in real-world tests, only on nice test datasets.
             | 
             | Personally I don't have much experience working beyond
             | mlp/rnn/lstm/cnn models.
        
             | esafak wrote:
             | Don't look at it as a suggestion to use gzip in production,
             | but an invitation to reconsider the unassailable
             | superiority of BERT over simpler, tailored solutions.
        
               | empiko wrote:
               | Is it really an invitation? The paper shows that the
               | current models are worse for some marginalized languages
                | that are used as OOD datasets. I am not really that
                | surprised that the models don't speak those, and I don't
                | know anybody who would use BERT like that.
        
               | tensor wrote:
               | I don't think anyone actually doing NLP research has
               | thought that BERT is always better than simpler methods.
               | Linear classifiers with ngrams, or even better, large
               | margin linear classifiers, are well known to be
               | competitive with things like BERT on a variety of tasks,
               | with orders of magnitude better runtime.
               | 
               | In contrast, this gzip technique is considered a cute
               | application of information theory, but even in academia
               | is rarely included in studies because there are simpler
               | and better techniques for NLP.
               | 
               | Yes, if you are chasing the ultimate accuracy, then using
               | a LLM (not necessarily BERT either) is going to be the
               | best. But for a practical system trading some accuracy
               | for vastly improved runtime is usually a very good trade-
               | off. And again, it depends on your domain. Topic
               | classification, stick with a linear model. Sentiment
               | analysis? Ok, here a LLM actually gives substantially
               | better results so it's worth the extra cost if sentiment
               | is crucial to your application.
               | 
               | I personally like the CW algorithm I mentioned because
               | it's relatively easy to implement and has excellent
               | qualities. But if I were a dev looking for a ready to go
               | already implemented production system I'd go for vowpal
               | wabbit and move up to a LLM if I'm not getting the
               | accuracy I need for my application.
        
               | marcinzm wrote:
               | But FastText (2015) already exists and beats this gzip
               | approach on all criteria. So the invitation has already
               | existed before BERT (2018) and continues to exist.
        
         | cs702 wrote:
         | Whoa, I just read your post and saw this:
         | 
         |  _> tldr: it is reporting a top-2 accuracy rather than a
         | kNN(k=2) accuracy_
         | 
         | If the accuracy figures shown for other models are top-1,
         | that's a pretty significant mistake, hopefully an innocent one.
         | 
         | Thank you for doing this and sharing your findings!
         | 
         | ---
         | 
         | Previous discussion on HN:
         | https://news.ycombinator.com/item?id=36707193
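
          To make the distinction concrete, a toy illustration (not
          the paper's code; the strict tie-break shown is just one
          reasonable convention) of how counting either of the two
          nearest labels as a hit inflates the score:

              from collections import Counter

              def knn2_strict(two_nearest, true_label):
                  # k=2 vote; on a tie, fall back to the 1st neighbor
                  label, n = Counter(two_nearest).most_common(1)[0]
                  pred = two_nearest[0] if n == 1 else label
                  return pred == true_label

              def top2(two_nearest, true_label):
                  # hit if either of the two nearest labels matches
                  return true_label in two_nearest

              # the two nearest neighbors disagree; only one is right
              print(knn2_strict(["sport", "tech"], "tech"))  # False
              print(top2(["sport", "tech"], "tech"))         # True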
        
           | DebtDeflation wrote:
           | Also:
           | 
           | >k=2 is a bit of an odd choice for kNN classification
           | 
            | That's an understatement. Choosing an even number for any
            | sort of voting algorithm doesn't make much sense; choosing 2
            | specifically probably makes the least sense of all.
        
             | ks2048 wrote:
             | Yeah, I think this is one part where a reviewer or advisor
             | could have focused questions.
             | 
             | There is a sentence in Appendix C: "We set k = 2 for all
             | the methods on all the datasets and we report the maximum
             | possible accuracy getting from the experiments for each
             | method."
             | 
             | I'm not sure what the second part of that means exactly.
        
             | spi wrote:
             | Yeah that is a big red flag - as the OP mentions, there is
             | basically no way of making k=2 statistically different from
             | k=1, that's why nobody uses it.
             | 
             | I suppose the authors just tried many different k and
             | selected k=2 because it performed surprisingly well (likely
             | due to the bug the OP found out). But if the results were
             | significantly better than k=1 or k=3, it's a bit weird the
             | authors never double checked why that was the case. I guess
             | it can be one of those things you settle on early in the
             | overall process with a few experiments, and just take for
             | granted afterwards, never checking it again, but still, it
             | sounds like something that should pop out at some point
             | while writing the paper...?
        
             | 1024core wrote:
             | True. You want to always use an odd number so there are no
             | ties.
             | 
             | I'm guessing they were trying a parameter sweep, and found
             | that (thanks to the bug) they got the best results for K=2.
             | 
             | This too is problematic in its own sense.
        
               | ks2048 wrote:
                | Yes, agreed. One small point: for the multi-class case
                | (more than just two classes), which includes all the
                | datasets here, you can still get ties for odd k, e.g.
                | with k=3 you can get 1 vote each for 3 different
                | classes, etc.
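
                A quick illustration of that point, assuming a plain
                majority vote over the k nearest labels:

                    from collections import Counter

                    neighbors = ["sports", "politics", "tech"]  # k=3
                    print(Counter(neighbors).most_common())
                    # [('sports', 1), ('politics', 1), ('tech', 1)]
                    # a three-way tie: an odd k alone does not
                    # guarantee a unique majority with >2 classes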
        
               | 1024core wrote:
               | Multi-class is trickier. Maybe we can break down an
               | N-class problem into N binary-classification problems?
        
         | syats wrote:
         | Thanks for the replication, this is important.
         | 
         | One question, did you try to replicate the other result table
         | (Table 3)?
         | 
         | If I understand correctly, top-2 accuracy would be 1 if you
         | have only 2 classes, but it will differ from "normal" accuracy
         | less and less as the number of classes increases (on average).
         | So this shouldn't change the results for table 3 thaaat much as
         | the datasets have large amounts of classes (see table 1).
         | 
         | In any case, top-2 accuracy of 0.685 for the 20-newsgroups
         | dataset is pretty neat for a method that doesn't even consider
         | characters as characters[1], let alone tokens, n-grams,
          | embeddings and all the nice stuff that those of us working in
          | NLP have been devoting years to.
         | 
         | [1] In my understanding of gzip, it considers only bit
         | sequences, which are not necessarily aligned with words (aka.
         | bytes).
        
           | ks2048 wrote:
            | I haven't yet replicated Table 3 because most of those
            | datasets are much larger and it will take a while to run (they
            | said the YahooAnswers dataset took them 6 days).
           | 
           | Also, I have only tried the "gzip" row because that is all
           | that is in the github repo they referenced.
           | 
           | Yeah, you're right, the more classes there are, probably the
           | lower the effect this will have.
        
         | p1esk wrote:
         | Did you try contacting the authors before you went public with
         | your discovery?
        
           | _b wrote:
           | We're adult enough to have discussions like this in public.
           | They are healthy to have. People make mistakes. Kudos to the
           | original authors for releasing the source code so people
           | could inspect and replicate their results.
        
             | returningfory2 wrote:
             | I agree, and just want to add: nowadays it's super common
             | for researchers to widely publicize their new work on
             | social media. The blog post here even mentions "Table 5
             | from the paper was often included in tweets".
             | 
             | In this context of sharing your results very publicly, it
             | seems only fair that rebuttals would be very public, too.
             | Otherwise researchers would be highly incentivized to very
             | publicly publish weak results because they would get a lot
             | of positive publicity when they share the results, but not
             | much negative publicity when the weaknesses are found and
             | shared.
        
           | jarym wrote:
           | It isn't a security issue and doesn't warrant responsible
           | disclosure so why would op be expected to?
        
           | ks2048 wrote:
            | I did not, but I see why that could be a better approach. I
            | am mainly trying to be more open with little side projects I
            | do, so I want to start blogging about what I'm working on.
            | Also, this paper was being widely discussed, so I thought
            | this would be one more entry in that discussion.
        
       | puttycat wrote:
       | This is interesting, however, why not first discuss with authors
       | directly to make sure that you're right?
        
         | pinko wrote:
         | They did. See: https://github.com/bazingagin/npc_gzip/issues/3
        
           | Dayshine wrote:
           | One hour ago?
        
         | ks2048 wrote:
         | Yes, that could be a better idea. I am mainly trying something
         | new to work more "in the open" and write blogs about things as
          | I do them. I could be wrong and that would be pretty
         | embarrassing for me. I just published the code I used to
         | double-check things, now linked on the page near the bottom.
        
       | adamsmith143 wrote:
        | What's disturbing to me, if this is true, is the number of "top
        | voices" in ML on ML-Twitter who were head over heels for this
        | paper. How many of them actually read it at all?
        
       | abecedarius wrote:
       | One thing many people are missing: the simple gzip-for-text-
       | classification hack is not the contribution of this paper. (They
       | reference the standard intro AI textbook for that hack.) The
       | contribution is to use the gzip numbers _together_ with
       | k-nearest-neighbors.
       | 
       | In section 6.2 they compare gzip-distance+kNN vs. gzip-distance
       | on its own on four problems: it was better on two, and worse on
       | two others.
       | 
       | Another bit of background I guess is worth saying: language
       | models are pretrained with a compression objective. That is, the
       | loss function in pretraining is the cross entropy of the input
       | text, which means "minimize the compressed length of this input
       | if you fed it to this LM driving an arithmetic coder".
        
       | fnands wrote:
       | Just a note: your blog seems to be stuck in 2022. Date of post is
       | 17 July 2022
        
         | ks2048 wrote:
         | Thanks, should be fixed in a minute. That's what I get for
         | writing dates by hand...
        
       | bjord wrote:
       | If this is true, I'm looking forward to seeing how all the people
       | who made grandiose statements about that paper now quietly scrub
       | them.
       | 
       | LinkedIn and Twitter influencers, I'm looking at you in
       | particular.
       | 
       | If it's not true, I guess I'll be the one looking stupid--I only
       | skimmed the article.
        
       | AbrahamParangi wrote:
       | Probably disappointing to the authors but an excellent rebuttal.
       | 
       | This is the sort of mistake that's awfully easy to make in ML.
       | The unfortunate thing about the field is that subtle
       | methodological errors often cause subtle failures rather than
       | catastrophic failures as we're used to in many other branches of
       | engineering or science. You can easily make a system slightly
       | worse or slightly better by contaminating your training set with
       | bad data or accidentally leaking some information about your
       | target and the ML system will take it in stride (but with
       | slightly contaminated results).
       | 
       | This result makes sense to me because as much as I would like it
       | to be true, applying existing compression algorithms to ML feels
       | like too much of a "free lunch". If there was any special magic
       | happening in compression algorithms we'd use compression
       | algorithms as encoders instead of using transformers as
       | compressors.
        
         | godelski wrote:
         | > This is the sort of mistake that's awfully easy to make in
         | ML.
         | 
         | It is important to remember this! Mistakes are common because
         | they are easy to make. Science is a noisy process, but there is
         | signal there and what we see here is exactly what peer review
         | is about. I tend to argue that open publications are a better
         | form of peer review than conferences/journals because of
         | exactly this. Peer review is about your peers reviewing your
         | work, less about whatever random and noisy standard a
         | conference/journal puts forward. Remember that this was the way
         | things happened for most of our history and that our modern
         | notion of peer review is very recent (mid 70's). Older journals
         | were more about accomplishing the mission that arxiv
         | accomplishes today: disseminating works.
         | 
         | https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...
         | 
         | [side note] another reason I'd advocate for the abolishment of
         | conferences/journals is that through this we can actively
         | advocate for reproduction papers, failure papers, and many
         | other important aspects since we would not be held to the
         | "novelty" criteria (almost everything is incremental).
         | "Publishing" is about communicating your work to your peers and
         | having them validate or invalidate your results.
         | 
         | [edit] I think conferences are good in the fact that they bring
         | people together and that encourages collaboration. That's
         | great. But I should clarify that I'm specifically talking about
         | using these platforms as a means to judge the validity of
         | works. If a conference system wants to just invite works and
         | the community, then I'm totally cool with that. I do also like
         | journals in theory given that there's a conversation happening
         | between authors and reviewers, but I believe this also could
         | just easily be accomplished through arxiv + github or
         | OpenReview (preferred).
        
         | TX81Z wrote:
         | Academic research code is largely dogshit written as quickly as
         | possible by amateurs, barely tested whatsoever, and the primary
         | intended output of all such code is accumulating paper
         | citations.
         | 
         | A world with half as many scientific papers and twice as much
         | care would produce far more value but the whole enterprise is
         | hopelessly gamified.
        
         | jsight wrote:
         | > The unfortunate thing about the field is that subtle
         | methodological errors often cause subtle failures rather than
         | catastrophic failures as we're used to in many other branches
         | of engineering or science.
         | 
         | I've been doing a lot of studying in the ML field lately, and
         | I'm seeing this a lot. It is just another thing that feels like
         | the polar opposite of everything else that I've done as a
         | software engineer.
         | 
         | Miss a semicolon? Instant error.
         | 
         | Miscalculate some grads on one out of three layers? Every now
         | and then it might even work! But the results will be weird.
        
           | godelski wrote:
           | How about this one: tune your hyper-parameters based on the
           | results on your test data.
           | 
            | This is widespread, even the norm, but it is a form of
           | information leakage. You're passing information about the
           | test dataset to the model. The solution to this is to use 3
           | partitions: train, validation, test. Validation is for HP
           | tuning (you can do cross-validation btw) and test is a single
           | shot.
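
            A minimal sketch of that three-way protocol with
            scikit-learn (synthetic data, arbitrary 80/10/10 split):

                from sklearn.datasets import make_classification
                from sklearn.model_selection import train_test_split

                X, y = make_classification(
                    n_samples=1000, n_informative=5, n_classes=3,
                    random_state=0)

                # hold out the test set first, then carve validation
                # out of the rest; tune hyper-parameters on val only,
                # and touch the test set exactly once at the end
                X_tmp, X_test, y_tmp, y_test = train_test_split(
                    X, y, test_size=0.10, random_state=0, stratify=y)
                X_tr, X_val, y_tr, y_val = train_test_split(
                    X_tmp, y_tmp, test_size=1/9, random_state=0,
                    stratify=y_tmp)

                print(len(X_tr), len(X_val), len(X_test))  # ~800/100/100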
        
             | jsight wrote:
             | Yep, I've been guilty of that one lately. That and solving
             | problems by simply overfitting a neural net to the data in
             | the problem domain.
             | 
             | I mean, it works, but the result is less interesting than
             | what I should have done. :)
        
               | eyegor wrote:
               | What about: add more dropout or noise layers and train an
               | ensemble of models. Submit the best one. Is this
               | considered dirty?
        
               | godelski wrote:
               | Definitely. Problem is that doing this helps you get
               | published, not hurts. I think this is why there's often
               | confusion when industry tries to use academic models, as
               | they don't generalize well due to this overfitting. But
               | also, evaluation is fucking hard, and there's just no way
               | around that. Trying to make it easy (i.e. benchmarkism)
                | just ends up creating more noise instead of the intended
               | decrease.
        
         | iamflimflam1 wrote:
         | It's true in many experiments. The desire to get the result you
         | want can often overwhelm the need to validate what you are
         | getting.
         | 
         | Especially true when the results confirm any pre-existing
         | thinking you may have.
        
           | mattsan wrote:
           | Yep, confirmation bias. Luckily helped with peer review!
        
             | catgary wrote:
             | Hasn't this paper made it through peer review?
        
               | Karellen wrote:
               | I suspect GP commenter meant "replication study" rather
               | than "peer review".
               | 
               | ;-)
               | 
               | (Peer review doesn't check if your data is correct. They
               | check your data collection methods make sense given the
               | hypothesis you're testing, and that your conclusions are
               | supported by the data you collected.)
        
               | thomasahle wrote:
               | Yeah it was published at ACL (
               | https://aclanthology.org/2023.findings-acl.426/ ) which
               | is one of the most prestigious conferences in NLP. So
               | kinda disappointing.
               | 
               | But paper reviewers are usually not supposed to look at
               | the actual source code of the papers, and definitely
               | don't try to reproduce the results. They just read the
               | paper itself, which of course doesn't talk about the
               | error.
               | 
               | Not sure what the best solution is, other than having the
               | most "hyped" papers double verified by researchers on
               | Twitter.
        
               | godelski wrote:
               | > paper reviewers are usually not supposed to look at the
               | actual source code of the papers
               | 
               | Wait what? I haven't reviewed for ACL but most
               | conferences don't say "don't look at the source code."
               | They will say that reviewers are not required to look at
               | it (as well as the appendix). But generally it just isn't
               | uploaded. I do always look at the main method when it is
               | there but talking to my peers and advisor, this is very
               | uncommon[0]. My experience is that most reviewers do not
               | spend more than an hour on a work and make an opinion
               | within 15 minutes.
               | 
               | > Not sure what the best solution is, other than having
               | the most "hyped" papers double verified by researchers on
               | Twitter.
               | 
               | I'd say (as a start):
               | 
               | 1) Get rid of the conference system. A zero-shot (maybe
               | 1-shot if "rebuttal" is allowed) zero-sum system is just
               | disastrous, especially at scale. There's high incentives
               | to actually reject works you review for. A conference
               | system has a binary outcome and the purpose is to reject
               | 80% of papers based on a rather noisy metric of "top
               | tier." A journal system is a back and forth where
               | reviewers are trying to improve the paper. The purpose of
               | the reviewers here is to determine if the idea is indeed
               | good, and then if the paper meets the requirements or not
               | and must explicitly state what needs to be changed for
               | acceptance.
               | 
               | 1.5) An actual rebuttal system could help alleviate some
               | of these issues. Using OpenReview for a conversation
               | between authors and reviewers is critical. A singular 1
               | page response (the norm) is not adequate to respond to 4
               | different people who often have low similarities in
                | responses. Reviewers are allowed (though it breaks
                | guidelines) to respond in one sentence.
               | 
               | 2) ACs need to do a better job at validating reviewers.
                | The number of inane and absolutely unacceptable reviews
                | I have gotten is astounding (>25%). I've also
               | seen reviewers often break guidelines and have nothing
               | happen. Examples are comments such as those claiming lack
               | of novelty with no explanation or asking authors to
               | compare to concurrent works (I've had this happen for a
               | work that was put out _after_ submission deadlines. Not
               | mine, but here's an example[1] of this being done
               | publicly). If the reviewer is pushed to update their
               | comment then the authors have no ability to respond to
               | their update without the conversation aspect. If there is
               | high variance in response -- not just scores, but what
               | the critiques are about -- then the ACs need to look
               | closer as something is going wrong. We're in a crisis for
               | reviewers but we also have an undisclosed crisis in
               | quality of reviewers. Benchmarkism is on the rise but
               | benchmarks are extremely limiting for evaluation. There's
               | a certain irony given our frequent discussion of
               | Goodhart's Law or Reward Hacking. I'll even make the
               | claim that the quality crisis influences the quantity
               | crisis as I have seen many peers stop reviewing because
               | it isn't worth their time and they aren't getting a fair
                | shot in return. On a personal note, there is a journal I
                | will no longer review for because of unactionable and
                | unreasonable responses, and I won't submit to them
                | either.
               | 
               | 3) Either get rid of double-blind, or actually do it.
               | Everything is published on arxiv these days, which in
               | general is great for the community as it allows things to
               | move fast. But with this it is incredibly easy to de-
               | anonymize authors. Though for big labs, they de-anonymize
               | themselves actively[2]. In a very noisy process even a
               | very slight edge becomes a significant edge[3]. These
               | biases can even come unconsciously given that we're all
               | reading arxiv papers constantly and it isn't unlikely
               | that we come across some of the works we end up reviewing
               | (yet to knowingly happen to me fwiw). But certain labs do
               | have keywords that they use that can be identified.
               | 
               | I think one of the major problems comes down to this: in
               | a small community we have a certain level of
               | accountability, as we all end up knowing one another
               | through minimal connections. But in a large community
               | there is little to no accountability and what depends on
               | good faith can no longer be trusted. This encourages bad
               | actors, especially when the system is highly competitive
               | (see 1)), and creates bad science/evaluation creep. (e.g.
               | now standard to tune HPs on test data results -- this is
               | information leakage. If you don't, you likely can't
               | compete).
               | 
               | ======
               | 
               | [0] Here's a prominent researcher explicitly saying they
               | don't read the appendix, calling it trash, and a poll
               | showing most people don't look at it https://twitter.com/
               | david_picard/status/1660293648796340226
               | 
               | [1] Here's a prominent researcher criticizing a paper for
               | "not citing his work". I linked the top response which is
               | telling him the submission date was 2 months prior to his
               | arxiv release. This is someone who published >250 papers
               | vs someone with <50. For added reference, paper 2
               | (prominent researcher) was _published_ June 26th in TMLR,
               | but they did cite the other work (gotta give credit for
               | that)
               | https://twitter.com/RinonGal/status/1667943354670170118
               | 
                | [2] We have 2 scenarios here: either 1) reviewers do not
                | know Chinchilla == DeepMind, where I'd argue that they
                | are unfit for reviewing given the prominence of that
                | model, or 2) they do know, and thus know this is a
                | DeepMind work,
               | and we have an ethics problem. Neither sound great. https
               | ://openreview.net/forum?id=OpzV3lp3IMC&noteId=HXmrWV3ln..
               | .
               | 
               | [3] The conclusion in this analysis of consistency
               | experiment is that even a small amount of inconsistency
               | leads to a lot of noise given a highly selective
               | standard. Which means that paper acceptance itself is
               | highly stochastic: (2014 experiment)
               | https://inverseprobability.com/talks/notes/the-neurips-
               | exper...
               | 
               | [3.1] A shorter version:
               | https://blog.mrtz.org/2014/12/15/the-nips-experiment.html
               | 
                | [3.2] A follow-up on the 2014 experiment, tl;dr: reviewers
               | are good at identifying bad papers, but not good at
               | identifying good papers (i.e. bias to reject):
               | https://arxiv.org/abs/2109.09774
               | 
               | [3.3] A follow-up 2021 experiment (consistent with 2014
               | experiment): https://blog.neurips.cc/2021/12/08/the-
               | neurips-2021-consiste...
               | 
               | [3.4] Video form
               | https://www.youtube.com/watch?v=19Q-vMd9bYg
        
               | YeGoblynQueenne wrote:
               | I try not to submit to conferences if I can avoid it.
               | It's like you say, reviewers are looking for a reason to
               | reject. I don't understand what makes the difference
               | since it's usually the same people reviewing in both
               | conferences and journals, but somehow journal reviewers
               | do a much better job. Some journals have a fast
               | turnaround even, and still the quality of reviewing is
               | considerably better.
               | 
               | My second journal paper got rejected with encouragement
               | to resubmit. Half the reason for that was because the
               | reviewer had, I think, genuinely misunderstood the
               | description of an experiment, so I re-wrote it in
               | painstaking detail. I had a long section where I hammered
               | out a proof of complexity spanning three pages, with four
               | lemmas and a theorem, and the reviewer waded through all
               | that like a hero, and caught errors and made
               | recommendations for improvement. They made a new round of
               | recommendations when I resubmitted. That paper took three
               | rounds of revisions to publish (reject, resubmit, accept
               | with minor revisions) but it got 80% better every time I
               | had to revise it. I wish there was another couple of
               | rounds! It was exhausting, and I bet much more so to the
               | reviewer, but it was 100% worth it.
               | 
               | And yeah, I absolutely do my best to review like that
               | myself. Even in conferences, which probably seems really
               | weird to authors. But, hey, be the change you want to
               | see.
        
               | godelski wrote:
               | Yeah, honestly the only reason I submit to conferences
               | now is because my advisor asks me to. If it was up to me
               | I would submit exclusively to journals or just to
               | arxiv/open review directly. I think I'll do this when I
               | graduate (soon).
               | 
               | As for the reason why it happens in conferences, I think
               | it may actually be a different set of reviewers. While
               | journal reviewers are going to be conference reviewers, I
               | don't think the other way around is true. I think
               | conferences tend to just have a larger number of shitty
               | reviewers (as well as more shitty submissions). And as
               | you note, it is quite easy to misunderstand a work,
               | doubly so when you're reading under a time constraint. It
               | just makes for a noisy process, especially when reviewers
               | view their job as to reject (not improve). I just think
               | it is a bad system with a bad premise that can't really
               | be fixed. For conference reviewing, I always try to write
               | what would change my mind and if I think the authors
               | should resubmit to another venue. But even reviewing I
               | don't feel authors get a fair shot at responding. They
               | can't address all my comments while addressing others in
               | a single page.
               | 
               | Edit: I saw your bio. I actually have a SOTA work that is
               | rejected (twice). Good performance jump with large
               | parameter drop. But just couldn't tune or run enough
               | datasets because compute limited. Conferences are fun.
        
               | catgary wrote:
               | Yeah, it's not (entirely) the students' faults that this
               | slipped through peer review. I don't envy the whiplash
               | they're going to experience over the next few weeks.
               | 
               | If I was the graduate chair of their department I might
               | schedule a meeting with their supervisor to sort out how
               | this happened.
        
               | westurner wrote:
               | What about the difference in CPU cost, RAM cost, and GPU
               | training hours, though? What about the comparative
               | Big-E's of the models?
               | 
                | Great topic: Model minification and algorithmic complexity
        
           | mananaysiempre wrote:
           | One particular example that I remember from an introductory
           | particle physics class is the History Plots section[1] of the
           | biennial review of experimental data.
           | 
           | Knowing these quantities is important, but their particular
           | values largely aren't; nobody's funding or career really
           | depended on them being equal to one thing or another. Yet
           | look at all the jumps, where the measurements after the
           | initial very rough ones got stuck in the completely wrong
           | place until the jump to the right value--when it happened--
           | was of a completely implausible magnitude, like four, six, or
           | ten sigma.
           | 
           | [1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-
           | plots.p...
        
             | godelski wrote:
             | What's also good to see here is that the post '90 numbers
             | usually don't even fall within the error bars of the pre
             | '90 numbers. While uncertainty is great, it isn't the end
             | all. I think a lot of people forget how difficult
             | evaluation actually is. Usually we just look at one or two
             | metrics and judge based on that, but such an evaluation is
             | incredibly naive. Metrics and measures are only guides,
             | they do not provide certainty nor targets.
        
         | BSEdlMMldESB wrote:
         | now shift fields such that the subtle methodological errors
         | don't come to light in 20 years.
         | 
         | which field are you on now? economics!? haahha
        
       | refulgentis wrote:
       | Really happy to see this: KNN + classification task + doing
       | classification that's based on pure text similarity is a recipe
       | for stacked results.
       | 
        | Schadenfreude responses to this paper misunderstand that the
       | natural language stuff is crucially important for embeddings:
       | sure, phrases that share words will classify well and GZIP well,
       | so GZIP can be used as ersatz classification.
       | 
       | The miracle of BERT / embeddings is _not_ having to share words:
       | for instance, "what is my safe passcode?" has a strong match with
       | "my lockbox pin is 1234", but not "my jewelry is stored safely in
       | the safe".
       | 
       | This is also an important thing to consider with LLMs: people are
       | using embeddings intended to do text similarity, whereas you want
       | to use an SBERT model: that is trained to correlate a question to
       | a document that will answer it.
       | 
       | https://www.sbert.net/ for the full rabbit hole.
       | 
       | Previously: Should you use OpenAI's embeddings? Probably not, and
       | here's why. https://iamnotarobot.substack.com/p/should-you-use-
       | openais-e....
       | 
       | HN discussion: https://news.ycombinator.com/item?id=35377935
        
         | jabbany wrote:
         | > The miracle of BERT / embeddings is _not_ having to share
         | words
         | 
          | To be fair, the original task is specifically chosen to be one
          | where something like kNN+compression has a chance of being
          | good: i.e. out of domain + low resource.
         | 
         | Under these conditions the training inputs could be too sparse
         | for a highly parameterized model to learn good embeddings from.
         | 
         | In traditional in domain + big data classification settings
         | there's no chance that non-parametric methods like compression
         | would beat a learned representation.
        
       | renewiltord wrote:
       | Great job replicating and fixing. So easy to accidentally create
       | results that are statistical artifacts.
        
       | netdur wrote:
        | I have hacked together a JavaScript port
       | https://gist.github.com/netdur/a777f75fb70e0abc19c407c2ff7f9...
       | 
       | and it seems to work!!! regardless
       | 
       | Best matched answer for 'How do I start a new project?' is 'To
       | create a new project, go to the dashboard and click on 'New
       | Project'. Then, fill out the details and click 'Save'.'
       | 
       | Best matched answer for 'How can I delegate tasks to others in my
       | team?' is 'To invite team members to your project, open the
       | project and click on 'Invite Members'. Enter their email
       | addresses and click 'Send Invites'.'
       | 
       | Best matched answer for 'What do I do when I finish a task?' is
       | 'When a task is completed, it should be marked as 'Complete'. It
       | will then be moved to the 'Completed Tasks' section.'
        
       | chaxor wrote:
       | I appreciate that this blog post was made.
       | 
       | This is like so, so many little projects that I do (even
       | specifically showing problems in papers like this) that never see
       | the light of day. I usually just make a small noise, and then it
       | sits on my hard drive and that's it.
       | 
       | So thank you for putting this out there.
        
         | thomasahle wrote:
         | I've started using Twitter as a "lower effort blogging"
         | platform. After I spend a day on something like this, I'm
         | usually too tired to actually write a blog post about it, which
         | feels like a waste. But then writing a small Twitter thread is
         | usually within my capabilities.
        
       | snowstormsun wrote:
       | So, couldn't it be that the authors of the paper ran the model
       | with a random tie break and got lucky? This blog post seems to
       | assume they had the "rand" flag deactivated. Please correct me if
       | I am wrong.
        
         | expensive_news wrote:
         | From what I understand in the post getting lucky enough to see
          | that big of a change in this situation would be like flipping
          | 1000 heads in a row. It's not luck you could ever expect to
          | get.
        
           | snowstormsun wrote:
           | I see
        
       | usgroup wrote:
       | It wasn't obvious to me why the authors chose kNN for the
       | classifier. Since they produce a distance matrix, they could have
       | used multi-dimensional scaling to convert the matrix to factors,
       | and then used a tree algorithm such as xgboost which would likely
       | make use of more information than kNN and produce significantly
       | better results. They could have also used a PAQ compression
        | algorithm, which is much better than the LZ compressors -- all of
       | which could have significantly improved the results and possibly
       | delivered on their original conclusions.
       | 
        | What I liked about the subject paper is that the compression
       | algorithm is abstracted away and it led me to consider what else
       | one could do with compression via the p(x) ~ K^(-|x|)
       | relationship, where K is the alphabet size and |x| is the length
       | of the string x, and assuming optimal coding.
       | 
       | For example it occurred to me that one could do traditional
       | classification by packing the factors for each response into
       | separate documents and then proceeding as the paper does to
       | classify which document best compresses the next sample in order
       | to determine its class: a sort of supervised classification using
       | compression algorithms. The closer the compressor is to an
       | optimal code for the dataset, the better it will work.
       | 
       | Schemes for sequence prediction are equally straightforward to
       | implement.
       | 
       | I found it to be a pleasant surprise.
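
        A sketch of that supervised variant, assuming the winning
        class is the one whose pooled document grows the least in
        compressed size when the new sample is appended, with gzip
        standing in for a near-optimal coder:

            import gzip

            def clen(text: str) -> int:
                return len(gzip.compress(text.encode("utf-8")))

            def classify(sample: str, class_docs: dict) -> str:
                # smallest increase in compressed size wins
                def cost(doc: str) -> int:
                    return clen(doc + " " + sample) - clen(doc)
                return min(class_docs,
                           key=lambda c: cost(class_docs[c]))

            docs = {
                "sports": "the match ended two one after extra time",
                "tech": "the compiler release improves build times",
            }
            print(classify("the striker scored in extra time", docs))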
        
       | antonoo wrote:
       | My take on this:
       | 
       | https://twitter.com/antonosika/status/1679423272541192196?s=...
       | 
       | Regardless, great work digging into the code, and great work by
       | authors publishing the code.
        
       | softwaredoug wrote:
       | When we did search relevance experimentation at Shopify we made
       | lots of mistakes. I can empathize with the authors. I've had a
       | lot of my own public screw ups.
       | 
       | At the end of my time at Shopify I learned good science requires
       | good software engineering. It's really easy to make mistakes at
       | so many places in the stack.
       | 
       | We spent a lot of time on creating rigorous, heavily tested and
       | high quality software for our experiments so we could trust our
       | numbers and reproduce each others experiments. We tried to
       | discourage one-off evaluation methods, but if we created a new
       | one, to add it to our suite and test the metric to understand
       | what it meant.
       | 
       | It seems obvious, but sadly not as common as I wish it were in my
        | experience with this kind of experimentation. Companies want
        | velocity, and thinking deeply about statistics and building
        | internal tools are not in the interest of most higher-ups.
        
       | skrebbel wrote:
       | Can anyone explain to me how a compression algorithm can beat an
       | LLM at anything? Isn't that like saying horses are better than
       | graffiti?
       | 
       | I'm sure the answer is in there somewhere but I'm not well versed
       | in AI and I simply can't figure it out.
        
         | ozr wrote:
         | The intuition about how the gzip method works goes like so:
         | 
         | If you compress `ABC`, it will be X bytes. If you then compress
          | `ABCABC`, it will _not_ take 2x bytes. The more similar the two
          | strings that you concatenate, the fewer bytes it will take.
         | `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
         | 
          | BERT is, by today's standards, a very small LLM, which we know
         | has weaker performance than the billion-param scale models most
         | of us are interacting with today.
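
          You can see this directly with Python's gzip module (longer
          strings than `ABC` are used so the effect is visible past
          gzip's fixed header overhead; exact byte counts vary with
          compression level):

              import gzip

              def clen(s: bytes) -> int:
                  return len(gzip.compress(s))

              x = b"the quick brown fox jumps over the lazy dog. " * 20
              y = b"an unrelated sentence about entropy coding here. " * 20

              print(clen(x))      # baseline
              print(clen(x + x))  # same text twice: far less than 2x
              print(clen(x + y))  # unrelated text appended: much bigger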
        
           | jabbany wrote:
           | > very small LLM
           | 
           | Heh. So does that make it a MLM (medium)?
           | 
           | I've always found it funny that we've settled on a term for a
           | class of models that has a size claim... Especially given how
           | fast things are evolving...
        
         | stuartaxelowen wrote:
         | Many other replies here are wrong - the primary reason is that
         | the LLMs were used on completely out of distribution data (e.g.
         | trained on English, evaluated on completely different language
         | that shared some characters). The points about compression's
         | relatedness to understanding are valid, but they are not the
         | primary reason for LLMs underperforming relative to naive
         | compression.
        
         | ks2048 wrote:
         | It's a very limited task: take a document and classify it into
         | one of (for example) 10 or so categories. Things like detecting
          | certain words can do pretty well in some cases. Texts that
          | compress well together tend to share common substrings.
        
         | kachnuv_ocasek wrote:
         | One way to interpret/understand language models is as quite
         | involved compression algorithms.
        
         | refulgentis wrote:
         | Other reply is great, more in-depth on details from me here:
         | https://news.ycombinator.com/item?id=36758681
         | 
         | Plain english TL;DR:
         | 
         | - if you limit your task to binning snippets of text
         | 
         | - and the snippets are very well-defined (ex. code vs. Filipino
         | text)
         | 
          | - the snippets' _bytes_ could be compared and score well, no
         | text understanding needed
         | 
         | - size delta of a GZIP after adding one more sentence acts as
          | an ersatz way to compare sets of bytes to each other (ex. you
         | can imagine a GZIP containing 0xFFABCDEF that has 0xFFABCDEF
         | added to it will have a size delta of 0)
        
           | fsmv wrote:
           | Did you read the recent Douglas Hofstadter article or do you
           | just always use the word ersatz?
        
             | refulgentis wrote:
             | I was homeschooled till 12 and mostly left to my own
             | devices as long as it was reading - I believe that has
             | caused a lifelong issue where I sound like a tryhard
              | unintentionally :( (TL;DR I use it but IDK when, didn't see
              | the Hofstadter article, but now I'm looking forward to it)
        
               | stavros wrote:
               | You can substitute the word "substitute" as an ersatz
               | "ersatz".
        
               | refulgentis wrote:
               | It's honestly weird and annoying and I'd give it up in a
               | second.
               | 
               | There's two issues:
               | 
               | - I don't have an ear for what's simple vocabulary versus
               | tryhard, I go into a mad loop when I try
               | 
               | - even if I actively notice it, substitution can seem
               | very far away from intent. Simple wouldn't have occurred
               | to me - I wanted to say something more akin to sloppy /
               | stunt and ersatz is much closer to "hacky" in meaning
               | than simple. Think MacGyver.
               | 
               | But I should do the exercise of at least scanning for
               | words more often and aim for wider audience - I would
               | have known ersatz was an outlier and I shouldn't feel
               | it's condescending or diluting meaning, it's broadening
               | the audience who can parse it
        
               | inimino wrote:
               | Why would you apologize for your vocabulary and try to
               | sound like someone less well-read than you are? Just get
               | over it and be yourself.
        
               | stavros wrote:
               | Eh, it's fine, it doesn't sound tryhard to me, just a bit
               | hard to read.
        
               | matthewdgreen wrote:
               | The word ersatz is great, and conveys the notion that the
               | replacement is simpler and possibly inferior when
               | compared across all features. "Substitute" doesn't cut
               | it. Human language (ironically w.r.t. TFA) isn't a
               | collection of redundant symbols, the synonyms carry all
               | sorts of useful nuance.
        
               | tonyg wrote:
               | It doesn't sound tryhard; it sounds literate.
        
         | ajtulloch wrote:
         | https://www.inference.org.uk/itprnn/book.pdf is a classic text
         | on this connection.
        
         | optimalsolver wrote:
         | Compression is equivalent to intelligence:
         | 
         | https://mattmahoney.net/dc/rationale.html
        
         | awegio wrote:
         | A language model estimates the probability of a sequence of
         | words P(w_1, ..., w_n) or equivalently P(word | context).
         | 
         | For compression, word sequences that have higher probability
         | should be encoded with shorter codes, so there is a direct
         | relationship. A well known method to construct such codes based
         | on probabilities is Huffman coding.
         | 
         | This works whether you use a statistical language model using
         | word frequencies or an LLM to estimate probabilities. The
         | better your language model (lower perplexity) the shorter the
         | compressed output will be.
         | 
         | Conversely, you can probably argue that a compression algorithm
         | implicitly defines a language model by the code lengths, e.g.,
         | it assumes duplicate strings are more likely than random noise.
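
          A worked micro-example of that link, using ideal code
          lengths of -log2(p) bits (a Huffman code approximates these
          with whole-bit codewords):

              import math

              # toy next-word model: P(word | context) for one context
              p = {"cat": 0.5, "dog": 0.25, "emu": 0.125, "gnu": 0.125}

              for word, prob in p.items():
                  print(word, -math.log2(prob), "bits")
              # cat 1.0, dog 2.0, emu 3.0, gnu 3.0
              # one matching Huffman code: 0, 10, 110, 111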
        
         | GuB-42 wrote:
         | Generally, compression = model + entropy coding.
         | 
         | The model's job is to predict what comes next. The entropy
         | coder's job is to encode the difference between the prediction
         | and what actually comes next so that the most likely outcome
          | uses as few bits as possible. The more accurate the model is,
          | the smaller the difference between reality and prediction, the
          | fewer bits the entropy coder needs, and the better the
         | compression.
         | 
         | Simple compression algorithms have simple models, like "if I
         | see the same byte 10 times, the 11th is likely to be the same".
         | But you can also use a LLM as your model, as completing text
         | with the most likely word is what LLMs do.
         | 
         | Here they did the opposite. Instead of using a model for
         | compression, by using a few tricks, they used a compression
         | algorithm as a model: the most likely outcome is when the
         | compression algorithm uses less bits to encode the result. And
         | the original authors have shown that, in some tasks, the simple
         | model that can be extracted out of gzip beats much more complex
         | LLMs.
        
           | contravariant wrote:
            | Generally, compression algorithms try to give structured data a
            | distribution more similar to random data.
            | 
            | If any byte sequence is a correct file (unlikely, but mostly
            | because compression algorithms try to be robust against
            | corruption), then this is easy to reverse: you just generate
            | a random sequence of bytes and then decompress it.
           | 
           | Basically you can turn a compression algorithm into a
           | probability distribution by inserting random bytes wherever
           | the decompression algorithm tries to read one, but sometimes
           | not _all_ bytes are allowed.
           | 
           | You can then reason about this probability distribution and
           | see what it's properties are. Typically something with a
           | probability of 'p' will require -log(p)/log(2) bits.
        
           | hoosieree wrote:
           | I almost feel like compression and embeddings are duals of
           | each other, but I can't quite articulate it.
           | 
            | Embeddings use fixed-size vectors to minimize the distance
            | between vectors of similar inputs. Compressors use a
            | variable-length encoding to minimize the overall stored size.
        
         | IKantRead wrote:
         | LLMs and essentially all neural networks can be viewed as
         | learning compression algorithms where the behavior of the
         | compression algorithm is learned and subject to potential
         | constraints beyond mere file reconstruction.
         | 
         | Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG
         | of the Web"[0] to get a better sense of this.
         | 
          | Keeping this fact in your mental model of neural networks can
          | also go a long way toward demystifying them.
         | 
         | 0. https://www.newyorker.com/tech/annals-of-
         | technology/chatgpt-...
        
           | FeepingCreature wrote:
           | (The human brain is also, in part, a blurry JPEG of the
           | world.)
        
             | IKantRead wrote:
             | We currently have no reason to believe this, and
             | information we do have seems to suggest that is very
             | unlikely to be the case. I'm also guessing from my username
             | you can infer that I don't think we even know enough to
             | concretely say what is this "world" you are referencing.
        
               | og_kalu wrote:
               | I don't know what exactly blurry jpeg means to you but we
               | have every reason to believe we operate on shortcuts of
               | reality, not reality. Nearly all your brain does with
               | sense data is warp it to confirm to internal predictions
               | in numerous ways.
               | 
               | Memories are always part fabrications. You can't return
               | to previous mental states (you only think you do) and we
               | have no real clue what really informs decisions i.e
               | preferences shape choices just as much as choices shape
               | preferences.
               | 
                | Your brain will happily fabricate rationales you sincerely
                | believe for decisions that couldn't possibly be true, e.g.
                | split-brain experiments.
        
       | AtNightWeCode wrote:
       | In current times *zip mostly cripples things like the web and
        | Docker. Finding out that it is the best at something in 2023
        | is not very likely.
       | 
       | Nice find btw.
        
       ___________________________________________________________________