[HN Gopher] Bad numbers in the "gzip beats BERT" paper?
___________________________________________________________________
Bad numbers in the "gzip beats BERT" paper?
Author : ks2048
Score : 329 points
Date : 2023-07-17 14:08 UTC (8 hours ago)
(HTM) web link (kenschutte.com)
(TXT) w3m dump (kenschutte.com)
| ks2048 wrote:
| Hi, that's my blog post. I'm pretty sure about what I wrote here,
| but may need the authors to chime in in case I am missing
| something. I just submitted an issue on GitHub:
| https://github.com/bazingagin/npc_gzip/issues/3
| [deleted]
| lalaland1125 wrote:
| Just wanted to say, thanks for your work debugging this.
|
| You have probably saved other researchers an unfathomable
| amount of time
| [deleted]
| eyegor wrote:
| You may want to consider adding a note to the top. Seems like a
| lot of people are lazily skimming/reading the headline and seeing
| it as "gzip paper full of beans, gzip approach sucks" when
| really I see this as "gzip approach not better than dnn models
| but mostly competes and much cheaper to run". The paper is
| still solid.
| marcinzm wrote:
| >gzip approach not better than dnn models but mostly competes
| and much cheaper to run
|
| Does it? It looks to do worse than FastText in all benchmarks
| and kNN is not a cheap algorithm to run so it might actually
| be slower than FastText.
|
| edit: It looks like FastText takes 5 seconds to train on the
| Yahoo Answers data set while the gzip approach took them 6
| days. So definitely not faster.
| eyegor wrote:
| I'm not familiar with most of these models in detail, but
| training time is generally less interesting than inference
| time to me. I don't care if it takes a month to train on
| $10k of gpu rentals if it can be deployed and run on a
| raspberry pi. I should definitely look into fasttext
| though.
| amluto wrote:
| As described in the paper, it didn't look like the gzip
| classifier trained at all. Inference involved reading the
| entire training set.
|
| One could surely speed this up by preprocessing the
| training set and snapshotting the resulting gzip state,
| but that wouldn't affect the asymptotic complexity. In
| effect, the number of parameters equals the size of the
| entire training set. (Of course, lots of
| fancy models scale roughly like this, too, so this isn't
| necessarily a loss.)
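|
| A rough sketch of that snapshotting idea in Python's zlib (my own
| illustration, not anything from the paper; note that DEFLATE's 32 KB
| window limits how much "training" context the saved state can
| actually remember):
|
|     import zlib
|
|     train_text = b"...all training examples, concatenated..."  # placeholder
|
|     base = zlib.compressobj(9)
|     # compress the training prefix once, up front (~ C(train))
|     prefix = base.compress(train_text) + base.flush(zlib.Z_SYNC_FLUSH)
|
|     def extra_bytes(query: bytes) -> int:
|         """Approximate C(train + query) - C(train) without recompressing train."""
|         c = base.copy()  # snapshot the compressor state
|         return len(c.compress(query) + c.flush())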
| huac wrote:
| The gzip approach is much slower at inference time
| because you need to compute the gzip representation of
| the concatenated strings (query + target). Intuitively,
| this should be significantly more expensive than a dot product of
| two embedding vectors.
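|
| For reference, the per-pair distance the paper uses is the normalized
| compression distance; a minimal sketch (not the authors' exact code,
| and the space separator is an assumption) looks like:
|
|     import gzip
|
|     def clen(s: str) -> int:
|         return len(gzip.compress(s.encode("utf-8")))
|
|     def ncd(x: str, y: str) -> float:
|         cx, cy = clen(x), clen(y)
|         cxy = clen(x + " " + y)
|         return (cxy - min(cx, cy)) / max(cx, cy)
|
| Nothing reusable is precomputed per training example (unlike a cached
| embedding), so all of this work happens at inference time.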
| amluto wrote:
| The latter depends very strongly on how much computation
| is needed to compute those embedding vectors.
|
| If you run a GPT-3.5-sized model to compute that embedding
| (which would be a bit absurd, but if you really want
| GPT-3.5-quality classification, you may well be doing
| something like this), you're looking through quite a few
| tens of billions of parameters and doing a
| correspondingly large number of FLOPs, which could be
| just as expensive as running gzip over your whole (small,
| private) training set.
| huac wrote:
| no, because the compute intensity scales with the number
| of classes you wish to classify into. if you have n
| classes, you need to do n gzip compressions at inference
| time. in the embedding world, you only call the embedding
| model once on insert, and only need to dot product at
| inference time.
|
| the same logic extends to using a self-hosted embedding
| model, which tends to be as good as Ada on most
| benchmarks, and yes, can be finetuned over your private
| data.
| marcinzm wrote:
| >The latter depends very strongly on how much computation
| is needed to compute those embedding vectors.
|
| Sure but the gzip metrics are worse than FastText which
| computes the embeddings in essentially no time. Tokenize,
| lookup embeddings by token id, and then do some
| averaging. So compared to that the gzip approach is very
| slow.
| tensor wrote:
| FastText isn't a LLM, it's a token embedding model with a
| simple classifier on top.
| marcinzm wrote:
| Sure, but its existence means the statement is really
| "gzip approach not better than dnn models, and neither
| competitive with nor cheaper to run than previous models like
| FastText." That's not a very meaningful value statement
| for the approach (although why gzip is even half-decent
| might be a very interesting research question).
| light_hue_1 wrote:
| If this story is true the paper is not solid.
|
| The claims in the abstract and claim 3 in the paper, as well as
| much of the publicity around the paper, are just wrong.
|
| It takes gzip from being great out of domain to being
| middling at best. It goes from something really interesting
| to a "meh" model. The main part that was intellectually
| interesting is how robust gzip is out of domain, if that's
| gone, there isn't much here.
|
| If I were the reviewer for this paper, this would take the
| paper from an accept to a "submit to a workshop".
|
| Also, kNN methods are slow, O(n^2).
| [deleted]
| ivirshup wrote:
| kNN methods are broadly not O(n^2)[1], especially in
| practice where approximate methods are used.
|
| [1]: https://en.wikipedia.org/wiki/Nearest_neighbor_search
| huac wrote:
| how would you build an index over the gzip encoded data?
| seems quite different from building indices over vector
| embeddings.
| tensor wrote:
| I honestly don't know why anyone would use this gzip approach
| in production. If you want to do text classification, really
| the two options you should consider are a best in class
| linear model like confidence weighted linear classification
| by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml
| _variance.p...) or much more expensive LLMs.
| eyegor wrote:
| Do you happen to be familiar with audio classification?
| There's been a ton of research on text classification and
| prediction but not many good papers I've seen for general
| audio classification. I'm talking more feature extraction,
| not speech recognition. There are a lot of speech
| recognition papers. So far I've been stuck on an FFT-to-image-
| processing pipeline but I haven't gotten great results in
| real-world tests, only on nice test datasets.
|
| Personally I don't have much experience working beyond
| mlp/rnn/lstm/cnn models.
| esafak wrote:
| Don't look at it as a suggestion to use gzip in production,
| but an invitation to reconsider the unassailable
| superiority of BERT over simpler, tailored solutions.
| empiko wrote:
| Is it really an invitation? The paper shows that the
| current models are worse for some marginalized languages
| that are used as OOD datasets. I am not really that
| surprised that the models don't speak those, and I don't
| know anybody who would use BERT like that.
| tensor wrote:
| I don't think anyone actually doing NLP research has
| thought that BERT is always better than simpler methods.
| Linear classifiers with ngrams, or even better, large
| margin linear classifiers, are well known to be
| competitive with things like BERT on a variety of tasks,
| with orders of magnitude better runtime.
|
| In contrast, this gzip technique is considered a cute
| application of information theory, but even in academia
| it is rarely included in studies because there are simpler
| and better techniques for NLP.
|
| Yes, if you are chasing the ultimate accuracy, then using
| an LLM (not necessarily BERT either) is going to be the
| best. But for a practical system trading some accuracy
| for vastly improved runtime is usually a very good trade-
| off. And again, it depends on your domain. Topic
| classification, stick with a linear model. Sentiment
| analysis? Ok, here an LLM actually gives substantially
| better results so it's worth the extra cost if sentiment
| is crucial to your application.
|
| I personally like the CW algorithm I mentioned because
| it's relatively easy to implement and has excellent
| qualities. But if I were a dev looking for a ready to go
| already implemented production system I'd go for vowpal
| wabbit and move up to a LLM if I'm not getting the
| accuracy I need for my application.
| marcinzm wrote:
| But FastText (2015) already exists and beats this gzip
| approach on all criteria. So the invitation has already
| existed before BERT (2018) and continues to exist.
| cs702 wrote:
| Whoa, I just read your post and saw this:
|
| _> tldr: it is reporting a top-2 accuracy rather than a
| kNN(k=2) accuracy_
|
| If the accuracy figures shown for other models are top-1,
| that's a pretty significant mistake, hopefully an innocent one.
|
| Thank you for doing this and sharing your findings!
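|
| For anyone wondering what the difference looks like in code, here is
| a small sketch (my own illustration, not the repo's code), where
| `neighbors` holds the labels of the 2 nearest training points for
| each test item:
|
|     import random
|     from collections import Counter
|
|     def top2_accuracy(neighbors, y_true):
|         # counts a hit if EITHER of the two nearest labels is correct
|         return sum(y in pair for pair, y in zip(neighbors, y_true)) / len(y_true)
|
|     def knn2_accuracy(neighbors, y_true, seed=0):
|         # a real k=2 vote: a 1-vs-1 tie has to be broken somehow (randomly here)
|         rng = random.Random(seed)
|         correct = 0
|         for pair, y in zip(neighbors, y_true):
|             counts = Counter(pair)
|             best = max(counts.values())
|             pred = rng.choice([lab for lab, c in counts.items() if c == best])
|             correct += (pred == y)
|         return correct / len(y_true)
|
| The first number can only be greater than or equal to the second,
| which is why reporting it as plain accuracy flatters the method.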
|
| ---
|
| Previous discussion on HN:
| https://news.ycombinator.com/item?id=36707193
| DebtDeflation wrote:
| Also:
|
| >k=2 is a bit of an odd choice for kNN classification
|
| That's an understatement. Choosing an even number for any
| sort of voting algorithm doesn't make much sense, choosing 2
| specifically probably makes the least sense of all.
| ks2048 wrote:
| Yeah, I think this is one part where a reviewer or advisor
| could have focused questions.
|
| There is a sentence in Appendix C: "We set k = 2 for all
| the methods on all the datasets and we report the maximum
| possible accuracy getting from the experiments for each
| method."
|
| I'm not sure what the second part of that means exactly.
| spi wrote:
| Yeah that is a big red flag - as the OP mentions, there is
| basically no way of making k=2 statistically different from
| k=1, which is why nobody uses it.
|
| I suppose the authors just tried many different k and
| selected k=2 because it performed surprisingly well (likely
| due to the bug the OP found out). But if the results were
| significantly better than k=1 or k=3, it's a bit weird the
| authors never double checked why that was the case. I guess
| it can be one of those things you settle on early in the
| overall process with a few experiments, and just take for
| granted afterwards, never checking it again, but still, it
| sounds like something that should pop out at some point
| while writing the paper...?
| 1024core wrote:
| True. You want to always use an odd number so there are no
| ties.
|
| I'm guessing they were trying a parameter sweep, and found
| that (thanks to the bug) they got the best results for K=2.
|
| This too is problematic in its own sense.
| ks2048 wrote:
| Yes, agreed. One small point: for the multi-class case
| (more than just two classes), which include all the
| datasets here, you can still get ties for odd k. e.g.
| k=3, you can get 1 vote each for 3 different classes,
| etc.
| 1024core wrote:
| Multi-class is trickier. Maybe we can break down an
| N-class problem into N binary-classification problems?
| syats wrote:
| Thanks for the replication, this is important.
|
| One question, did you try to replicate the other result table
| (Table 3)?
|
| If I understand correctly, top-2 accuracy would be 1 if you
| have only 2 classes, but it will differ from "normal" accuracy
| less and less as the number of classes increases (on average).
| So this shouldn't change the results for table 3 thaaat much as
| the datasets have large amounts of classes (see table 1).
|
| In any case, top-2 accuracy of 0.685 for the 20-newsgroups
| dataset is pretty neat for a method that doesn't even consider
| characters as characters[1], let alone tokens, n-grams,
| embeddings and all the nice stuff that those of us working on
| NLP have been devoting years to.
|
| [1] In my understanding of gzip, it considers only bit
| sequences, which are not necessarily aligned with words (i.e.,
| bytes).
| ks2048 wrote:
| I haven't yet replicated Table 3 because most of those
| datasets are much larger and it will take a while to run (they
| said the YahooAnswers dataset took them 6 days).
|
| Also, I have only tried the "gzip" row because that is all
| that is in the github repo they referenced.
|
| Yeah, you're right, the more classes there are, probably the
| lower the effect this will have.
| p1esk wrote:
| Did you try contacting the authors before you went public with
| your discovery?
| _b wrote:
| We're adult enough to have discussions like this in public.
| They are healthy to have. People make mistakes. Kudos to the
| original authors for releasing the source code so people
| could inspect and replicate their results.
| returningfory2 wrote:
| I agree, and just want to add: nowadays it's super common
| for researchers to widely publicize their new work on
| social media. The blog post here even mentions "Table 5
| from the paper was often included in tweets".
|
| In this context of sharing your results very publicly, it
| seems only fair that rebuttals would be very public, too.
| Otherwise researchers would be highly incentivized to very
| publicly publish weak results because they would get a lot
| of positive publicity when they share the results, but not
| much negative publicity when the weaknesses are found and
| shared.
| jarym wrote:
| It isn't a security issue and doesn't warrant responsible
| disclosure, so why would OP be expected to?
| ks2048 wrote:
| I did not, but I see why that could be a better approach. I
| am mainly trying to be more open with little side projects I
| do, so I want to start blogging about what I'm working on. Also,
| this paper was being widely discussed, so I thought this would
| be one more entry in that discussion.
| puttycat wrote:
| This is interesting. However, why not first discuss it with the
| authors directly to make sure that you're right?
| pinko wrote:
| They did. See: https://github.com/bazingagin/npc_gzip/issues/3
| Dayshine wrote:
| One hour ago?
| ks2048 wrote:
| Yes, that could be a better idea. I am mainly trying something
| new to work more "in the open" and write blogs about things as
| I do them. I could be wrong, and that would be pretty
| embarrassing for me. I just published the code I used to
| double-check things, now linked on the page near the bottom.
| adamsmith143 wrote:
| What's disturbing to me, if this is true, is the number of "top
| voices" in ML on ML-Twitter who were head over heels about this paper.
| How many of them actually read it at all?
| abecedarius wrote:
| One thing many people are missing: the simple gzip-for-text-
| classification hack is not the contribution of this paper. (They
| reference the standard intro AI textbook for that hack.) The
| contribution is to use the gzip numbers _together_ with
| k-nearest-neighbors.
|
| In section 6.2 they compare gzip-distance+kNN vs. gzip-distance
| on its own on four problems: it was better on two, and worse on
| two others.
|
| Another bit of background I guess is worth saying: language
| models are pretrained with a compression objective. That is, the
| loss function in pretraining is the cross entropy of the input
| text, which means "minimize the compressed length of this input
| if you fed it to this LM driving an arithmetic coder".
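|
| In toy form (my own sketch, nothing from the paper): summing
| -log2 p(token | context) over a text gives roughly the number of
| bits an arithmetic coder driven by that model would spend on it,
| so a lower pretraining loss literally means better compression.
|
|     import math
|
|     def compressed_bits(token_probs):
|         """token_probs: the model's probability for each actual next token."""
|         return sum(-math.log2(p) for p in token_probs)
|
|     # a model that is 90% sure of each of 100 tokens -> ~15.2 bits total
|     print(compressed_bits([0.9] * 100))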
| fnands wrote:
| Just a note: your blog seems to be stuck in 2022. Date of post is
| 17 July 2022
| ks2048 wrote:
| Thanks, should be fixed in a minute. That's what I get for
| writing dates by hand...
| bjord wrote:
| If this is true, I'm looking forward to seeing how all the people
| who made grandiose statements about that paper now quietly scrub
| them.
|
| LinkedIn and Twitter influencers, I'm looking at you in
| particular.
|
| If it's not true, I guess I'll be the one looking stupid--I only
| skimmed the article.
| AbrahamParangi wrote:
| Probably disappointing to the authors but an excellent rebuttal.
|
| This is the sort of mistake that's awfully easy to make in ML.
| The unfortunate thing about the field is that subtle
| methodological errors often cause subtle failures rather than
| catastrophic failures as we're used to in many other branches of
| engineering or science. You can easily make a system slightly
| worse or slightly better by contaminating your training set with
| bad data or accidentally leaking some information about your
| target and the ML system will take it in stride (but with
| slightly contaminated results).
|
| This result makes sense to me because as much as I would like it
| to be true, applying existing compression algorithms to ML feels
| like too much of a "free lunch". If there was any special magic
| happening in compression algorithms we'd use compression
| algorithms as encoders instead of using transformers as
| compressors.
| godelski wrote:
| > This is the sort of mistake that's awfully easy to make in
| ML.
|
| It is important to remember this! Mistakes are common because
| they are easy to make. Science is a noisy process, but there is
| signal there and what we see here is exactly what peer review
| is about. I tend to argue that open publications are a better
| form of peer review than conferences/journals because of
| exactly this. Peer review is about your peers reviewing your
| work, less about whatever random and noisy standard a
| conference/journal puts forward. Remember that this was the way
| things happened for most of our history and that our modern
| notion of peer review is very recent (mid 70's). Older journals
| were more about accomplishing the mission that arxiv
| accomplishes today: disseminating works.
|
| https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...
|
| [side note] another reason I'd advocate for the abolishment of
| conferences/journals is that through this we can actively
| advocate for reproduction papers, failure papers, and many
| other important aspects since we would not be held to the
| "novelty" criteria (almost everything is incremental).
| "Publishing" is about communicating your work to your peers and
| having them validate or invalidate your results.
|
| [edit] I think conferences are good in the fact that they bring
| people together and that encourages collaboration. That's
| great. But I should clarify that I'm specifically talking about
| using these platforms as a means to judge the validity of
| works. If a conference system wants to just invite works and
| the community, then I'm totally cool with that. I do also like
| journals in theory given that there's a conversation happening
| between authors and reviewers, but I believe this also could
| just easily be accomplished through arxiv + github or
| OpenReview (preferred).
| TX81Z wrote:
| Academic research code is largely dogshit written as quickly as
| possible by amateurs, barely tested whatsoever, and the primary
| intended output of all such code is accumulating paper
| citations.
|
| A world with half as many scientific papers and twice as much
| care would produce far more value but the whole enterprise is
| hopelessly gamified.
| jsight wrote:
| > The unfortunate thing about the field is that subtle
| methodological errors often cause subtle failures rather than
| catastrophic failures as we're used to in many other branches
| of engineering or science.
|
| I've been doing a lot of studying in the ML field lately, and
| I'm seeing this a lot. It is just another thing that feels like
| the polar opposite of everything else that I've done as a
| software engineer.
|
| Miss a semicolon? Instant error.
|
| Miscalculate some grads on one out of three layers? Every now
| and then it might even work! But the results will be weird.
| godelski wrote:
| How about this one: tune your hyper-parameters based on the
| results on your test data.
|
| This is pervasive, even the norm, but it is a form of
| information leakage. You're passing information about the
| test dataset to the model. The solution to this is to use 3
| partitions: train, validation, test. Validation is for HP
| tuning (you can do cross-validation btw) and test is a single
| shot.
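|
| A minimal sketch of that three-way split (assumes scikit-learn; the
| dataset is just a stand-in):
|
|     from sklearn.datasets import load_iris
|     from sklearn.model_selection import train_test_split
|
|     X, y = load_iris(return_X_y=True)
|     # 60% train, 20% validation (for HP tuning), 20% held-out test
|     X_train, X_tmp, y_train, y_tmp = train_test_split(
|         X, y, test_size=0.4, random_state=0)
|     X_val, X_test, y_val, y_test = train_test_split(
|         X_tmp, y_tmp, test_size=0.5, random_state=0)
|
|     # tune hyper-parameters against (X_val, y_val),
|     # then evaluate on (X_test, y_test) exactly once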
| jsight wrote:
| Yep, I've been guilty of that one lately. That and solving
| problems by simply overfitting a neural net to the data in
| the problem domain.
|
| I mean, it works, but the result is less interesting than
| what I should have done. :)
| eyegor wrote:
| What about: add more dropout or noise layers and train an
| ensemble of models. Submit the best one. Is this
| considered dirty?
| godelski wrote:
| Definitely. Problem is that doing this helps you get
| published, not hurts. I think this is why there's often
| confusion when industry tries to use academic models, as
| they don't generalize well due to this overfitting. But
| also, evaluation is fucking hard, and there's just no way
| around that. Trying to make it easy (i.e. benchmarkism)
| just ends up creating more noise instead of the intended
| decrease.
| iamflimflam1 wrote:
| It's true in many experiments. The desire to get the result you
| want can often overwhelm the need to validate what you are
| getting.
|
| Especially true when the results confirm any pre-existing
| thinking you may have.
| mattsan wrote:
| Yep, confirmation bias. Luckily that's helped by peer review!
| catgary wrote:
| Hasn't this paper made it through peer review?
| Karellen wrote:
| I suspect GP commenter meant "replication study" rather
| than "peer review".
|
| ;-)
|
| (Peer review doesn't check if your data is correct. They
| check your data collection methods make sense given the
| hypothesis you're testing, and that your conclusions are
| supported by the data you collected.)
| thomasahle wrote:
| Yeah it was published at ACL (
| https://aclanthology.org/2023.findings-acl.426/ ) which
| is one of the most prestigious conferences in NLP. So
| kinda disappointing.
|
| But paper reviewers are usually not supposed to look at
| the actual source code of the papers, and definitely
| don't try to reproduce the results. They just read the
| paper itself, which of course doesn't talk about the
| error.
|
| Not sure what the best solution is, other than having the
| most "hyped" papers double verified by researchers on
| Twitter.
| godelski wrote:
| > paper reviewers are usually not supposed to look at the
| actual source code of the papers
|
| Wait what? I haven't reviewed for ACL but most
| conferences don't say "don't look at the source code."
| They will say that reviewers are not required to look at
| it (as well as the appendix). But generally it just isn't
| uploaded. I do always look at the main method when it is
| there but talking to my peers and advisor, this is very
| uncommon[0]. My experience is that most reviewers do not
| spend more than an hour on a work and make an opinion
| within 15 minutes.
|
| > Not sure what the best solution is, other than having
| the most "hyped" papers double verified by researchers on
| Twitter.
|
| I'd say (as a start):
|
| 1) Get rid of the conference system. A zero-shot (maybe
| 1-shot if "rebuttal" is allowed) zero-sum system is just
| disastrous, especially at scale. There's high incentives
| to actually reject works you review for. A conference
| system has a binary outcome and the purpose is to reject
| 80% of papers based on a rather noisy metric of "top
| tier." A journal system is a back and forth where
| reviewers are trying to improve the paper. The purpose of
| the reviewers here is to determine if the idea is indeed
| good, and then if the paper meets the requirements or not
| and they must explicitly state what needs to be changed for
| acceptance.
|
| 1.5) An actual rebuttal system could help alleviate some
| of these issues. Using OpenReview for a conversation
| between authors and reviewers is critical. A singular 1
| page response (the norm) is not adequate to respond to 4
| different people who often have low similarities in
| responses. Reviewers are allowed (though it breaks
| guidelines) to respond in one sentence.
|
| 2) ACs need to do a better job at validating reviewers.
| The number of inane and absolutely unacceptable
| reviews I have gotten is astounding (>25%). I've also
| seen reviewers often break guidelines and have nothing
| happen. Examples are comments such as those claiming lack
| of novelty with no explanation or asking authors to
| compare to concurrent works (I've had this happen for a
| work that was put out _after_ submission deadlines. Not
| mine, but here's an example[1] of this being done
| publicly). If the reviewer is pushed to update their
| comment then the authors have no ability to respond to
| their update without the conversation aspect. If there is
| high variance in response -- not just scores, but what
| the critiques are about -- then the ACs need to look
| closer as something is going wrong. We're in a crisis for
| reviewers but we also have an undisclosed crisis in
| quality of reviewers. Benchmarkism is on the rise but
| benchmarks are extremely limiting for evaluation. There's
| a certain irony given our frequent discussion of
| Goodhart's Law or Reward Hacking. I'll even make the
| claim that the quality crisis influences the quantity
| crisis as I have seen many peers stop reviewing because
| it isn't worth their time and they aren't getting a fair
| shot in return. On a personal note, there is a journal I
| will no longer review for because of inactionable and
| unreasonable responses, but I also won't submit to them
| either.
|
| 3) Either get rid of double-blind, or actually do it.
| Everything is published on arxiv these days, which in
| general is great for the community as it allows things to
| move fast. But with this it is incredibly easy to de-
| anonymize authors. Though for big labs, they de-anonymize
| themselves actively[2]. In a very noisy process even a
| very slight edge becomes a significant edge[3]. These
| biases can even come unconsciously given that we're all
| reading arxiv papers constantly and it isn't unlikely
| that we come across some of the works we end up reviewing
| (yet to knowingly happen to me fwiw). But certain labs do
| have keywords that they use that can be identified.
|
| I think one of the major problems comes down to this: in
| a small community we have a certain level of
| accountability, as we all end up knowing one another
| through minimal connections. But in a large community
| there is little to no accountability and what depends on
| good faith can no longer be trusted. This encourages bad
| actors, especially when the system is highly competitive
| (see 1)), and creates bad science/evaluation creep. (e.g.
| now standard to tune HPs on test data results -- this is
| information leakage. If you don't, you likely can't
| compete).
|
| ======
|
| [0] Here's a prominent researcher explicitly saying they
| don't read the appendix, calling it trash, and a poll
| showing most people don't look at it https://twitter.com/
| david_picard/status/1660293648796340226
|
| [1] Here's a prominent researcher criticizing a paper for
| "not citing his work". I linked the top response which is
| telling him the submission date was 2 months prior to his
| arxiv release. This is someone who published >250 papers
| vs someone with <50. For added reference, paper 2
| (prominent researcher) was _published_ June 26th in TMLR,
| but they did cite the other work (gotta give credit for
| that)
| https://twitter.com/RinonGal/status/1667943354670170118
|
| [2] We have 2 scenarios here: either 1) reviewers do not
| know Chinchilla == DeepMind, where I'd argue that they are
| unfit for reviewing given the prominence of that model, or
| 2) they do know, and thus know this is a DeepMind work,
| and we have an ethics problem. Neither sounds great. https
| ://openreview.net/forum?id=OpzV3lp3IMC&noteId=HXmrWV3ln..
| .
|
| [3] The conclusion of this analysis of the consistency
| experiment is that even a small amount of inconsistency
| leads to a lot of noise given a highly selective
| standard. Which means that paper acceptance itself is
| highly stochastic: (2014 experiment)
| https://inverseprobability.com/talks/notes/the-neurips-
| exper...
|
| [3.1] A shorter version:
| https://blog.mrtz.org/2014/12/15/the-nips-experiment.html
|
| [3.2] A follow-up on the 2014 experiment, tl;dr: reviewers
| are good at identifying bad papers, but not good at
| identifying good papers (i.e. bias to reject):
| https://arxiv.org/abs/2109.09774
|
| [3.3] A follow-up 2021 experiment (consistent with 2014
| experiment): https://blog.neurips.cc/2021/12/08/the-
| neurips-2021-consiste...
|
| [3.4] Video form
| https://www.youtube.com/watch?v=19Q-vMd9bYg
| YeGoblynQueenne wrote:
| I try not to submit to conferences if I can avoid it.
| It's like you say, reviewers are looking for a reason to
| reject. I don't understand what makes the difference
| since it's usually the same people reviewing in both
| conferences and journals, but somehow journal reviewers
| do a much better job. Some journals have a fast
| turnaround even, and still the quality of reviewing is
| considerably better.
|
| My second journal paper got rejected with encouragement
| to resubmit. Half the reason for that was because the
| reviewer had, I think, genuinely misunderstood the
| description of an experiment, so I re-wrote it in
| painstaking detail. I had a long section where I hammered
| out a proof of complexity spanning three pages, with four
| lemmas and a theorem, and the reviewer waded through all
| that like a hero, and caught errors and made
| recommendations for improvement. They made a new round of
| recommendations when I resubmitted. That paper took three
| rounds of revisions to publish (reject, resubmit, accept
| with minor revisions) but it got 80% better every time I
| had to revise it. I wish there was another couple of
| rounds! It was exhausting, and I bet much more so to the
| reviewer, but it was 100% worth it.
|
| And yeah, I absolutely do my best to review like that
| myself. Even in conferences, which probably seems really
| weird to authors. But, hey, be the change you want to
| see.
| godelski wrote:
| Yeah, honestly the only reason I submit to conferences
| now is because my advisor asks me to. If it was up to me
| I would submit exclusively to journals or just to
| arxiv/open review directly. I think I'll do this when I
| graduate (soon).
|
| As for the reason why it happens in conferences, I think
| it may actually be a different set of reviewers. While
| journal reviewers are going to be conference reviewers, I
| don't think the other way around is true. I think
| conferences tend to just have a larger number of shitty
| reviewers (as well as more shitty submissions). And as
| you note, it is quite easy to misunderstand a work,
| doubly so when you're reading under a time constraint. It
| just makes for a noisy process, especially when reviewers
| view their job as to reject (not improve). I just think
| it is a bad system with a bad premise that can't really
| be fixed. For conference reviewing, I always try to write
| what would change my mind and if I think the authors
| should resubmit to another venue. But even when reviewing, I
| don't feel authors get a fair shot at responding. They
| can't address all my comments while addressing others in
| a single page.
|
| Edit: I saw your bio. I actually have a SOTA work that has been
| rejected (twice). Good performance jump with large
| parameter drop. But just couldn't tune or run enough
| datasets because compute limited. Conferences are fun.
| catgary wrote:
| Yeah, it's not (entirely) the students' faults that this
| slipped through peer review. I don't envy the whiplash
| they're going to experience over the next few weeks.
|
| If I was the graduate chair of their department I might
| schedule a meeting with their supervisor to sort out how
| this happened.
| westurner wrote:
| What about the difference in CPU cost, RAM cost, and GPU
| training hours, though? What about the comparative
| Big-E's of the models?
|
| Great topic: Model minification and algorithmic complexity
| mananaysiempre wrote:
| One particular example that I remember from an introductory
| particle physics class is the History Plots section[1] of the
| biennial review of experimental data.
|
| Knowing these quantities is important, but their particular
| values largely aren't; nobody's funding or career really
| depended on them being equal to one thing or another. Yet
| look at all the jumps, where the measurements after the
| initial very rough ones got stuck in the completely wrong
| place until the jump to the right value--when it happened--
| was of a completely implausible magnitude, like four, six, or
| ten sigma.
|
| [1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-
| plots.p...
| godelski wrote:
| What's also good to see here is that the post '90 numbers
| usually don't even fall within the error bars of the pre
| '90 numbers. While uncertainty is great, it isn't the end
| all. I think a lot of people forget how difficult
| evaluation actually is. Usually we just look at one or two
| metrics and judge based on that, but such an evaluation is
| incredibly naive. Metrics and measures are only guides,
| they do not provide certainty nor targets.
| BSEdlMMldESB wrote:
| now shift fields such that the subtle methodological errors
| don't come to light in 20 years.
|
| which field are you in now? economics!? hahaha
| refulgentis wrote:
| Really happy to see this: KNN + classification task + doing
| classification that's based on pure text similarity is a recipe
| for stacked results.
|
| Schadenfreude responses to this paper misunderstand that the
| natural language stuff is crucially important for embeddings:
| sure, phrases that share words will classify well and GZIP well,
| so GZIP can be used as ersatz classification.
|
| The miracle of BERT / embeddings is _not_ having to share words:
| for instance, "what is my safe passcode?" has a strong match with
| "my lockbox pin is 1234", but not "my jewelry is stored safely in
| the safe".
|
| This is also an important thing to consider with LLMs: people are
| using embeddings intended to do text similarity, whereas you want
| to use an SBERT model, one trained to correlate a question to
| a document that will answer it.
|
| https://www.sbert.net/ for the full rabbit hole.
|
| Previously: Should you use OpenAI's embeddings? Probably not, and
| here's why. https://iamnotarobot.substack.com/p/should-you-use-
| openais-e....
|
| HN discussion: https://news.ycombinator.com/item?id=35377935
| jabbany wrote:
| > The miracle of BERT / embeddings is _not_ having to share
| words
|
| To be fair, the original task is specifically chosen to be one where
| something like knn+compression has a chance of being good: i.e.
| out of domain + low resource.
|
| Under these conditions the training inputs could be too sparse
| for a highly parameterized model to learn good embeddings from.
|
| In traditional in domain + big data classification settings
| there's no chance that non-parametric methods like compression
| would beat a learned representation.
| renewiltord wrote:
| Great job replicating and fixing. So easy to accidentally create
| results that are statistical artifacts.
| netdur wrote:
| I have hacked together a JavaScript port
| https://gist.github.com/netdur/a777f75fb70e0abc19c407c2ff7f9...
|
| and it seems to work!!! regardless
|
| Best matched answer for 'How do I start a new project?' is 'To
| create a new project, go to the dashboard and click on 'New
| Project'. Then, fill out the details and click 'Save'.'
|
| Best matched answer for 'How can I delegate tasks to others in my
| team?' is 'To invite team members to your project, open the
| project and click on 'Invite Members'. Enter their email
| addresses and click 'Send Invites'.'
|
| Best matched answer for 'What do I do when I finish a task?' is
| 'When a task is completed, it should be marked as 'Complete'. It
| will then be moved to the 'Completed Tasks' section.'
| chaxor wrote:
| I appreciate that this blog post was made.
|
| This is like so, so many little projects that I do (even
| specifically showing problems in papers like this) that never see
| the light of day. I usually just make a small noise, and then it
| sits on my hard drive and that's it.
|
| So thank you for putting this out there.
| thomasahle wrote:
| I've started using Twitter as a "lower effort blogging"
| platform. After I spend a day on something like this, I'm
| usually too tired to actually write a blog post about it, which
| feels like a waste. But then writing a small Twitter thread is
| usually within my capabilities.
| snowstormsun wrote:
| So, couldn't it be that the authors of the paper ran the model
| with a random tie break and got lucky? This blog post seems to
| assume they had the "rand" flag deactivated. Please correct me if
| I am wrong.
| expensive_news wrote:
| From what I understand from the post, getting lucky enough to see
| that big of a change in this situation would be like getting
| 1000 heads in a row. It's not luck you could ever expect to
| get.
| snowstormsun wrote:
| I see
| usgroup wrote:
| It wasn't obvious to me why the authors chose kNN for the
| classifier. Since they produce a distance matrix, they could have
| used multi-dimensional scaling to convert the matrix to factors,
| and then used a tree algorithm such as xgboost which would likely
| make use of more information than kNN and produce significantly
| better results. They could have also used a PAQ compression
| algorithm, which is much better than the LZ compressors -- all of
| which could have significantly improved the results and possibly
| delivered on their original conclusions.
|
| What I liked about the subject paper is that the compression
| algorithm is abstracted away and it led me to consider what else
| one could do with compression via the p(x) ~ K^(-|x|)
| relationship, where K is the alphabet size and |x| is the length
| of the string x, and assuming optimal coding.
|
| For example it occurred to me that one could do traditional
| classification by packing the factors for each response into
| separate documents and then proceeding as the paper does to
| classify which document best compresses the next sample in order
| to determine its class: a sort of supervised classification using
| compression algorithms. The closer the compressor is to an
| optimal code for the dataset, the better it will work.
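|
| A rough sketch of that scheme (my own illustration; note gzip's 32 KB
| window means only the tail of a large class document really matters):
|
|     import gzip
|
|     def clen(s: str) -> int:
|         return len(gzip.compress(s.encode("utf-8")))
|
|     def classify(sample: str, class_docs: dict) -> str:
|         # class_docs: label -> all training texts of that class, concatenated
|         return min(class_docs,
|                    key=lambda c: clen(class_docs[c] + " " + sample) - clen(class_docs[c]))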
|
| Schemes for sequence prediction are equally straightforward to
| implement.
|
| I found it to be a pleasant surprise.
| antonoo wrote:
| My take on this:
|
| https://twitter.com/antonosika/status/1679423272541192196?s=...
|
| Regardless, great work digging into the code, and great work by
| authors publishing the code.
| softwaredoug wrote:
| When we did search relevance experimentation at Shopify we made
| lots of mistakes. I can empathize with the authors. I've had a
| lot of my own public screw ups.
|
| At the end of my time at Shopify I learned good science requires
| good software engineering. It's really easy to make mistakes at
| so many places in the stack.
|
| We spent a lot of time on creating rigorous, heavily tested and
| high quality software for our experiments so we could trust our
| numbers and reproduce each others experiments. We tried to
| discourage one-off evaluation methods, but if we created a new
| one, to add it to our suite and test the metric to understand
| what it meant.
|
| It seems obvious, but sadly not as common as I wish it were in my
| experience with this kind of experimentation. Companies want
| velocity, and thinking deeply about statistics and building
| internal tools are not in the interest of most higher-ups.
| skrebbel wrote:
| Can anyone explain to me how a compression algorithm can beat an
| LLM at anything? Isn't that like saying horses are better than
| graffiti?
|
| I'm sure the answer is in there somewhere but I'm not well versed
| in AI and I simply can't figure it out.
| ozr wrote:
| The intuition about how the gzip method works goes like so:
|
| If you compress `ABC`, it will be X bytes. If you then compress
| `ABCABC`, it will _not_ take 2x bytes. The more similar the two
| strings that you concatenate, the less bytes it will take.
| `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
|
| BERT is, by todays standards, a very small LLM, which we know
| has weaker performance than the billion-param scale models most
| of us are interacting with today.
| jabbany wrote:
| > very small LLM
|
| Heh. So does that make it a MLM (medium)?
|
| I've always found it funny that we've settled on a term for a
| class of models that has a size claim... Especially given how
| fast things are evolving...
| stuartaxelowen wrote:
| Many other replies here are wrong - the primary reason is that
| the LLMs were used on completely out of distribution data (e.g.
| trained on English, evaluated on completely different language
| that shared some characters). The points about compression's
| relatedness to understanding are valid, but they are not the
| primary reason for LLMs underperforming relative to naive
| compression.
| ks2048 wrote:
| It's a very limited task: take a document and classify it into
| one of (for example) 10 or so categories. Things like detecting
| certain words can do pretty well in some cases. Things
| compress well together when they share common substrings.
| kachnuv_ocasek wrote:
| One way to interpret/understand language models is as quite
| involved compression algorithms.
| refulgentis wrote:
| Other reply is great, more in-depth on details from me here:
| https://news.ycombinator.com/item?id=36758681
|
| Plain english TL;DR:
|
| - if you limit your task to binning snippets of text
|
| - and the snippets are very well-defined (ex. code vs. Filipino
| text)
|
| - the snippets' _bytes_ could be compared and score well, no
| text understanding needed
|
| - size delta of a GZIP after adding one more sentence acts as
| an ersatz way to compare sets of bytes to each other (ex. you
| can imagine a GZIP containing 0xFFABCDEF that has 0xFFABCDEF
| added to it will have a size delta of 0)
| fsmv wrote:
| Did you read the recent Douglas Hofstadter article or do you
| just always use the word ersatz?
| refulgentis wrote:
| I was homeschooled till 12 and mostly left to my own
| devices as long as it was reading - I believe that has
| caused a lifelong issue where I sound like a tryhard
| unintentionally :( (TL;Dr I use it but IDK when, didn't see
| the Hofstadter article but now I'm looking forward to it)
| stavros wrote:
| You can substitute the word "substitute" as an ersatz
| "ersatz".
| refulgentis wrote:
| It's honestly weird and annoying and I'd give it up in a
| second.
|
| There's two issues:
|
| - I don't have an ear for what's simple vocabulary versus
| tryhard, I go into a mad loop when I try
|
| - even if I actively notice it, substitution can seem
| very far away from intent. Simple wouldn't have occurred
| to me - I wanted to say something more akin to sloppy /
| stunt and ersatz is much closer to "hacky" in meaning
| than simple. Think MacGyver.
|
| But I should do the exercise of at least scanning for
| words more often and aim for wider audience - I would
| have known ersatz was an outlier and I shouldn't feel
| it's condescending or diluting meaning, it's broadening
| the audience who can parse it
| inimino wrote:
| Why would you apologize for your vocabulary and try to
| sound like someone less well-read than you are? Just get
| over it and be yourself.
| stavros wrote:
| Eh, it's fine, it doesn't sound tryhard to me, just a bit
| hard to read.
| matthewdgreen wrote:
| The word ersatz is great, and conveys the notion that the
| replacement is simpler and possibly inferior when
| compared across all features. "Substitute" doesn't cut
| it. Human language (ironically w.r.t. TFA) isn't a
| collection of redundant symbols, the synonyms carry all
| sorts of useful nuance.
| tonyg wrote:
| It doesn't sound tryhard; it sounds literate.
| ajtulloch wrote:
| https://www.inference.org.uk/itprnn/book.pdf is a classic text
| on this connection.
| optimalsolver wrote:
| Compression is equivalent to intelligence:
|
| https://mattmahoney.net/dc/rationale.html
| awegio wrote:
| A language model estimates the probability of a sequence of
| words P(w_1, ..., w_n) or equivalently P(word | context).
|
| For compression, word sequences that have higher probability
| should be encoded with shorter codes, so there is a direct
| relationship. A well known method to construct such codes based
| on probabilities is Huffman coding.
|
| This works whether you use a statistical language model using
| word frequencies or an LLM to estimate probabilities. The
| better your language model (lower perplexity) the shorter the
| compressed output will be.
|
| Conversely, you can probably argue that a compression algorithm
| implicitly defines a language model by the code lengths, e.g.,
| it assumes duplicate strings are more likely than random noise.
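|
| A minimal Huffman construction makes the link concrete (a sketch, not
| a full compressor): the more probable a symbol, the shorter its code.
|
|     import heapq, itertools
|
|     def huffman(probs):
|         """probs: dict symbol -> probability. Returns dict symbol -> bit string."""
|         tie = itertools.count()  # tie-breaker so heapq never compares dicts
|         heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
|         heapq.heapify(heap)
|         while len(heap) > 1:
|             p1, _, c1 = heapq.heappop(heap)
|             p2, _, c2 = heapq.heappop(heap)
|             merged = {s: "0" + code for s, code in c1.items()}
|             merged.update({s: "1" + code for s, code in c2.items()})
|             heapq.heappush(heap, (p1 + p2, next(tie), merged))
|         return heap[0][2]
|
|     print(huffman({"the": 0.5, "cat": 0.25, "sat": 0.125, "zyzzyva": 0.125}))
|     # -> code lengths 1, 2, 3, 3 bits: exactly -log2(p) for these probabilities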
| GuB-42 wrote:
| Generally, compression = model + entropy coding.
|
| The model's job is to predict what comes next. The entropy
| coder's job is to encode the difference between the prediction
| and what actually comes next so that the most likely outcome
| uses as few bits as possible. The more accurate the model is,
| the less the difference between reality and prediction, the
| fewer bits the entropy coder needs and the better the
| compression.
|
| Simple compression algorithms have simple models, like "if I
| see the same byte 10 times, the 11th is likely to be the same".
| But you can also use a LLM as your model, as completing text
| with the most likely word is what LLMs do.
|
| Here they did the opposite. Instead of using a model for
| compression, by using a few tricks, they used a compression
| algorithm as a model: the most likely outcome is when the
| compression algorithm uses less bits to encode the result. And
| the original authors have shown that, in some tasks, the simple
| model that can be extracted out of gzip beats much more complex
| LLMs.
| contravariant wrote:
| Generally, compression algorithms try to give structured data a
| distribution more similar to random data.
|
| If any byte sequence is a correct file (unlikely, but mostly
| because compression algorithms try to be robust against
| corruption), then this is easy to reverse, you just generate
| a random sequence of bytes and then decompress it.
|
| Basically you can turn a compression algorithm into a
| probability distribution by inserting random bytes wherever
| the decompression algorithm tries to read one, but sometimes
| not _all_ bytes are allowed.
|
| You can then reason about this probability distribution and
| see what its properties are. Typically something with a
| probability of 'p' will require -log(p)/log(2) bits.
| hoosieree wrote:
| I almost feel like compression and embeddings are duals of
| each other, but I can't quite articulate it.
|
| Embeddings use fixed-size vectors to minimize the dot product
| between vectors of similar inputs. Compressors use a
| variable-length encoding to minimize the overall stored size.
| IKantRead wrote:
| LLMs and essentially all neural networks can be viewed as
| learned compression algorithms, where the behavior of the
| compressor is learned and subject to potential
| constraints beyond mere file reconstruction.
|
| Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG
| of the Web"[0] to get a better sense of this.
|
| Keeping this fact in your mental model of neural networks can
| also go a long way toward demystifying them.
|
| 0. https://www.newyorker.com/tech/annals-of-
| technology/chatgpt-...
| FeepingCreature wrote:
| (The human brain is also, in part, a blurry JPEG of the
| world.)
| IKantRead wrote:
| We currently have no reason to believe this, and
| information we do have seems to suggest that is very
| unlikely to be the case. I'm also guessing from my username
| you can infer that I don't think we even know enough to
| concretely say what is this "world" you are referencing.
| og_kalu wrote:
| I don't know what exactly blurry jpeg means to you but we
| have every reason to believe we operate on shortcuts of
| reality, not reality. Nearly all your brain does with
| sense data is warp it to conform to internal predictions
| in numerous ways.
|
| Memories are always part fabrications. You can't return
| to previous mental states (you only think you do) and we
| have no real clue what really informs decisions i.e
| preferences shape choices just as much as choices shape
| preferences.
|
| Your brain will happily fabricate rationals you sincerely
| believe for decision that couldn't possibly be true i.e
| split brain experiments
| AtNightWeCode wrote:
| In current times *zip mostly cripples things like the web and
| Docker. Finding out that it is the best at something in 2023 is
| not very likely.
|
| Nice find btw.
___________________________________________________________________
(page generated 2023-07-17 23:01 UTC)