[HN Gopher] Gzip beats BERT? Part 2: dataset issues, improved sp...
___________________________________________________________________
Gzip beats BERT? Part 2: dataset issues, improved speed, and
results
Author : JoeyBananas
Score : 170 points
Date : 2023-07-29 16:04 UTC (6 hours ago)
(HTM) web link (kenschutte.com)
(TXT) w3m dump (kenschutte.com)
| beefman wrote:
| Part 1 discussed here:
| https://news.ycombinator.com/item?id=36758433
| dang wrote:
| Thanks! Macroexpanded:
|
| _Bad numbers in the "gzip beats BERT" paper?_ -
| https://news.ycombinator.com/item?id=36758433 - July 2023 (128
| comments)
| __vec__ wrote:
| Has anyone recreated all of the results with un-"contaminated"
| datasets?
| ks2048 wrote:
| This is my blog post, if anyone has any questions.
|
| I'll add that since I wrote these two blog posts, other people
| have sent me their other interesting work:
|
| (1) I link to this at the end of the post (using zstd
| dictionaries): https://github.com/cyrilou242/ftcc
|
| (2) today someone sent me this (bag-of-words better than gzip):
| https://arxiv.org/abs/2307.15002v1
| p1esk wrote:
| Your conclusion: "using ideas from text compression for text
| classification tasks is an interesting idea and may lead to
| other interesting research."
|
| Would you say this idea is interesting enough for you
| personally to research it further?
| ks2048 wrote:
| For me, no. Mainly because "text classification" is a pretty
| limited application and one I don't plan to spend much time
| on. For NLP tasks that require a deeper "understanding", I
| don't see how compression algorithms can help much (at least
| _directly_ ).
| nico wrote:
| Just conceptually, compression is an analog of
| understanding
|
| To be able to compress something, you need to understand it
| first
|
| We use this every day: we compress things by naming them
|
| Once we name something, we don't need to explain or
| describe, we can just use the name instead
|
| That allows us to compress our communications, and it
| directly affects the parties' understanding of the
| information
|
| That's just conceptually. At a math/algorithm level I don't
| really know the specifics of your research or the paper in
| question
| ChainOfFools wrote:
| It may sound strange out of context, but the most
| memorable quote I've encountered in any book or any piece
| of writing anywhere, at least in terms of informing my
| own understanding of language and the construction of
| meaning through communication, came in a book on
| screenwriting by William Goldman. The guy who wrote The
| Princess Bride, of all things.
|
| The sentence was simply (and in capitals in the
| original): "POETRY IS COMPRESSION."
| quickthrower2 wrote:
| Would make a good haiku line 2
| ks2048 wrote:
| Yes, I agree. That's why I said _directly_ (with regards
| to compression algorithms used for understanding).
| _Indirectly_ , yes, compression and
| intelligence/understanding are closely related.
| mannykannot wrote:
| One could say that you need to understand _something_
| about the artifact you are compressing, but, to be clear,
| you can compress text without understanding anything
| about its semantic content, and this is what gzip does.
| The only understanding needed for that level of
| compression is that the thing to be compressed is a
| string in a binary alphabet.
| joshuamorton wrote:
| Of course, which is why gzip is a good baseline for
| "better" compressors that do have semantic understanding.
|
| The whole idea of an autoencoder _is_ conceptual
| compression. You take a concept (say: human faces) and
| create a compressor that is so overfit to that concept
| that when given complete gobbledygook (random seed data)
| it decompresses it to something with semantic meaning!
| refulgentis wrote:
| Text-similarity embeddings aren't very interesting and will
| correlate with gzip, especially when the task being tested is
| text similarity, and especially when the classes have distinct
| vocabularies.
|
| The really useful ones are based on SBERT, and measure the
| likelihood that the answer is contained in the text that was
| embedded.
|
| ex. from my unit tests: "what is my safe passcode?" has a
| strong match with "my lockbox pin is 1234", but a very weak
| match to 'my jewelry is stored safely in the safe'
|
| I learned this from
| https://news.ycombinator.com/item?id=35377935 - thank you to
| whoever posted it; it blew my mind and gave me a powerful
| differentiator.
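|
| A minimal sketch of that kind of check, assuming the
| sentence-transformers library and a QA-retrieval model (the model
| name and sample sentences here are illustrative choices, not the
| exact setup described above):
|
|     from sentence_transformers import SentenceTransformer, util
|
|     # a model trained for question/answer retrieval, not plain similarity
|     model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
|
|     query = "what is my safe passcode?"
|     candidates = [
|         "my lockbox pin is 1234",
|         "my jewelry is stored safely in the safe",
|     ]
|
|     q_emb = model.encode(query, convert_to_tensor=True)
|     c_emb = model.encode(candidates, convert_to_tensor=True)
|
|     # cosine similarity of the query against each candidate sentence
|     for text, score in zip(candidates, util.cos_sim(q_emb, c_emb)[0]):
|         print(f"{float(score):.3f}  {text}")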
| cs702 wrote:
| No questions from me. Just want to say: Thank you for doing all
| this work!
| phyzome wrote:
| I'm idly curious how much of a speedup you achieved.
| ks2048 wrote:
| I don't have complete numbers on this (I think it depends a
| lot on the size of training set), but for one dataset,
| normalized time for a batch:
|
|     original    : 1.000
|     precomputed : 0.644  (first improvement)
|     gziplength  : 0.428  (+ 2nd improvement)
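|
| Roughly, the "precomputed" improvement caches the gzip-compressed
| length of every training text once instead of recomputing it for
| each test item. A minimal sketch of that idea (not the exact code
| from the blog; placeholder data):
|
|     import gzip
|
|     def clen(s: str) -> int:
|         # compressed length of a string under gzip
|         return len(gzip.compress(s.encode("utf-8")))
|
|     def ncd_to_train(test_text, train_texts, train_clens):
|         # train_clens is computed once for the training set and reused
|         c_test = clen(test_text)
|         dists = []
|         for train_text, c_train in zip(train_texts, train_clens):
|             c_pair = clen(test_text + " " + train_text)
|             dists.append((c_pair - min(c_test, c_train)) / max(c_test, c_train))
|         return dists
|
|     train_texts = ["some training text", "another example"]  # placeholders
|     train_clens = [clen(t) for t in train_texts]              # cached lengths
|     print(ncd_to_train("a test sentence", train_texts, train_clens))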
| godelski wrote:
| I really think that the numbers were inflated because the
| prolific benchmarkism that goes on in ML. Basically if you don't
| beat SOTA, you don't get published. Usually you need SOTA on
| MULTIPLE datasets. Which is is problematic, because plenty of non
| SOTA methods are useful (forget novel). Given the results
| Ken/ks2048 calculated, I am pretty confident that the work
| wouldn't have made it in. BUT I think the results given the other
| features does make the work quite useful! I agree Ken, that it
| unfairly boosts their work, but I understand why they're bending
| over backwards to defend it. I wish people would just admit
| mistakes but that risks (probably not) losing a paper. This is
| probably the same reason they didn't think to double check the
| suspicious results like the Filipino dataset too (btw, not
| uncommon for datasets to be spoiled people. Always be
| suspicious!).
|
| I'm not trying to give them a pass, but we do need to discuss the
| perverse incentives we've set up that make these kinds of things
| so common. The work should be good on its own, but good doesn't
| mean it'll get published in a journal. And frankly, it doesn't
| matter how many citations your arxiv paper has; people will still
| say "it isn't peer reviewed" and it won't help you get a job,
| graduate, or advance in academia. Which I think we should all
| agree is idiotic, since citations are a form of peer review too.
| lalaland1125 wrote:
| I don't blame them for failing to double check their results.
|
| I blame them for giving obviously incorrect excuses on GitHub
| when such an obvious mistake is pointed out.
|
| There is no way they could be at the stage they claim to be in
| their program (having just defended their thesis) and think the
| excuses they gave on GitHub are reasonable.
| godelski wrote:
| Yeah, I fully agree. They should just admit the mistake
| rather than try to justify it. I was just trying to explain
| the incentive structure around them that encourages this
| behavior. Unfortunately no one gives you points for admitting
| your mistakes (in fact, you risk losing points) and you are
| unlikely to lose points for doubling down on an error.
|
| > There is no way they could be at the stage they claim to be
| in their program (having just defended their thesis) and
| think the excuses they gave on GitHub are reasonable.
|
| Unfortunately it is a very noisy process. I know people from
| top-3 universities who have good publication records and don't
| know probabilities from likelihoods. I know students and
| professors at these universities who think autocasting your
| model to fp16 halves your memory (from fp32) and are confused
| when you explain that that's a theoretical (not practical) lower
| bound. Just the other day someone (who has a PhD from one of
| these universities and is currently a professor!) opened an
| issue on my GitHub expecting me to teach them how to load a
| pretrained model. This is not uncommon.
|
| Goodhart's Law is a bitch.
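|
| On the fp16 point, a minimal check (assuming PyTorch with a CUDA
| device) shows why autocast alone doesn't halve memory: the
| parameters stay in fp32 and only eligible ops run in fp16.
|
|     import torch
|
|     model = torch.nn.Linear(1024, 1024).cuda()
|     x = torch.randn(8, 1024, device="cuda")
|
|     with torch.autocast("cuda", dtype=torch.float16):
|         y = model(x)
|
|     print(model.weight.dtype)  # torch.float32: weights are unchanged
|     print(y.dtype)             # torch.float16: only the activation is cast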
| recov wrote:
| Related, sentdex did a video and implementation on it as well:
| https://www.youtube.com/watch?v=jkdWzvMOPuo
| dekhn wrote:
| This is a masterwork of analysis and improvement of a method.
| birdyrooster wrote:
| There is this sense of deflation of effort in tech right now
| (always?) where, if you can wait just a moment longer to start
| coding, you can adopt something else and save yourself from the
| rat race.
| luc4sdreyer wrote:
| > Scientific research typically has been founded on high ethical
| standards established by researchers in academia and health care
| research institutions. Scientific fraud, an act of deception or
| misrepresentation of one's own work, violates these ethical
| standards.
|
| And according to Ken Schutte:
|
| > this method uses the test label as part of its decision process
| which is not the standard classification setting and can't be
| fairly compared to others that don't.
|
| Can anyone make the case that these two descriptions don't
| overlap? Personally I can't see how the original author can be so
| blasé about this.
|
| [1] https://pubmed.ncbi.nlm.nih.gov/2061524/
| godelski wrote:
| I try to explain in this comment[0]. I agree that this is
| unethical behavior, but we need to also be aware of what
| pressures are encouraging this behavior. I also think Ken is
| romanticizing the standards of science a bit here. This would
| be great, but unfortunately it is not what happens in practice.
| Mostly the failures are unintentional, but there are intentional
| ones too.
|
| [0] https://news.ycombinator.com/item?id=36922708
| codeflo wrote:
| The article has a link[1] to a discussion between the blog author
| and the paper author that I find revealing.
|
| Perhaps as a reminder, the issue is that the paper's
| implementation of their 2-nearest neighbor secretly uses an
| oracle to break ties, which obviously inflates the accuracy
| compared to a real-world kNN classifier that has to choose
| heuristically. To be fair, this could be a weird implementation
| accident and not malice. But I think it does invalidate the
| results.
|
| But rather than admit error, the author _defends_ this choice,
| and does so using (in my opinion) dubious statistical arguments.
| Which leads me to believe that -- at least at this point -- they
| know they made a mistake and just won't admit it.
|
| They claim that instead of a real-world accuracy, they wanted to
| find the "max" accuracy that their classifier was statistically
| capable of. That is, the accuracy you get if the stars happen to
| align and you get the luckiest possible result. Well, not only is
| this creative new metric not described in the paper, it's also
| not applied to the other algorithms. For example, I think a
| neural network is capable of achieving a "max" accuracy of 100%,
| if all the initial weights happen to perfectly encode both the
| training and test sets. But of course they just use standard
| training to give the numbers for those algorithms.
|
| [1] https://github.com/bazingagin/npc_gzip/issues/3
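|
| A toy illustration of the difference (not the paper's actual
| code): with k=2 the two nearest labels often disagree, and a fair
| classifier has to commit to one of them, whereas the "max
| accuracy" variant counts a tie as correct whenever the true label
| appears among the two.
|
|     import random
|
|     def knn2_fair(neighbor_labels):
|         a, b = neighbor_labels
|         # no access to the test label; break ties blindly
|         return a if a == b else random.choice([a, b])
|
|     def knn2_oracle(neighbor_labels, true_label):
|         a, b = neighbor_labels
|         if a == b:
|             return a
|         # tie broken using the test label itself, which inflates accuracy
|         return true_label if true_label in (a, b) else a
|
|     neighbors, truth = ["spam", "ham"], "ham"
|     print(knn2_fair(neighbors))           # right only ~half the time on ties
|     print(knn2_oracle(neighbors, truth))  # always "ham" on this tie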
| ks2048 wrote:
| Well put. Yes, I mention a similar case towards the end of that
| exchange: Consider a random-guess classifier. That has a _max
| accuracy_ of 100%. Clearly, not a useful measure on its own.
| pedrosorio wrote:
| > They claim that instead of a real-world accuracy, they wanted
| to find the "max" accuracy that their classifier was
| statistically capable of
|
| Yeah, I read this on the GitHub issue a week ago and couldn't
| believe it. Ideally, their profile(1) should allow them to
| quickly admit they were wrong on such a simple issue. Pursuit
| of truth and knowledge, etc.
|
| (1) a young PhD from a prestigious university
|
| > For example, I think a neural network is capable of achieving
| a "max" accuracy of 100%
|
| Why reach for such powerful tools? f(x) = random(num_classes)
| achieves 100% "upper bound" accuracy.
| lalaland1125 wrote:
| In academia, it's better to cling to obviously false
| justifications to dismiss criticism and keep a paper accepted
| than to admit fault and potentially be forced to retract.
|
| Publish or perish
| bonzini wrote:
| Retracting is extremely rare in computer science, which is
| why many conferences have instead started "stamping" papers
| whose artifacts provide reproducible results.
| hinkley wrote:
| A couple of AI hype cycles ago, everyone was abuzz about
| genetic algorithms. I recall a cautionary tale about someone
| using FPGAs to evolve circuits with genetic algorithms.
|
| After a while they noticed several disturbing things. One, that
| the winners had fewer gates than theory thought was necessary
| to solve the problem. Two, some days the winners didn't work,
| and three, sometimes the winners didn't work on a different
| FPGA.
|
| After much study the answer was that the winning candidates
| were treating the gate logic as analog. Manufacturing flaws or
| PSU fluctuations would result in the analog aspects behaving
| differently.
|
| To fix this, they split the fitness test into two passes. All
| implementations that actually worked got re-run in an emulator,
| which of course treats the behavior as purely digital. Only
| candidates that passed both avoided being culled.
| Twirrim wrote:
| > The paper's repo does minimal processing on the datasets. It
| turns out that these problems exist in the source Huggingface
| datasets. The two worst ones can be checked quickly using only
| Huggingface's datasets.load_dataset:
|
| I'm really surprised HuggingFace isn't doing filtering/evaluation
| of the datasets they're presenting. This ought to be a simple
| check for them.
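|
| For example, a quick overlap check along the lines the post
| describes (the dataset id and column name below are assumptions;
| substitute whichever dataset you want to audit):
|
|     from datasets import load_dataset
|
|     ds = load_dataset("dengue_filipino")    # one of the datasets flagged in the post
|     train_texts = set(ds["train"]["text"])  # column name may differ per dataset
|     test_texts = set(ds["test"]["text"])
|
|     dupes = train_texts & test_texts
|     print(f"{len(dupes)} of {len(test_texts)} test texts also appear in train")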
| lalaland1125 wrote:
| It's not the job of HuggingFace to certify datasets. It's
| simply outside the scope of their work.
| godelski wrote:
| That's a tall order. While the cases here are simple and more
| obvious, they don't scale well. It can also be problematic if
| an official dataset has the error, as now they've created a
| different one. They have 48,627 datasets. Their goal is not to
| validate datasets (which is far more difficult than checking
| for dupes (not easy btw)), but to be like github so that others
| (like Ken) can review the work of his peers and check for
| mistakes. Due to this, HF has to allow for uploading of
| arbitrary datasets, because they cannot be an arbitrator of
| what is good or bad, since that depends on what's being solved.
| They could probably set a flag for datasets (and maybe even
| some statistics!) that are under a few gigs in size, but they
| cannot and should not filter them.
| Twirrim wrote:
| I appreciate there is nuance, and some checks would be
| computationally expensive, but something like training data
| and evaluation data being literally identical seems like it
| would be pretty straightforward to check for, and would make
| for a very simple, quick rejection.
| _delirium wrote:
| I think of HuggingFace as essentially a GitHub for ML stuff.
| They just provide infrastructure that anyone can upload to.
| pizza wrote:
| Is there a feature of HF's datasets platform that makes
| load_dataset throw an exception if you try to load a known-
| dubious dataset, unless you explicitly provide a kwarg like
| 'allow_dubious=True'? If not, that might be a boon for the
| whole field... it might nip the propagation of false results
| at the outset.
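|
| No such flag appears to exist in datasets today, but a
| hypothetical guard could be layered on top as a wrapper (the
| function name and denylist contents are made up for
| illustration):
|
|     from datasets import load_dataset
|
|     # hand-maintained denylist of dataset ids known to be contaminated
|     KNOWN_DUBIOUS = {"some_contaminated_dataset"}
|
|     def load_dataset_checked(name, *args, allow_dubious=False, **kwargs):
|         if name in KNOWN_DUBIOUS and not allow_dubious:
|             raise ValueError(
|                 f"{name} is flagged as dubious; pass allow_dubious=True to load it"
|             )
|         return load_dataset(name, *args, **kwargs)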
| itvision wrote:
| <offtopic probably>Haven't read the article, but nowadays there's
| no reason to use either GZIP or bzip2 when ZSTD is available.
| It's just so much better than both that I've no idea why people
| haven't replaced everything with ZSTD, except for XZ/7-Zip, which
| can provide much higher compression ratios at the cost of very
| slow compression and insane RAM requirements (a 3840MB dictionary
| with at least two threads).</offtopic probably>
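|
| A rough way to compare them on your own data (a sketch assuming
| the third-party zstandard package; the sample text is a
| placeholder):
|
|     import gzip
|     import zstandard as zstd
|
|     data = ("some representative sample of your data " * 200).encode("utf-8")
|
|     gz_size = len(gzip.compress(data, compresslevel=9))
|     zs_size = len(zstd.ZstdCompressor(level=19).compress(data))
|
|     print(f"original: {len(data)}  gzip -9: {gz_size}  zstd -19: {zs_size}")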
___________________________________________________________________
(page generated 2023-07-29 23:00 UTC)