[HN Gopher] 78% MNIST accuracy using GZIP in under 10 lines of code
       ___________________________________________________________________
        
       78% MNIST accuracy using GZIP in under 10 lines of code
        
       Author : js98
       Score  : 268 points
       Date   : 2023-09-20 13:00 UTC (9 hours ago)
        
 (HTM) web link (jakobs.dev)
 (TXT) w3m dump (jakobs.dev)
        
       | tysam_and wrote:
       | If you'd like to play around with MNIST yourself, I wrote a
       | PyTorch training implementation that gets ~99.45%+ val accuracy
       | in <13.6 seconds on a V100, est. < 6.5 seconds on an A100. Made
       | to be edited/run in Colab: https://github.com/tysam-code/hlb-
       | CIFAR10
       | 
       | It's originally kitted for CIFAR10, but I've found the parameters
       | to be quite general. The code is very easy to read and well-
       | commented, and is a great starting place for exploration.
       | 
        | Min-cut deltas to run MNIST:
        | 
        |     .datasets.CIFAR10('  ->  .datasets.MNIST('   (both occurrences)
        |     'whiten': Conv(3,    ->  'whiten': Conv(1,
        |     crop_size = 28
       | 
       | Compute for the project funded by Carter Brown and Daniel Gross,
       | my appreciation to them both for helping make this possible.
       | Their support via encouragement has been very helpful as well. <3
       | :)
        
       | hinkley wrote:
       | [flagged]
        
         | morelisp wrote:
         | > Such as OCR.
         | 
         | Such as what?
        
           | hinkley wrote:
           | If you can't hear my eyes rolling from here I'll do it again,
           | harder... how bout now?
           | 
           | Optical Character Recognition. Very old, and very infamous
           | attempts at replacing humans with software. Secretaries and
           | executives know what OCR means.
        
             | tasty_freeze wrote:
             | I believe you missed the point, though I could be wrong.
             | 
              | Person A complains about MNIST not being defined. Person B
              | was sending them up for complaining about undefined
              | acronyms and then using one they don't define.
        
               | [deleted]
        
               | hinkley wrote:
               | No I got it. I just don't want it.
        
           | vren wrote:
           | Optical character recognition... Has been pretty widespread
           | tech since ~2016 or so.
        
             | hinkley wrote:
             | Oh my ex and some acquaintances were bitching about it in
             | 1999.
             | 
             | I recall a guy with good transcription skills starting a
             | long thread on... must have been slashdot? About how it was
             | faster for him to transcribe the numbers again than to
             | verify the OCR was correct, but he couldn't tell his boss
             | that and what should he do?
             | 
             | There's also a famous case where the compression algorithm
             | in a copy/fax machine had a bug in a predefined dictionary
             | that resulted in changing digits to other digits when
             | making copies or faxes. Because they weren't just
             | compressing, they were trying to use a specialized OCR to
             | do aggressive compression. I believe it was spitting out
             | perfectly formed zeroes where another digit was in the
             | original. Yikes.
        
               | ElectricalUnion wrote:
               | > I believe it was spitting out perfectly formed zeroes
               | where another digit was in the original. Yikes.
               | 
               | JBIG2 lossy compression. Covered in another hn story:
               | https://news.ycombinator.com/item?id=29223815
               | 
               | From the story and the comments:
               | 
               | > This is not an OCR problem (as we switched off OCR on
               | purpose), it is a lot worse - patches of the pixel data
               | are randomly replaced in a very subtle and dangerous way:
               | The scanned images look correct at first glance, even
               | though numbers may actually be incorrect.
               | 
                | It's lossy compression turned up to 11 that looks
                | convincingly non-lossy.
        
               | hinkley wrote:
                | I know technically you are right, but I feel like if you
                | sat down and described the high-level features and goals
                | of that image compression (to compress text to a
                | ridiculous degree) and OCR, you'd be hard-pressed to say
                | which list is which unless they gave it away with jargon
                | words. My brain compressed it to "OCR" and filed it
                | there.
               | 
               | It's like Jim Gaffigan's joke about working in a TexMex
               | restaurant in the northern Midwest. It's a tortilla,
               | cheese, meat and beans... I tell you what I'll just bring
               | you something.
        
               | ipython wrote:
               | you're referring to the Xerox Workcenter:
               | http://www.dkriesel.com/en/blog/2013/0802_xerox-
               | workcentres_...
        
               | hinkley wrote:
               | YES. God what a nightmare. Can you imagine all of the
               | arguments and threats of lawsuits that bug caused for
               | small and medium businesses? You changed the contract
               | after we had a verbal agreement and tricked me into
               | signing!
               | 
               | I wonder if any executive assistants got fired over that.
               | Surely.
        
         | telotortium wrote:
         | MNIST (and OCR) are acronyms that are so well-known to anyone
         | who has taken any kind of intro to ML class that there's no
         | more need to define them in a short blog post than there would
         | be for us to define HTML. _I_ learned about MNIST in 2008, and
         | I was just taking a class on numerical methods as a physics
         | major in Matlab where MNIST was just one project.
        
           | hinkley wrote:
           | This isn't a forum about machine learning, though. It's a
           | general forum of geek news. I could talk to you all day about
           | compression. Gzip plus MNIST rings no bells.
        
             | telotortium wrote:
             | Fortunately MNIST is a pretty distinctive Google search
             | term. At most you would need to search for (MNIST machine
             | learning).
        
               | hinkley wrote:
                | Are you running Google searches on
                | https://news.ycombinator.com/news before clicking on
                | stuff? Or expecting an executive summary? I think you can
                | guess which one I'm doing. And which one I believe is
                | normal human behavior.
        
               | itishappy wrote:
               | No need to guess, you requested a summary:
               | 
               | > you've missed the most important hyperlink: what the
               | fuck is a MNIST?
        
               | hinkley wrote:
               | Two questions and you answered the more rhetorical one.
               | Is that a no?
               | 
               | Anyway my main beef isn't with this GitHub repo or the
               | author, it's that nobody remembers how to write a
               | goddamned overview anymore. The point of tree structured
               | data is making it cheap to backtrack when doing a semi
               | random search. The overview is an important part of
               | making hypertext work, particularly when it feeds through
               | catch-all lists like a news site or a landing page for a
               | wiki.
               | 
               | When a hyperlink is organically embedded into a
               | paragraph, you can usually guess what it's for, from the
               | sentence and the context. When it's just a title that is
               | lost.
        
               | telotortium wrote:
               | No, but if the article headline interested me enough to
               | click into the article, I would Google MNIST if I didn't
               | know it already (ML is full of acronyms for datasets,
               | methods, etc. - can't expect blog posts to define all of
               | them. If it were a paper that didn't define the term, I
               | would be annoyed.)
               | 
               | Also, just because an acronym isn't used doesn't mean
               | you'll understand the jargon. For example, also on the
               | front page is a paper titled "Neurons in Large Language
               | Models: Dead, N-Gram, Positional". No acronyms, but I
                | would certainly need to Google (or actually read the
                | paper, not just the abstract) to know what the 3 terms at
                | the end mean (well, I do know what a dead neuron is, but
                | not the other two).
        
             | jjtheblunt wrote:
              | i think the idea is that compression exploits recognized
              | similarities, which raises the question of whether a
              | compressor is actually recognizing things usefully.
        
             | itishappy wrote:
             | The fuck is a GZIP? ;)
        
               | hinkley wrote:
               | Goddamnit where is the spray bottle?
               | 
               | Psst psst psst. NO. Bad kitty.
        
         | RecycledEle wrote:
         | [flagged]
        
           | w23j wrote:
           | > I suggest all HN articles must define all abbreviations and
           | acronyms or be removed.
           | 
           | I can't decide if you are joking?
           | 
           | HR = Human Resources
           | 
           | PC = Personal Computer
           | 
           | HN = Hacker News
           | 
           | OP = Original Poster (?)
           | 
           | M.S. = Master of Science
           | 
           | certs = certificates
        
       | jp57 wrote:
       | Leaving aside whether this problem is a good application of this
       | compression trick, I want to say that everyone experimenting with
       | this should stop using `gzip` and start using `zlib`.
       | 
       | If you change the first line from `gzip.compress` to
       | `zlib.compress` you should get the same classifier performance
       | with a 3x speedup.
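        | 
        | A minimal sketch of the swap (both functions use the same
        | DEFLATE algorithm; gzip just wraps it in an extra header and
        | checksum, so the compressed lengths differ only by a small
        | constant):
        | 
        |     import gzip, zlib
        | 
        |     data = bytes(28 * 28)  # stand-in for one flattened MNIST image
        |     print(len(gzip.compress(data)))  # gzip wrapper adds ~12 bytes
        |     print(len(zlib.compress(data)))  # same DEFLATE stream underneath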
        
       | montebicyclelo wrote:
       | I tried replacing the distance function in the code with some
        | simpler distance measures:
        | 
        |     Gzip distance:         ~3 minutes,    78% accuracy
        |     Euclidean distance:    ~0.5 seconds,  93% accuracy
        |     Jaccard distance *:    ~0.7 seconds,  94% accuracy
        |     Dice dissimilarity *:  ~0.8 seconds,  94% accuracy
        | 
        |     * after binarising the images
       | 
       | So, as a distance measure for classifying MNIST digits, GZIP has
       | lower accuracy, and is much more computationally demanding than
       | other measures.
       | 
       | I'm not that familiar with how the GZIP algorithm works, it's
       | kind of interesting that it's so much lower. I wonder whether
       | image focused compression algorithms might do better?
       | 
       | (Edit: Btw, I enjoyed the post; it's a creative idea, the writing
       | and code was great, and it's sparked some good discussion. But
       | after having a closer look, I think the baselines above provide
       | some context to the gzip scores.)
        
         | w-m wrote:
          | Best one I found is Normalized Mutual Information at 95%. It's
          | a little more complex, but you could still compute it
          | relatively quickly on binarized images.
          | 
          |     NMI skimage:  ~30 seconds,  95% accuracy
          |     NMI numba *:  ~0.6 seconds, 95% accuracy
         | 
          | Here's the code, courtesy of ChatGPT:
          | 
          |     import numpy as np
          |     from numba import njit
          | 
          |     @njit(cache=True)
          |     def compute_joint(img1, img2):
          |         stacked = 2 * img1.ravel() + img2.ravel()
          |         return np.bincount(stacked, minlength=4).reshape(2, 2)
          | 
          |     @njit(cache=True)
          |     def entropy(counts, total):
          |         # Convert counts to probabilities
          |         p = counts / total
          |         # Only calculate for non-zero entries to avoid log(0)
          |         return -np.sum(p[p > 0] * np.log2(p[p > 0]))
          | 
          |     @njit(cache=True)
          |     def normalized_mutual_information(img1, img2):
          |         joint_counts = compute_joint(img1, img2)
          |         total_pixels = 28 * 28
          | 
          |         # Marginal counts
          |         c1 = np.sum(joint_counts, axis=1)
          |         c2 = np.sum(joint_counts, axis=0)
          | 
          |         # Compute entropies directly from counts
          |         h1 = entropy(c1, total_pixels)
          |         h2 = entropy(c2, total_pixels)
          |         joint_entropy = entropy(joint_counts.flatten(), total_pixels)
          | 
          |         mutual_info = h1 + h2 - joint_entropy
          |         return 2 * mutual_info / (h1 + h2)
        
         | tysam_and wrote:
         | Holy cow, I knew that MNIST was simple, but not that simple.
         | 
         | Could you post a snippet of the code that you used to achieve
         | this? It would be really, really nice to have a baseline to
         | work from I'm sure, and I feel like this could be really useful
         | to a few other areas (my personal obsession is speed-training
         | on CIFAR10 :')))) )
         | 
         | Holy cow, that's insane. :O
        
           | montebicyclelo wrote:
            | I used the notebook linked in the original post [1]. It
            | evaluates using 100 samples from the test set (I'm guessing
            | because the gzip method is slow - it would take ~7 hours on
            | the full test set, on my machine).
           | 
            | I plugged in the distance measures for the `compute_ncd`
            | function. (Jaccard/Dice have been negated and the -1
            | removed.)
            | 
            |     def euclidean(x1, x2):
            |         return np.sum(np.square(x1 - x2))
            | 
            |     def jaccard(x1, x2):
            |         x1_binary = x1 > 0.5
            |         x2_binary = x2 > 0.5
            |         return np.logical_or(x1_binary, x2_binary).sum() / \
            |             np.logical_and(x1_binary, x2_binary).sum()
            | 
            |     def dice(x1, x2):
            |         x1_binary = x1 > 0.5
            |         x2_binary = x2 > 0.5
            |         return 2 * np.logical_or(x1_binary, x2_binary).sum() / \
            |             (x1_binary.sum() + x2_binary.sum())
           | 
           | [1] https://github.com/Jakob-98/mono/blob/73168bc0ea904e75865
           | 815...
        
         | jdthedisciple wrote:
         | That's pretty amazing.
         | 
         | So the whole gzip thing - while fancy - really is extra steps
         | for less bang.
        
       | yannccc2 wrote:
       | All these ideas date back to at least 2010, probably earlier. It
       | was called "information distance".
       | https://arxiv.org/abs/1006.3520
        
       | petters wrote:
       | I wonder why you started with a code-golfed version? That seems
       | original to the main point of the post
        
         | seeknotfind wrote:
         | original -> orthogonal?
         | 
         | I like short code! It's like a joke "this is so easy, 10 loc"
         | :)
        
           | hinkley wrote:
           | The internet has turned into a game of "is it him or is it
           | autocorrect?"
           | 
           | I can't count how many times I've looked up at a line of text
           | to spot check it, type in a couple more letters, hit save
           | without looking, and find later that I sound like I'm having
           | a stroke. Never mind the times I've simply forgotten to look.
           | Yes that's a word, stop trying to replace it.
           | 
           | (Five minutes ago I had to type "laissez-faire" three times.
           | Yesterday it turned "understanding" into god knows what. And
           | if iOS doesn't stop adding apostrophes to "its" I'm gonna
           | fuckin lose my shit)
        
             | burnished wrote:
              | I ended up turning autocorrect off due to iOS being more
              | aggressive with its corrections than my previous Android
              | phones (both of which frequently failed to recognize
              | common words; maybe both, but certainly the Android, were
              | absurdly brand-aware).
        
               | hinkley wrote:
               | > brand aware
               | 
               | I got so tired of Mathematica italicizing its own name
               | when I was in college so I figured out how to type it in
               | two phases so it wouldn't. It was my tiny little petty
               | victory over Stephen Wolfram ruining math for me.
               | 
               | I know that struggle.
        
       | great_psy wrote:
        | Having some artificial intelligence algorithm that is completely
        | understood and tunable, like gzip is, would be great for many
        | uses.
        | 
        | I think it's pretty hard to just improve an NN without more data
        | or some non-trivial amount of effort.
        
       | Lerc wrote:
       | I'm sure there was something like this mentioned in "Managing
       | Gigabytes" or thereabouts. It might have been using bitwise
       | arithmetic compression which makes sense for the problem at hand.
        
       | tdr2d wrote:
        | Obviously, the code may be elegant and compact, but 78% accuracy
        | is considered very, very bad for MNIST.
        | 
        | A dummy model written with TensorFlow easily reaches 90%
        | accuracy. The best models reach 99.87%; see the benchmark:
        | https://paperswithcode.com/sota/image-classification-on-mnis...
        
         | m00x wrote:
         | It's not trying to break records, it just shows a neat aspect
         | of compression. It's still 8 times better than baseline, which
         | showcases that compression can learn representation.
        
         | esafak wrote:
         | The article emphasizes the wrong thing, in my view. The
         | interesting part is that compression -- without learning a
         | model -- can be used for classification. This raises the
         | question of what other information-theoretic measures can be
         | used; cheaper, lossy ones.
         | 
          |  _To Compress or Not to Compress - Self-Supervised Learning
          | and Information Theory: A Review_
          | https://arxiv.org/abs/2304.09355
        
           | Quenty wrote:
           | I mean, they're using knn underneath. You can apparently get
           | 97% accuracy with normal knn, at n=4 if you compare pixel
           | distance.
           | 
           | So another way to frame this might be that gzip costs a lot
           | of accuracy but may lead to better performance.
           | 
           | https://newpblog.netlify.app/2018-01-24-knn-analysis-on-
           | mnis...
        
           | a_wild_dandan wrote:
           | An increasingly common refrain in machine learning is
           | "intelligence is compression." Folks who believe that might
           | bristle at the distinction between learning and compression.
        
             | esafak wrote:
             | I doubt it because learned features already constitute
             | lossy compression. The question is what kind of
             | compression; lossy vs. lossless, learned vs. unlearned.
        
         | sundarurfriend wrote:
         | The point is not to have "elegant and compact" code, this is
         | meant to be a fun curiosity, and doing it in 10 lines is just
         | an additional layer of challenge for the heck of it.
         | 
         | The interesting thing is not in whether GZip can achieve SOTA,
         | it's that it can do a decent job at all. (The interesting thing
         | is not in whether the bear can recreate Mozart exactly, it's
         | that it can play the piano at all.)
        
       | 3abiton wrote:
        | Didn't it turn out that the authors of that paper made mistakes
        | that catapulted their results to the top of the benchmark
        | charts? I thought the theory was inconsistent after that
        | incident. Still, 78% accuracy from just GZIP is impressive.
        
         | 0xfffafaCrash wrote:
         | haven't looked through this post yet but I think this is what
         | you have in mind: https://kenschutte.com/gzip-knn-paper/
        
       | 0xDEF wrote:
        | Has anyone looked at how different compression algorithms
        | compare when doing NCD classification?
        | 
        | gzip first does LZ77 and then Huffman coding. ANS is a more
        | modern alternative to Huffman coding that can achieve higher
        | compression ratios.
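        | 
        | A quick way to probe this with just the standard library (a
        | sketch; any compressor with the same bytes-in/bytes-out shape
        | can be swapped in, though ANS-based ones like zstd need a
        | third-party package):
        | 
        |     import bz2, lzma, zlib
        | 
        |     def ncd(x: bytes, y: bytes, compress) -> float:
        |         # Normalized compression distance, as in the gzip-kNN paper
        |         cx, cy = len(compress(x)), len(compress(y))
        |         cxy = len(compress(x + y))
        |         return (cxy - min(cx, cy)) / max(cx, cy)
        | 
        |     a, b = b"7" * 784, b"3" * 784  # toy stand-ins for image bytes
        |     for name, comp in [("zlib", zlib.compress),
        |                        ("bz2", bz2.compress),
        |                        ("lzma", lzma.compress)]:
        |         print(name, ncd(a, a, comp), ncd(a, b, comp))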
        
       | tjungblut wrote:
        | I can't immediately find it, but a couple of years back there
        | was a "meta-feature" which was simply the size of the MNIST
        | image. I think that scored about 90'ish% accuracy on its own,
        | without even looking at the image.
        
         | 13of40 wrote:
         | A few years back I worked on a project that involved
         | fingerprinting screenshots of web pages, and compressed image
         | size was pretty much as good as any fingerprinting method we
         | could come up with for comparing the similarity between them.
        
           | tysam_and wrote:
           | Makes sense, considering that's not too terribly far off from
           | what the KL divergence does
        
           | momojo wrote:
           | The off-the-beaten-path nature of this reminds me of banks
           | sending $<arbitrary decimal amount> as a PIN for auth.
        
         | p1esk wrote:
          | What do you mean by "size"? Gzipped size? If you simply look
          | at how dark an MNIST image is (count the percentage of dark
          | pixels) you'll get about 20% accuracy, which is twice as good
          | as a random guess but a long way from 90'ish%.
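          | 
          | (A sketch of that kind of baseline, assuming flattened arrays
          | X in [0, 1] and integer labels y are already loaded: predict
          | the class whose mean darkness is closest.)
          | 
          |     import numpy as np
          | 
          |     darkness = (X > 0.5).mean(axis=1)  # fraction of dark pixels
          |     means = np.array([darkness[y == d].mean() for d in range(10)])
          |     pred = np.abs(darkness[:, None] - means[None, :]).argmin(axis=1)
          |     print((pred == y).mean())  # roughly 0.2, per the claim above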
        
           | yvdriess wrote:
            | What do you mean by accuracy here? Usually 50% accuracy
            | means a coin toss, meaning 20% accuracy is equal to 80%
            | accuracy, which is better than the article's 78% and not
            | that far from 90%.
        
             | Frotag wrote:
             | > meaning 20% accuracy is equal to 80% accuracy,
             | 
              | Only if your model is outputting a yes/no answer, right?
              | And if your definition of accuracy is "class with highest
              | probability" (and not "N classes with highest prob").
              | 
              | If your dataset has more than 2 classes, like MNIST, a
              | super low accuracy only tells you to ignore the class the
              | model guesses. It doesn't tell you which of the remaining
              | classes is correct.
        
             | rjh29 wrote:
             | There are ten choices, so getting the answer right 20% of
             | the time is very plausible.
        
             | p1esk wrote:
              | There are 10 classes in MNIST. A random guess would be 10%.
        
             | tysam_and wrote:
             | "One simple trick to beat the statistical odds...."
        
       | tobeyey wrote:
        | Replacing the NCD
        | 
        |     distances = [(compute_ncd(x1, x), label)
        |                  for x, _, label in compressed_lengths]
        | 
        | with the Euclidean distance
        | 
        |     distances = [(np.sqrt(np.sum(np.square(x1 - x))), label)
        |                  for x, _, label in compressed_lengths]
        | 
        | gives you +15% test accuracy and saves you a lot of compute.
        
       | benreesman wrote:
        | My favorite book about the deep connections between information
        | theory, compression, and learning algorithms is MacKay's
        | "Information Theory, Inference, and Learning Algorithms" (most
        | probably know about it, but I didn't for a long time, so maybe
        | some will benefit from the mention).
        | 
        | I gather this is common knowledge (if not sufficiently
        | emphasized at times) among those with serious educations, but as
        | a self-taught, practical application-type ML person, this
        | profound thread running through all these topics (and seemingly
        | into heavy-ass particle physics and cosmology and stuff like
        | that) was a blinding "Aha!" moment, so I'll venture this comment
        | in the hope that even one other person has that unforgettable
        | moment.
        
       | GaggiX wrote:
       | For a comparison with others techniques:
       | 
       | Linear SVC (best performance): 92 %
       | 
       | SVC rbf (best performance): 96.4 %
       | 
       | SVC poly (best performance): 94.5 %
       | 
       | Logistic regression (prev assignment): 89 %
       | 
       | Naive Bayes (prev assignment): 81 %
       | 
       | From this blog page: https://dmkothari.github.io/Machine-
       | Learning-Projects/SVM_wi...
       | 
        | Also, it seems from reading online articles that people are able
        | to obtain much better results just by using k-NN, so I imagine
        | the author just made his job harder by using gzip, but I could
        | be wrong about this.
        
         | grandma_tea wrote:
         | While it's cool that this works at all, I wish we would stop
         | using MNIST as a benchmark given how trivial it is.
        
           | unoti wrote:
            | MNIST is pretty good as a proof of concept. While I agree
            | it's trivial, I see MNIST as being like how, when you're
            | making a toy programming language, the first thing you do
            | with it is write a recursive Fibonacci function.
        
           | dmazzoni wrote:
           | It's still a great data set because it's real-world, useful,
           | high-quality, and an excellent size (~60k examples), and the
           | "correct" classification isn't very subjective at all -
           | humans can easily get 99% accuracy.
        
           | IKantRead wrote:
            | It's a good benchmark _because_ it's so trivial.
            | 
            | Sure, it's not great at differentiating between SotA
            | techniques, but it's very useful for sanity checks like this
            | one.
            | 
            | Even for SotA models, it's still useful to verify that you
            | can get greater than 98% accuracy on MNIST before exploring
            | larger, more complex benchmarks.
            | 
            | It certainly shouldn't be the _only_ benchmark, but it's a
            | great place to start iterating on ideas.
        
             | TimPC wrote:
             | It's a bad benchmark because it's artificially clean. It's
             | effectively a 2d dataset with no occlusions. So nearly
             | everything you try on it will work, and many things you try
             | on it won't scale to typical image problems. There are good
             | 3d datasets with more realistic examples that are still
             | fairly simplistic compared to the state of the art large
             | datasets, but at least give you signal that your technique
             | is robust to common problems in vision. MNIST is so
             | simplistic that you encounter none of the typical problems
             | in computer vision settings so it doesn't give you a good
             | prediction of how good your technique is.
        
               | ghshephard wrote:
                | Can you imagine a model that performs very poorly on
                | MNIST but performs well on real-world computer vision
                | problems? If you can't, then MNIST is a nice simple
                | smoke test when assessing models.
        
               | d3w4s9 wrote:
               | What's wrong with "artificially clean"? The goal of
               | benchmarks is to compare and know whether one model is
               | better than the other. There is never a "perfect" or
               | "objective" benchmark. Different benchmarks may highlight
               | advantages in certain models, which is a good thing, but
               | there is absolutely nothing wrong with using MNIST as a
               | dataset to give you a basic idea of how models perform.
        
           | antiquark wrote:
           | Take a look at the EMNIST - Extended MNIST dataset. It has
           | both digits and letters of the alphabet.
        
         | dmazzoni wrote:
         | That blog page is not showing state-of-the-art results, it's
         | just taking relatively naive SVM implementations and comparing
         | them.
         | 
         | The original research paper that introduced the MNIST data set
         | was getting around 98% accuracy, and today neural nets are
         | getting 99.87% accuracy.
         | 
         | https://paperswithcode.com/sota/image-classification-on-mnis...
        
           | GaggiX wrote:
            | Yep, I was showing other simple techniques; of course a
            | neural network would be much better at it.
        
         | TimPC wrote:
            | The whole point isn't to do better. It's to show that enough
            | information survives compression that you can still get a
            | very strong signal. Compression is intended to make things
            | harder, and it still does.
        
           | amelius wrote:
           | Why? I mean it's trivial to add a layer of encryption ...
           | 
           | (Also the terminology seems off here, as 100% of information
           | survives gzip.)
        
           | GaggiX wrote:
            | The article OP linked is inspired by the gzip+kNN paper that
            | was released a few months ago. Compression is not intended
            | to make things harder but to extract useful features that
            | can be used by simple algorithms like kNN or SVM etc...
            | 
            | For example, it would be almost impossible to distinguish
            | cat and dog images by using SVM or kNN directly on the data,
            | but it would be much easier if the images were first passed
            | into an image encoder and a small embedding vector were used
            | for each one instead. In the article OP linked, it doesn't
            | seem that gzip is very good at extracting useful patterns
            | for the MNIST dataset.
        
         | IKantRead wrote:
         | Thanks for posting this.
         | 
         | Most people don't realize that Logistic regression can get ~90%
         | accuracy on MNIST.
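          | 
          | A minimal sketch of that baseline with scikit-learn (assuming
          | the OpenML copy of MNIST; exact accuracy varies a little with
          | the solver and regularization):
          | 
          |     from sklearn.datasets import fetch_openml
          |     from sklearn.linear_model import LogisticRegression
          |     from sklearn.model_selection import train_test_split
          | 
          |     X, y = fetch_openml("mnist_784", version=1,
          |                         return_X_y=True, as_frame=False)
          |     X_tr, X_te, y_tr, y_te = train_test_split(
          |         X / 255.0, y, test_size=10000, random_state=0)
          |     clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
          |     print(clf.score(X_te, y_te))  # typically around 0.92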
         | 
          | As a big fan of starting with simple models first and adding
          | complexity later, I've frequently been told that "logistic
          | regression won't work!" for problems where it can in fact
          | perform excellently.
         | 
         | When faced with this resistance to logistic regression I'll
         | often ask what they think the baseline performance of it would
         | be on MNIST. The guesses I hear most often are 20-30%.
         | 
          | People, even machine learning people, often don't realize the
          | rapidly diminishing returns you get from adding _a lot_ of
          | complexity to your models. Sometimes it's worth it, but it's
          | always good to start simple first.
         | 
         | It's also been my experience that if you don't get good
         | performance with a simple model, you are very unlikely to get
         | great performance from a more complex one.
        
           | short_sells_poo wrote:
           | To a large degree this happens because logistic regression is
           | not a sexy approach that one can add to their CV. Everyone
           | wants to solve problems with big, complicated and buzzwordy
           | models, because that sells (and is perhaps more interesting
           | as well).
           | 
           | It's really a tragedy, because so many classic models would
           | work fine for real world applications.
        
           | make3 wrote:
            | this is true unless you can use large pretrained models,
            | which are very simple to use, are very resistant to all
            | sorts of noise, & would get ultra-high accuracy with just
            | logistic regression on the output of the model
        
           | mardifoufs wrote:
            | When would 90% accuracy on a dataset like MNIST ever be
            | useful? And I mean useful as in usable for actual products
            | or software, especially considering MNIST is more of a toy
            | dataset.
            | 
            | I think that's why machine learning is the way to go for
            | this type of detection: why go with anything other than a
            | CNN (in this case) when it is now trivial to set up and
            | train? Again, unless it's just to mess around with, 90%
            | MNIST accuracy is not useful in the real world.
        
             | vannevar wrote:
             | I don't think the point was that you should use logistic
             | regression on MNIST. In lesser-known problems, say a custom
             | in-house model, if you don't try the simpler approach
             | first, you'll never know that your more complex solution is
             | not worth the extra expense, or is actually worse than a
             | simpler, cheaper model. MNIST is well-known to have nearly
             | perfect solutions at this point, but for most novel
             | problems, the data scientist has no idea what is
             | theoretically possible.
             | 
             | Now, you can say that CNNs or other techniques are easily
             | accessible these days, and almost trivial to set up. But
             | they may not be trivial to train and run in terms of
             | compute in the real world.
        
           | KRAKRISMOTT wrote:
           | If you stack logistic regression in layers, you get a neural
           | network.
        
           | nextos wrote:
            | Came here to say the same thing; actually, NCD can probably
            | do much better than 78%. Li & Vitanyi's book about
            | Kolmogorov complexity has some interesting unsupervised
            | examples.
            | 
            | A simple CNN, as implemented in the Keras tutorial, can
            | easily exceed 98%. 78% is very poor performance for MNIST
            | even if model complexity is penalized.
        
           | vjerancrnjak wrote:
            | Extending existing features with their non-linear mappings
            | would improve logistic regression too, probably to the SVC
            | level (an RBF or poly kernel is exactly that, but implicit).
           | 
           | Linear models are really well researched, and today with the
           | compute and with proper training and data preparation they
           | can easily get to satisfying levels of performance for a
           | variety of tasks.
        
         | westurner wrote:
         | So there is a more optimal compression algorithm for the
         | relation between the MNIST inputs and outputs!
         | 
          | Other models tend to add noise somewhere in there; what about
          | feature engineering before the gzip? Maybe a Gaussian blur and
          | a convolution first, and then deep learning for feature
          | selection.
        
       | jszymborski wrote:
        | In fairness, you can run MNIST through UMAP and get near-perfect
        | separation. I'm of the belief that you have to try pretty hard
        | not to do well on MNIST these days.
       | 
       | https://github.com/lmcinnes/umap_paper_notebooks/blob/master...
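        | 
        | A minimal sketch of that experiment (assuming the umap-learn
        | package and the OpenML copy of MNIST; subsampling keeps the
        | runtime reasonable):
        | 
        |     from sklearn.datasets import fetch_openml
        |     import umap
        | 
        |     X, y = fetch_openml("mnist_784", version=1,
        |                         return_X_y=True, as_frame=False)
        |     emb = umap.UMAP(n_components=2).fit_transform(X[:10000])
        |     # Plotting emb colored by y[:10000] shows the ten digits as
        |     # well-separated clusters.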
       | 
       | EDIT: I should add, unless it isn't clear, that we really should
       | retire the dataset. Something like the QuickDraw dataset makes a
       | lot more sense to me.
        
         | sundarurfriend wrote:
         | Y'all are making the (rude) person below complaining about
         | acronyms look reasonable. The repository doesn't define UMAP
         | either, but if you believe ChatGPT it is:
         | 
         | > UMAP, which stands for Uniform Manifold Approximation and
         | Projection, is a dimensionality reduction technique and data
         | visualization method commonly used in machine learning and data
         | analysis.
        
           | tomrod wrote:
           | UMAP is as ubiquitous as T-SNE these days, though to be fair
           | those on HN that focus on JVM, LLVM, NPM, or TLS may not know
           | about UMAP/T-SNE.
        
           | jszymborski wrote:
           | It's[0] a non-linear dimensionality reduction technique in
           | the vein of t-SNE[1] that is very well known in the field.
           | 
           | My two cents is that if your problem can be solved with
           | something like UMAP + kNN[2], then you really shouldn't be
           | using Deep Learning to solve it.
           | 
           | [0] https://en.wikipedia.org/wiki/Nonlinear_dimensionality_re
           | duc...
           | 
           | [1] https://en.wikipedia.org/wiki/T-distributed_stochastic_ne
           | igh...
           | 
           | [2]
           | https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
        
           | wenc wrote:
           | We work in our own ecological niches which sometimes entails
           | having a specialized language that allows us to be precise
           | when referring to things.
           | 
           | I sometimes see HN comments complaining that acronyms and
           | terms are dropped with no explanation. These comments come
           | across as entitled to me -- instead of complaining, why not
           | be curious and ask "what does that mean in this context?"
           | 
           | Imagine someone dropping in here and complaining, "the
           | article sucks because it uses the word 'compiler' and assumes
           | we all know what a 'compiler' is." This is what it sounds
           | like to those of us to work in specialized fields.
           | 
           | Instead if someone said, "I'm new to this, could someone help
           | me understand what a 'compiler' is?", it demonstrates
           | curiosity and people are more inclined to explain.
        
         | krick wrote:
          | Yeah, but the thing is, gzip isn't "these days". It was a
          | thing long before UMAP and even MNIST itself. And the approach
          | isn't particularly novel either; this is a very simple idea,
          | if you understand compression. It could've been written the
          | first day MNIST was published, and it would still be 78%
          | accuracy. This is kinda amazing to me.
        
         | js98 wrote:
         | Author here: completely agree. I think it is not much of an
         | achievement by itself, but it is interesting to see that it
         | works!
         | 
         | I will add a comment/edit to the post once I am home to clarify
         | the relative ease of solving MNIST
        
           | jszymborski wrote:
           | Sorry if my comment came off as dismissive of the post! I was
           | just trying to frame the results for those unfamiliar with
           | the properties of MNIST.
           | 
            | I think it is indeed interesting, as it shows that gzip can
            | capture the same things that e.g. UMAP and co. are capturing
            | when they achieve good scores on MNIST.
           | 
           | Also, I'll add, that even despite some of the suspicions
           | people have cast on the Gzip results in other experiments,
           | I'm bullish on the utility of reducing entropy for
           | classification problems.
        
         | ekidd wrote:
          | > _I should add, unless it isn't clear, that we really should
          | retire the dataset._
         | 
         | From a research perspective, MNIST is basically solved. I think
         | current performance on MNIST is better than humans, correct?
         | 
         | But it's still valuable as a teaching tool or a "Hello world"
         | data set, because it's so simple and because most reasonable
         | algorithms will reach 97% accuracy. Even if you build your
         | tools from scratch, it makes a nice homework-sized problem. And
         | it's an obviously useful task that anyone can understand. "Oh,
         | digit recognition for mail!"
        
           | jszymborski wrote:
           | I was thinking about this just now... the real utility of
           | MNIST is that, since it's basically impossible to get a bad
           | result, it's a good sanity check.
        
       | wayeq wrote:
       | I'm merely a hobbyist in this domain but isn't highly compressed
       | data (like encrypted data) also high entropy? If this is finding
       | patterns in the compressed data to figure out which digit the
       | uncompressed data represents, shouldn't those patterns be
       | exploitable for better compression?
        
         | fritzo wrote:
         | The demonstration isn't classifying based on compressed data,
         | rather it's classifying on the _compressibility_ of data. The
         | idea is that  "7 7" should be more compressible than "7 3", and
         | similarly raster images of "7 7" should be more compressible
         | than raster images of "7 3".
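          | 
          | A toy illustration of that idea (a sketch with zlib, using
          | fake byte strings standing in for the raster images):
          | 
          |     import zlib
          | 
          |     def c(b: bytes) -> int:
          |         return len(zlib.compress(b))
          | 
          |     seven_a = bytes(700) + b"\xff" * 84  # fake raster of a "7"
          |     seven_b = bytes(690) + b"\xff" * 94  # another fake "7"
          |     three   = bytes([0, 255]) * 392      # fake raster of a "3"
          | 
          |     # The pair with shared structure should compress smaller:
          |     print(c(seven_a + seven_b), c(seven_a + three))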
        
           | wayeq wrote:
           | Ah ok, that makes sense. Thank you for the explanation.
        
         | Zamicol wrote:
         | Encrypted data ideally is incompressible. Incompressibility is
         | a hallmark of efficient cryptographic operations.
         | 
         | See the Wikipedia article on Kolmogorov complexity, which has a
         | short section about compression.
         | https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compress...
         | 
          | Edit: One of my favorite concepts in the domain of compression
          | is the pigeonhole principle, which implies that for any
          | lossless compression algorithm, some outputs must be _larger_
          | than the inputs.
          | 
          | Well-designed encrypted payloads may be compressed, but the
          | outputs should on average be larger than the inputs, rendering
          | compression useless; thus the data is said to be
          | "incompressible".
         | 
         | https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...
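          | 
          | A quick stdlib illustration (random bytes stand in for well-
          | encrypted data; the "compressed" output comes back slightly
          | larger than the input):
          | 
          |     import os, zlib
          | 
          |     blob = os.urandom(1 << 16)
          |     print(len(zlib.compress(blob)) - len(blob))  # small positive number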
        
       | benob wrote:
        | What explains the huge difference in perf with the approach from
        | 2019 mentioned at the end?
        | 
        | For image classification, couldn't one use the residual from
        | applying a JPEG codebook learned on a training example as a
        | metric?
        
       | Karellen wrote:
        | You solved some problem, _including implementing the GZIP
        | algorithm_, in less than 10 lines of code?
       | 
       | ...
       | 
       | Oh, ok, not that.
        
       | _a_a_a_ wrote:
       | "MNIST"?
       | 
       | accuracy - but of what?
       | 
       | what's this about?
        
         | simonw wrote:
         | MNIST is a classic image classification exercise - a dataset of
         | 60,000 training images and 10,000 testing images where each
         | image is a handwritten numeral as a 28x28 pixel grayscale
         | image.
         | 
         | The challenge is to build a computer vision model that can tell
         | which numeral each handwritten digit represents.
         | 
         | https://en.wikipedia.org/wiki/MNIST_database
         | 
         | 78% accuracy on a solution is pretty bad, but achieving it just
         | using GZIP is a very neat hack.
        
           | _a_a_a_ wrote:
           | thank you
        
           | Detrytus wrote:
           | What's the average human performance for this task?
        
             | burnished wrote:
             | I don't know off hand, but go take a look at the images - I
             | would expect near 100%.
        
       | tysam_and wrote:
       | Additionally, try flipping your images and averaging the size
       | before comparing the distance between them. I'd expect a boost of
       | about 78% -> 84% or so, based on how this typically works as TTA.
        
         | jdthedisciple wrote:
         | Can you elaborate more on that technique?
         | 
         | Why would it improve the result so much? Sounds very
         | interesting so I'm curious.
        
           | tysam_and wrote:
           | _facepalm_ I just realized I was talking about this in the
           | context of MNIST -- my apologies.
           | 
           | It would be better in a problem with horizontal symmetry like
           | CIFAR, esp if the zipping is serialized by unwinding the 2d
           | pixel array into 1d.
           | 
            | One trick that _should_ work is comparing the distances you
            | get by starting the compression at the 4 different corners
            | of the image, in the 2 separate scan directions for each
            | corner. That should provide way more than enough information
            | for k-means clustering.
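            | 
            | A sketch of what that might look like (hypothetical helper,
            | not tested on MNIST; eight compressed sizes per image, one
            | per corner/direction):
            | 
            |     import zlib
            |     import numpy as np
            | 
            |     def eight_scan_sizes(img: np.ndarray) -> list[int]:
            |         sizes = []
            |         for a in (img, img.T):  # two scan directions
            |             # four corners as the starting point
            |             for b in (a, a[::-1], a[:, ::-1], a[::-1, ::-1]):
            |                 sizes.append(len(zlib.compress(b.tobytes())))
            |         return sizes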
           | 
           | My apologies again for my mistake, thank you for asking, I
           | wouldn't have really seen that otherwise :')))).
        
       | m00x wrote:
       | It goes along the same lines as the recent paper from Deepmind
       | stating that DL (language modeling in their case) is compression.
       | 
       | https://arxiv.org/pdf/2309.10668.pdf
        
       | bob1029 wrote:
       | General purpose compressors and information distance measures
       | have become super interesting to me while I've been investigating
       | alternative language models.
       | 
       | I've been playing around with an attention mechanism that
       | combines the idea of using normalized compression distance (gzip)
       | with discrete convolution between candidate sequences (sliding
       | window of N bytes over each). Another round of normalization over
       | the convolution outputs - accommodating varying lengths - allows
       | for us to compare candidate sequences for relevant information on
       | equal grounds.
       | 
        | The NCD formula I am using right now:
        | 
        |     NCD(x,y) = (C(xy) - MIN(C(x), C(y))) / MAX(C(x), C(y))
       | 
       | No weird parameters or any other things to tune. The only
       | parameters are the source documents and the input context/query.
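        | 
        | One plausible reading of that pipeline as code (a sketch, not
        | the actual implementation; the window size and step are made-up
        | parameters for illustration):
        | 
        |     import zlib
        |     import numpy as np
        | 
        |     def C(b: bytes) -> int:
        |         return len(zlib.compress(b))
        | 
        |     def ncd(x: bytes, y: bytes) -> float:
        |         cx, cy = C(x), C(y)
        |         return (C(x + y) - min(cx, cy)) / max(cx, cy)
        | 
        |     def window_scores(query: bytes, doc: bytes, n=64, step=16):
        |         # Slide an n-byte window over the document and score each
        |         # window against the query, then normalize the scores so
        |         # documents of different lengths compare on equal grounds.
        |         s = np.array([ncd(query, doc[i:i + n])
        |                       for i in range(0, max(1, len(doc) - n + 1), step)])
        |         return (s - s.mean()) / (s.std() + 1e-9)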
        
       ___________________________________________________________________
       (page generated 2023-09-20 23:00 UTC)