[HN Gopher] My failed attempt to shrink all NPM packages by 5%
       ___________________________________________________________________
        
       My failed attempt to shrink all NPM packages by 5%
        
       Author : todsacerdoti
       Score  : 272 points
       Date   : 2025-01-27 12:44 UTC (10 hours ago)
        
 (HTM) web link (evanhahn.com)
 (TXT) w3m dump (evanhahn.com)
        
       | huqedato wrote:
       | Try to use this https://github.com/xthezealot/npmprune
        
         | phdelightful wrote:
         | My reading of OP is that it's less about whether zopfli is
         | technically the best way to achieve a 5% reduction in package
         | size, and more about how that relatively simple proposal
         | interacted with the NPM committee. Do you think something like
         | this would fare better or differently for some reason?
        
       | sd9 wrote:
       | The final pro/cons list:
       | https://github.com/npm/rfcs/pull/595#issuecomment-1200480148
       | 
       | I don't find the cons all that compelling to be honest, or at
       | least I think they warrant further discussion to see if there are
       | workarounds (e.g. a choice of compression scheme for a library
       | like typescript, if they would prefer faster publishes).
       | 
       | It would have been interesting to see what eventually played out
       | if the author hadn't closed the RFC themselves. It could have
       | been the sort of thing that eventually happens after 2 years, but
       | then quietly makes everybody's lives better.
        
         | n4r9 wrote:
         | I felt the same. The proposal wasn't rejected! Also,
         | performance gains go beyond user stories - e.g. they reduce
         | infra costs and environmental impact - so I think the main
         | concerns of the maintainers could have been addressed.
        
           | IshKebab wrote:
           | > The proposal wasn't rejected!
           | 
            | They soft-rejected it by requiring more validation than was
            | reasonable. I see this all the time: "But did you consider
            | <extremely unlikely issue>? Please go and run more tests."
           | 
           | It's pretty clear that the people making the decision didn't
           | actually care about the bandwidth savings, otherwise they
           | would have put the work in themselves to do this, e.g. by
           | requiring Zopfli for popular packages. I doubt Microsoft
           | cares if it takes an extra 2 minutes to publish Typescript.
           | 
           | Kind of a wild decision considering NPM uses 4.5 PB of
           | traffic per week. 5% of that is 225 TB/week, which according
           | to my brief checks costs around $10k/week!
           | 
           | I guess this is a "not my money" problem fundamentally.
        
             | lyu07282 wrote:
             | > which according to my brief checks costs around $10k/week
             | 
              | That's the market price though; for Microsoft it's a tiny
              | fraction of that.
        
             | johnfn wrote:
             | This doesn't seem quite correct to me. They weren't asking
             | for "more validation than was reasonable". They were asking
             | for literally any proof that users would benefit from the
             | proposal. That seems like an entirely reasonable thing to
             | ask before changing the way every single NPM package gets
             | published, ever.
             | 
             | I do agree that 10k/week is non-negligible. Perhaps that
             | means the people responsible for the 10k weren't in the
             | room?
        
             | bombcar wrote:
             | Or another way to look at it is it's just (at most!) 5% off
             | an already large bill, and it might cost more than that
             | elsewhere.
             | 
              | And I can buy 225 TB of bandwidth for less than $2k; I
              | assume Microsoft can get a better rate than some HN idiot
              | buying Linode.
        
             | arccy wrote:
              | Massively increase the open source GitHub Actions bill for
              | runners running longer (compute is generally more
              | expensive) to publish, in exchange for a small decrease in
              | network traffic (bandwidth is cheap at scale)?
        
         | alt227 wrote:
          | I feel massively increasing publish time is a valid reason not
          | to push this through, considering such small gains and who the
          | gains apply to.
        
           | scott_w wrote:
           | I agree, going from 1 second to 2.5 minutes is a huge
           | negative change, in my opinion. I know publishing a package
           | isn't something you do 10x a day but it's probably a big
           | enough change that, were I doing it, I'd think the publish
           | process is hanging and keep retrying it.
        
             | pletnes wrote:
             | If you're working on the build process itself, you'll
             | notice it a lot!
        
           | rererereferred wrote:
           | Since it's backwards compatible, individual maintainers could
           | enable it in their own pipeline if they don't have issues
           | with the slowdown. It sounds like it could be a single flag
           | in the publish command.
        
           | michaelmior wrote:
           | Probably not worth the added complexity, but in theory, the
           | package could be published immediately with the existing
           | compression and then in the background, replaced with the
           | Zopfli-compressed version.
        
             | Null-Set wrote:
             | No, it can't because the checksums won't match.
        
               | michaelmior wrote:
               | I don't think that's actually a problem, but it would
               | require continuing to host both versions (at distinct
               | URLs) for any users who may have installed the package
               | before the Zopfli-compressed version completed. Although
               | I think you could also get around this by tracking
               | whether the newly-released package was ever served by the
               | API. If not, which is probably the common case, the old
               | gzip-compressed version could be deleted.
        
             | hiatus wrote:
             | Wouldn't that result in a different checksum for package-
             | lock.json?
        
             | aja12 wrote:
             | > Probably not worth the added complexity, but in theory,
             | the package could be published immediately with the
             | existing compression and then in the background, replaced
             | with the Zopfli-compressed version.
             | 
              | Checksum matters aside, wouldn't that turn the 5% bandwidth
              | savings into almost double the bandwidth, though? IMHO,
              | considering the complexity of even making it a build-time
              | option, the author made the right call.
        
         | macspoofing wrote:
          | > I don't find the cons all that compelling to be honest
          | 
          | I found them reasonable.
          | 
          | The 5% improvement was balanced against the cons of increased
          | CLI complexity, lack of a native JS zopfli implementation, and
          | slower compression ... and 5% just wasn't worth it at the
          | moment - and I agree.
         | 
         | >or at least I think they warrant further discussion
         | 
         | I think that was the final statement.
        
           | sd9 wrote:
           | Yes, but there's a difference between "this warrants further
           | discussion" and "this warrants further discussion and I'm
           | closing the RFC". The latter all but guarantees that no
           | further discussion will take place.
        
             | philipwhiuk wrote:
             | No it doesn't. It only does that if you think discussion
             | around future improvements belongs in RFCs.
        
               | mcherm wrote:
               | Where DOES it belong, if not there?
        
         | jerf wrote:
         | "I don't find the cons all that compelling to be honest"
         | 
         | This is a solid example of how things change at scale. Concerns
         | I wouldn't even think about for my personal website become
         | things I need to think about for the download site being hit by
         | 50,000 of my customers become big deals when operating at the
         | scale of npm.
         | 
          | You'll find those arguments the pointless nitpicking of
          | entrenched interests who just don't want to make any changes,
          | until you experience your very own "oh man, I really thought
          | this change was perfectly safe and now my entire customer base
          | is trashed" moment, and then suddenly things like "hey, we need
          | to consider how this affects old signatures and the speed of
          | decompression, and just generally whether this is worth the
          | non-zero risks for what are in the end not really that
          | substantial benefits" start to sound reasonable.
         | 
         | I do not say this as the wise Zen guru sitting cross-legged and
         | meditating from a position of being above it all; I say it
         | looking at my own battle scars from the Perfectly Safe things
         | I've pushed out to my customer base, only to discover some tiny
         | little nit caused me trouble. Fortunately I haven't caused any
         | true catastrophes, but that's as much luck as skill.
         | 
         | Attaining the proper balance between moving forward even though
         | it incurs risk and just not changing things that are working is
         | the hardest part of being a software maintainer, because both
         | extremes are definitely bad. Everyone tends to start out in the
         | former situation, but then when they are inevitably bitten it
         | is important not to overcorrect into terrified fear of ever
         | changing anything.
        
           | sd9 wrote:
           | I agree with everything you said, but it doesn't contradict
           | my point
        
             | jerf wrote:
              | I'm saying you probably don't find them compelling because
              | from your point of view, the problems don't look important
              | to you. They don't from my point of view either. But my
              | point of view is the wrong point of view. From their point
              | of view, this would be plenty to make me think twice, and
              | several times over past that, before changing something so
              | deeply fundamental to the system for what is a benefit that
              | nobody who is actually paying the price for the package
              | size seems to be particularly enthusiastic about. If the
              | people paying the bandwidth bill aren't even that excited
              | about a 5% reduction, then the cost/benefit analysis tips
              | over into essentially "zero benefit, non-zero cost", and
              | that's not very compelling.
        
               | ffsm8 wrote:
                | Or you're not understanding how he meant it: there are
                | countless ways to roll out such changes, and a hard
                | change is likely a very bad idea, as you've correctly
                | pointed out.
                | 
                | But it is possible to do it more gradually, e.g. by
                | sneaking it in with a new API that's used by newer npm
                | versions or similar.
                | 
                | But it was his choice to make, and it's fine that he
                | didn't see enough value in pursuing such a tiny file-size
                | change.
        
               | sd9 wrote:
               | The problems look important but underexplored
        
           | pif wrote:
           | > This is a solid example of how things change at scale.
           | 
           | 5% is 5% at any scale.
        
             | michaelmior wrote:
             | Yes and no. If I'm paying $5 a month for storage, I
             | probably don't care about saving 5% of my storage costs. If
             | I'm paying $50,000/month in storage costs, 5% savings is a
             | lot more worthwhile to pursue
        
               | PaulHoule wrote:
               | Doesn't npm belong to Microsoft? It must be hosted in
               | Azure which they own so they must be paying a rock bottom
               | rate for storage, bandwidth, everything.
        
               | cwmma wrote:
               | It's probably less about MS and more about the people
               | downloading the packages
        
               | PaulHoule wrote:
               | For them it is 5% of something tiny.
        
               | imoverclocked wrote:
               | Maybe, maybe not. If you are on a bandwidth limited
               | connection and you have a bunch of NPM packages to
               | install, 5% of an hour is a few minutes saved. It's
               | likely more than that because long-transfers often need
               | to be restarted.
        
               | PaulHoule wrote:
               | A properly working cache and download manager that
               | supports resume goes a long way.
               | 
               | I could never get Docker to work on my ADSL when it was 2
               | Mbps (FTTN got it up to 20) though it was fine in the
               | Montreal office which had gigabit.
        
             | gregmac wrote:
             | 5% off your next lunch and 5% off your next car are very
             | much not the same thing.
        
               | dgfitz wrote:
               | So what, instead of 50k for a car you spend 47.5k?
               | 
               | If that moves the needle on your ability to purchase the
               | car, you probably shouldn't be buying it.
               | 
               | 5% is 5%.
        
               | post-it wrote:
                | I wouldn't pick 5¢ up off the ground but I would
                | certainly pick up $2500.
        
               | ziddoap wrote:
               | Why do so many people take illustrative examples
               | literally?
               | 
               | I'm sure you can use your imagination to substitute
               | "lunch" and "car" with other examples where the absolute
               | change makes a difference despite the percent change
               | being the same.
               | 
                | Even taking it literally... The 5% might not tip the
                | scale of whether or not I _can_ purchase the car, but
                | I'll spend a few hours of my time comparing prices at
                | different dealers to save $2500. Most people would
                | consider it dumb if you didn't shop around when making a
                | large purchase.
               | 
               | On the other hand, I'm not going to spend a few hours of
               | my time at lunch so that I can save an extra $1 on a
               | meal.
        
               | kemitche wrote:
                | If it takes 1 hour of effort to save 5%:
                | 
                | - Doing 1 hour of effort to save 5% on your $20 lunch is
                | foolhardy for most people. $1/hr is well below US minimum
                | wage.
                | 
                | - Doing 1 hour of effort to save 5% on your $50k car is
                | wise. $2500/hr is well above what most people are making
                | at work.
               | 
               | It's not about whether the $2500 affects my ability to
               | buy the car. It's about whether the time it takes me to
               | save that 5% ends up being worthwhile to me given the
               | actual amount saved.
               | 
               | The question is really "given the person-hours it takes
               | to apply the savings, and the real value of the savings,
               | is the savings worth the person-hours spent?"
        
               | jay_kyburz wrote:
               | This is something we often do in our house. We talk about
               | things in terms of hours worked rather than price. I
               | think more people should do it.
        
               | JZerf wrote:
               | Those lunches could add up to something significant over
               | time. If you're paying $10 per lunch for 10 years, that's
               | $36,500 which is pretty comparable to the cost of a car.
        
             | horsawlarway wrote:
             | 5% of newly published packages, with a potentially serious
             | degradation to package publish times for those who have to
             | do that step.
             | 
              | Given his numbers, let's say he saves 100 TB of bandwidth
              | over a year. At AWS egress pricing... that's $5,000 total
              | saved.
             | 
             | And arguably - NPM is getting at least some of that savings
             | by adding CPU costs to publishers at package time.
             | 
             | Feels like... not enough to warrant a risky ecosystem
             | change to me.
        
               | AlotOfReading wrote:
               | How often are individuals publishing to NPM? Once a day
               | at most, more typically once a week or month? A few dozen
               | seconds of one person's day every month isn't a terrible
               | trade-off.
               | 
               | Even that's addressable though if there's motivation,
               | since something like transcoding server side during
               | publication just for popular packages would probably get
               | 80% of the benefit with no client-side increase in
               | publication time.
        
               | true_religion wrote:
                | https://www.reddit.com/r/webdev/comments/1ff3ps5/these_5000_...
               | 
               | NPM uses at least 5 petabytes per week. 5% of that is 250
               | terabytes.
               | 
               | So $15,000 a week, or $780,000 a year in savings could've
               | been gained.
        
               | canucker2016 wrote:
                | In a great example of the Pareto Principle (80/20), or
                | actually something even more extreme, let's only apply
                | this Zopfli optimization if a package's weekly download
                | total is equal to or greater than 1 GiB (from the Weekly
                | Traffic in GiB column of the Top 5000 Weekly by Traffic
                | tab of the Google Sheets file from the reddit post).
               | 
               | For reference, total bandwidth used by all 5000 packages
               | is 4_752_397 GiB.
               | 
                | Packages >= 1 GiB bandwidth/week: that turns out to be
                | 437 packages (there's a header row, so it's rows 2-438),
                | which use 4_205_510 GiB.
               | 
               | So 88% of the top 5000 bandwidth is consumed by
               | downloading the top 8.7% (437) packages.
               | 
               | 5% is about 210 TiB.
               | 
               | Limiting to the top 100 packages by bandwidth results in
               | 3_217_584 GiB, which is 68% of total bandwidth used by 2%
               | of the total packages.
               | 
               | 5% is about 161 TiB.
        
             | knighthack wrote:
             | Do you even know how absolute numbers work vis-a-vis
             | percentages?
        
             | Aicy wrote:
             | That's right, and 5% of a very small number is a very small
             | number. 5% of a very big number is a big number.
        
             | syncsynchalt wrote:
             | In some scenarios the equation flips, and the enterprise is
             | looking for _more_ scale.
             | 
             | The more bandwidth that Cloudflare needs, the more leverage
             | they have at the peering table. As GitHub's largest repo
             | (the @types / DefinitelyTyped repo owned by Microsoft) gets
             | larger, the more experience the owner of GitHub (also
             | Microsoft) gets in hosting the world's largest git repos.
             | 
             | I would say this qualifies as one of those cases, as npmjs
             | is hosted on Azure. The more resources that NPM needs, the
             | more Microsoft can build towards parity with AWS's
             | footprint.
        
         | advisedwang wrote:
         | The pros aren't all that compelling either. The npm repo is the
         | only group that this would really be remotely significant for,
         | and there seemed to be no interest. So it doesn't take much of
         | a con to nix a solution to a non-problem.
        
           | ForOldHack wrote:
            | Every single download, until the end of time, is affected: it
            | speeds up the servers, speeds up the updates, saves disk
            | space on the update servers, and saves on bandwidth costs and
            | usage.
            | 
            | Everyone benefits. The only cost is an ultra-microscopic
            | amount of time on the front end and a tiny cost on the
            | client end, and for a very significant number of users, time
            | and money saved. The examples of compression here...
        
       | orta wrote:
       | Congrats on a great write-up. Sometimes trying to ship something
       | at that sorta scale turns out to just not really make sense in a
       | way that is hard to see at the beginning.
       | 
        | Another personal win is that you got a very thorough
        | understanding of the people involved and how the outreach parts
        | of the RFC process work. I've also had a few fail, but I've also
        | had a few pass! Always easier to do the next time.
        
       | stabbles wrote:
        | One thing that's excellent about zopfli (apart from being gzip
        | compatible) is how easy it is to bootstrap:
        | 
        |     git clone https://github.com/google/zopfli.git
        |     cc -O2 zopfli/src/zopfli/*.c -lm
       | 
       | It just requires a C compiler and linker.
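        | 
        | If you want to see what it does to a real npm tarball, here's a
        | rough test (package name and paths are just examples, and it
        | assumes the a.out from the build above was renamed to zopfli):
        | 
        |     npm pack react                 # fetch react-<version>.tgz from the registry
        |     gunzip -k react-*.tgz          # keep the original; yields react-<version>.tar
        |     ./zopfli --i15 react-*.tar     # writes react-<version>.tar.gz, still plain gzip/DEFLATE
        |     ls -l react-*.tgz react-*.tar.gz   # compare the two sizes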
        
         | stabbles wrote:
         | The main downside though, it's impressively slow.
         | 
         | Comparing to gzip isn't really worth it. Combine pigz
         | (threaded) with zlib-ng (simd) and you get decent performance.
         | pigz is used in `docker push`.
         | 
          | For example, gzipping llvm.tar (624MB) takes less than a second
          | for me:
          | 
          |     $ time /home/harmen/spack/opt/spack/linux-ubuntu24.04-zen2/gcc-13.2.0/pigz-2.8-5ptdjrmudifhjvhb757ym2bzvgtcsoqc/bin/pigz -k hello.tar
          |     real    0m0.779s
          |     user    0m11.126s
          |     sys     0m0.460s
         | 
         | At the same time, zopfli compiled with -O3 -march=native takes
         | 35 minutes. No wonder it's not popular.
         | 
         | It is almost _2700x_ slower than the state of the art for just
         | 6.8% bytes saved.
        
           | Levitating wrote:
           | > 2700x slower
           | 
           | That is impressively slow.
           | 
            | In my opinion even the 28x slowdown mentioned would be a
            | no-go. Sure, the package saves a few bytes, but I don't need
            | my entire PC to grind to a halt every time I publish a
            | package.
           | 
           | Besides, storage is cheap but CPU power draw is not. Imagine
           | the additional CO2 that would have to be produced if this RFC
           | was merged.
           | 
           | > 2 gigabytes of bandwidth per year across all installations
           | 
           | This must be a really rough estimate and I am curious how it
           | was calculated. In any case 2 gigabytes over _a year_ is
           | absolutely nothing. Just my home network can produce a
           | terabyte a day.
        
             | bonzini wrote:
             | 2 GB for the author's package which is neither extremely
             | common nor large; it would be 2 TB/year just for react
             | core.
        
               | Levitating wrote:
               | I am confused, how is this number calculated?
               | 
                | The author's package, Helmet[1], is 103KB _uncompressed_
                | and has had 132 versions in 13 years, meaning downloading
                | every Helmet version uncompressed would come to 132*103KB
                | = 13.7MB.
               | 
               | I feel like I must be missing something really obvious.
               | 
               | Edit: Oh it's 2GB/year _across_ all installations.
               | 
               | [1]:
               | https://www.npmjs.com/package/helmet?activeTab=versions
        
       | ape4 wrote:
       | Usually people require more than 5% to make a big change
        
         | hinkley wrote:
         | That's why our code is so slow. Dozens of poor decisions that
         | each account for 2-4% of overall time lost, but 30-60% in
         | aggregate.
        
       | fergie wrote:
       | Props to anyone who tries to make the world a better place.
       | 
        | It's not always obvious who has the most important use cases. In
       | the case of NPM they are prioritizing the user experience of
       | module authors. I totally see how this change would be great for
       | module consumers, yet create potentially massive inconvenience
       | for module authors.
       | 
       | Interesting write-up
        
         | atiedebee wrote:
         | I think "massive" is overstating it. I don't think deploying a
         | new version of a package is something that happens many times a
         | day, so it wouldn't be a constant pain point.
         | 
         | Also, since this is a case of having something compressed once
         | and decompressed potentially thousands of times, it seems like
         | the perfect tool for the job.
        
           | philipwhiuk wrote:
           | Every build in a CI system would probably create the package.
           | 
           | This is changing every build in every CI system to make it
           | slower.
        
             | mkesper wrote:
             | Just use it on the release build.
        
       | avodonosov wrote:
        | My experiment on how to reduce the JavaScript size of every web
        | app by 30-50%: https://github.com/avodonosov/pocl
       | 
       | Working approach, but in the end I abandoned the project - I
       | doubt people care about such js size savings.
        
         | dagelf wrote:
         | Wdym?? 50% is a big deal
        
           | bluGill wrote:
            | 50% size savings isn't important to the people who pay for
            | it. Even 100% savings (that is, somehow all the functionality
            | in zero bytes) would be worth at most pennies to those paying
            | the bills.
        
             | tyre wrote:
              | Size savings translate to latency improvements, which
              | directly affect conversion rates. Smaller size isn't about
              | reducing costs but about increasing revenue. People care.
        
               | soared wrote:
               | Agreed - often a CTO of an ecom site is very very focused
               | on site speed and has it as their #1 priority since it
               | directly increases revenue.
        
               | fwip wrote:
               | Note that this proof-of-concept implementation saves
               | latency on first load, but may add latency at surprising
               | points while using the website. Any user invoking a
               | rarely-used function would see a delay before the
               | javascript executes, without the traditional UI
               | affordances (spinners etc) to indicate that the
               | application was waiting on the network. Further, these
               | secretly-slow paths may change from visit to visit. Many
               | users know how to "wait for the app to be ready," but the
               | traditional expectation is that once it's loaded, the
               | page/app will work, and any further delays will be
               | signposted.
               | 
               | I'm sure it works great when you've got high-speed
               | internet, but might break things unacceptably for users
               | on mobile or satellite connections.
        
               | vlovich123 wrote:
               | > without the traditional UI affordances (spinners etc)
               | to indicate that the application was waiting on the
               | network.
               | 
               | This part is obviously trivially solvable. I think the
               | same basic idea is going to at some point make it but
               | it'll have to be through explicit annotations first and
               | then there will be tooling to automatically do this for
               | your code based upon historical visits where you get to
               | tune the % of visitors that get additional fetches. Also,
               | you could probably fetch the split off script in the
               | background anyway as a prefetch + download everything
               | rather than just 1 function at a time (or even
               | downloading related groups of functions together)
               | 
               | The idea has lots of merit and you just have to execute
               | it right.
        
         | philipwhiuk wrote:
         | How do you evaluate call usage?
        
         | KTibow wrote:
         | I think this is called tree shaking and Vite/Rollup do this by
         | default these days. Of course, it's easy when you explicitly
         | say what you're importing.
        
           | avodonosov wrote:
           | That's not tree-shaking.
        
         | hinkley wrote:
         | I got measurable decreases in deployment time by shrinking the
         | node_modules directory in our docker images.
         | 
         | I think people forget that, when you're copying the same images
         | to dozens and dozens of boxes, any improvement starts to add up
         | to real numbers.
        
           | syncsynchalt wrote:
           | I've not done it, but have you considered using `pnpm` and
           | volume-mounting a shared persistent `pnpm-store` into the
           | containers? It seems like you'd get near-instant npm installs
           | that way.
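            | 
            | Something like this ought to work (untested sketch; the image
            | tag, paths and the corepack step are assumptions):
            | 
            |     docker volume create pnpm-store
            |     docker run --rm -v pnpm-store:/pnpm -v "$PWD":/app -w /app \
            |       node:22 sh -c "corepack enable \
            |         && pnpm config set store-dir /pnpm \
            |         && pnpm install"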
        
             | hinkley wrote:
             | The only time npm install was on the critical path was
             | hotfixes. It's definitely worth considering. But I was
             | already deep into doing people giant favors that they
             | didn't even notice, so I was juggling many other goals. I
             | think the only thank you I got was from the UI lead, who
             | had some soda straw internet connection and this and
             | another thing I did saved him a bunch of hard to recover
             | timeouts.
        
       | jsheard wrote:
       | I wonder if it would make more sense to pursue Brotli at this
       | point, Node has had it built-in since 10.x so it should be pretty
       | ubiquitous by now. It would require an update to NPM itself
       | though.
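        | 
        | For a rough idea of the ratio difference, Node's built-in zlib
        | bindings can do the comparison directly (file name is just an
        | example):
        | 
        |     node -e 'const z=require("zlib"),fs=require("fs");
        |       const buf=fs.readFileSync("react.tar");
        |       console.log("gzip -9  :", z.gzipSync(buf,{level:9}).length);
        |       console.log("brotli 11:", z.brotliCompressSync(buf,
        |         {params:{[z.constants.BROTLI_PARAM_QUALITY]:11}}).length);'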
        
       | pornel wrote:
        | It only doesn't apply to existing _versions_ of existing
        | packages. Newer releases would apply Zopfli, so over time the
        | majority of actively used/maintained packages would likely be
        | recompressed.
        
       | choobacker wrote:
       | Nice write up!
       | 
       | > When it was finally my turn, I stammered.
       | 
       | > Watching it back, I cringe a bit. I was wordy, unclear, and
       | unconvincing.
       | 
       | > You can watch my mumbling in the recording
       | 
       | I watched this, and the author was articulate and presented well.
       | The author is too harsh!
       | 
       | Good job for trying to push the boundaries.
        
       | hartator wrote:
        | I mean, 4-5% off the size for 10-100x the compression time is
        | not worth it.
        
         | swiftcoder wrote:
         | That's not actually so straightforward. You pay the 10-100x
         | slowdown _once_ on the compressing side, to save 4-5% on
         | _every_ download - which for a popular package one would expect
         | downloads to be in the millions.
        
           | philipwhiuk wrote:
           | The downloads are cached. The build happens on every publish
           | for every CI build.
        
         | inglor_cz wrote:
         | As the author himself said, just React was downloaded half a
         | billion times; that is a lot of saved bandwidth on both sides,
         | but especially so for the server.
         | 
          | Maybe it would make sense to only apply this improvement to
          | packages that a) are very big or b) get downloaded at least a
          | million times each year or so. That would cover most of the
          | savings while leaving most packages and developers out of it.
        
         | bonzini wrote:
         | Assuming download and decompression cost to be proportional to
         | the size of the incoming compressed stream, it would break even
         | at 2000 downloads. A big assumption I know, but 2000 is a
         | really small number.
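          | 
          | Presumably the back-of-the-envelope (my assumption, it isn't
          | spelled out above) is that zopfli costs roughly 100x the
          | compression work up front while each download saves ~5%:
          | 
          |     break-even ~= 100 / 0.05 = 2000 downloads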
        
         | liontwist wrote:
         | It absolutely is. Packages are zipped once and downloaded
         | thousands of times.
        
       | cedws wrote:
        | Last I checked, npm packages were full of garbage, including
        | non-source code. There's no reason for node_modules to be as big
        | as it usually is; text compresses extremely well. It's just
        | general sloppiness endemic to the JavaScript ecosystem.
        
         | MortyWaves wrote:
         | Totally agree with you. I wish npm did a better job of
         | filtering the crap files out of packages.
        
         | Alifatisk wrote:
          | At least switch to pnpm to minimize the bloat.
        
           | jefozabuss wrote:
            | I just installed a project with pnpm, about 120 packages,
            | mostly react/webpack/eslint/redux related:
            | 
            | with prod env: 700MB
            | 
            | without prod env: 900MB
            | 
            | Sadly the bloat cannot be avoided that well :/
        
             | jeffhuys wrote:
             | pnpm stores them in a central place and symlinks them.
             | You'll see the benefits when you have multiple projects
             | with a lot of the same packages.
        
               | syncsynchalt wrote:
               | You'll also see the benefit when `rm -rf`ing a
               | `node_modules` and re-installing, as pnpm still has a
               | local copy that it can re-link after validating its
               | integrity.
        
         | vinnymac wrote:
         | You might be interested in e18e if you would like to see that
         | change: https://e18e.dev/
         | 
         | They've done a lot of great work already.
        
           | KTibow wrote:
           | Does this replace ljharb stuff?
        
         | hinkley wrote:
         | I believe I knocked 10% off of our node_modules directory by
         | filing .npmignore PRs or bug reports to tools we used.
         | 
         | Now if rxjs weren't a dumpster fire...
        
         | TheRealPomax wrote:
         | That's on the package publishers, not NPM. They give you an
         | `.npmignore` that's trivially filled out to ensure your package
         | isn't full of garbage, so if someone doesn't bother using that:
         | that's on them, not NPM.
         | 
         | (And it's also a little on the folks who install dependencies:
         | if the cruft in a specific library bothers you, hit up the repo
         | and file an issue (or even MR/PR) to get that .npmignore file
         | filled out. I've helped folks reduce their packages by 50+MB in
         | some cases, it's worth your own time as much as it is theirs)
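          | 
          | For anyone who hasn't done it, a minimal sketch of the kind of
          | .npmignore that cuts most of the usual cruft (the patterns are
          | just typical examples, adjust to the repo):
          | 
          |     cat > .npmignore <<'EOF'
          |     test/
          |     __tests__/
          |     coverage/
          |     docs/
          |     examples/
          |     .github/
          |     *.test.js
          |     *.map
          |     EOF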
        
         | eitau_1 wrote:
          | It's not even funny:
          | 
          |     $ ll /nix/store/*-insect-5.9.0/lib/node_modules/insect/node_modules/clipboardy/fallbacks/*
          |     /nix/store/...-insect-5.9.0/lib/node_modules/insect/node_modules/clipboardy/fallbacks/linux:
          |     .r-xr-xr-x 129k root  1 Jan  1970 xsel
          | 
          |     /nix/store/...-insect-5.9.0/lib/node_modules/insect/node_modules/clipboardy/fallbacks/windows:
          |     .r-xr-xr-x 444k root  1 Jan  1970 clipboard_i686.exe
          |     .r-xr-xr-x 331k root  1 Jan  1970 clipboard_x86_64.exe
         | 
         | (clipboardy ships executables and none of them can be run on
         | NixOS btw)
        
           | cedws wrote:
           | Are they reproducible? Shipping binaries in JS packages is
           | dodgy AF - a Jia Tan attack waiting to happen.
        
             | eitau_1 wrote:
             | The executables are vendored in the repo [0].
             | 
              | [0] https://github.com/sindresorhus/clipboardy/tree/main/fallbac...
        
           | dicytea wrote:
           | I don't know why, but clipboard libraries tend to be really
           | poorly implemented, especially in scripting languages.
           | 
           | I just checked out clipboardy and all they do is dispatch
           | binaries from the path and hope it's the right one (or if
           | it's even there at all). I think I had a similar experience
           | with Python and Lua scripts. There's an unfunny amount of
           | poorly-written one-off clipboard scripts out there just
           | waiting to be exploited.
           | 
           | I'm only glad that the go-to clipboard library in Rust
           | (arboard) seems solid.
        
         | hombre_fatal wrote:
         | One of the things I like about node_modules is that it's not
         | purely source code and it's not purely build artifacts.
         | 
         | You can read the code and you can usually read the actual
         | README/docs/tests of the package instead of having to find it
         | online. And you can usually edit library code for debugging
         | purposes.
         | 
         | If node_modules is taking up a lot of space across a bunch of
         | old projects, just write the `find` script that recursively
         | deletes them all; You can always run `npm install` in the
         | future when you need to work on that project again.
        
       | glenjamin wrote:
        | This strikes me as something that could be done for the highest
        | traffic packages at the backend, rather than be driven by the
        | client at publish time.
        
         | fastest963 wrote:
         | The article talks about this. There are hashes that are
         | generated for the tarball so the backend can't recompress
         | anything.
        
       | BonoboIO wrote:
        | In summary: it's a nice feature, which gives nice benefits for
        | often-downloaded packages, but nobody at npm cares about the
        | bandwidth?
        
       | nikeee wrote:
        | I don't see why it wouldn't be possible to hide this behind a
        | flag once Node.js supports zopfli natively. In the case of
        | CI/CD, it's totally feasible to just add a --strong-compression
        | flag. In that case, the user expects it to take its time.
        | 
        | TS releases a non-preview version every few months, so spending
        | 2.5 minutes on compression would work.
        
       | abound wrote:
       | A few people have mentioned the environmental angle, but I'd care
       | more about if/how much this slows down decompression on the
       | client. Compressing React 20x slower once is one thing, but 50
       | million decompressions being even 1% slower is likely net more
       | energy intensive, even accounting for the saved energy
       | transmitting 4-5% fewer bits on the wire.
        
         | web007 wrote:
         | It's very likely zero or positive impact on the decompression
         | side of things.
         | 
         | Starting with smaller data means everything ends up smaller.
         | It's the same decompression algorithm in all cases, so it's not
         | some special / unoptimized branch of code. It's yielding the
         | same data in the end, so writes equal out plus or minus disk
         | queue fullness and power cycles. It's _maybe_ better for RAM
         | and CPU because more data fits in cache, so less memory is used
         | and the compute is idle less often.
         | 
         | It's relatively easy to test decompression efficiency if you
         | think CPU time is a good proxy for energy usage: go find
         | something like React and test the decomp time of gzip -9 vs
         | zopfli. Or even better, find something similar but much bigger
         | so you can see the delta and it's not lost in rounding errors.
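          | 
          | A rough sketch of that test (file names are examples; assumes
          | a zopfli binary built as described elsewhere in the thread):
          | 
          |     gzip -9 -c react.tar > react.gzip9.tar.gz
          |     ./zopfli --i15 react.tar             # writes react.tar.gz
          |     time gunzip -c react.gzip9.tar.gz > /dev/null
          |     time gunzip -c react.tar.gz > /dev/null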
        
         | sltkr wrote:
         | For formats like deflate, decompression time doesn't generally
         | depend on compressed size. (zstd is similar, though memory use
         | can depend on the compression level used).
         | 
         | This means an optimization like this is virtually guaranteed to
         | be a net positive on the receiving end, since you always save a
         | bit of time/energy when downloading a smaller compressed file.
        
       | adgjlsfhk1 wrote:
        | This seems like a place where the more ambitious version that
        | switches to ZSTD might have better tradeoffs. You would get
        | similar or better compression, with faster decompression and
        | recompression than zopfli. It would lose backward compatibility
        | though...
        
         | bufferoverflow wrote:
         | Brotli and lzo1b have good compression ratios and pretty fast
         | decompression speeds. Compression speed should not matter that
         | much, since you only do it once.
         | 
         | https://quixdb.github.io/squash-benchmark/
         | 
          | There are even more obscure options:
         | 
         | https://www.mattmahoney.net/dc/text.html
        
         | vlovich123 wrote:
          | Not necessarily - you could retain backward compat by
          | publishing both gzip and zstd variants and having downloaders
          | with newer npm versions prefer the zstd one. Over time, you
          | could require packages to upload only zstd going forward, and
          | either generate zstd versions of the backlog of unmaintained
          | packages or at least of those that see some amount of traffic
          | over some time period, if you're willing to drop very old
          | packages. The ability to install arbitrary versions of
          | packages means you're probably better off reprocessing the
          | backlog, although that may cost more than doing nothing.
         | 
         | The package lock checksum is probably a more solvable issue
         | with some coordination.
         | 
         | The benefit of doing this though is less immediate - it will
         | take a few years to show payoff and these kinds of payoffs are
         | not typically made by the kind of committee decisions process
         | described (for better or worse).
        
       | chuckadams wrote:
       | Switching to a shared cache in the fashion of pnpm would
       | eliminate far more redundant downloads than a compression
       | algorithm needing 20x more CPU.
        
       | liontwist wrote:
       | The fact that you are pursuing this is admirable.
       | 
       | But this whole thing sounds too much like work. Finding
       | opportunities, convincing entrenched stakeholders, accommodating
       | irrelevant feedback, pitching in meetings -- this is the kind of
       | thing that top engineers get paid a lot of money to do.
       | 
       | For me personally open source is the time to be creative and
       | free. So my tolerance for anything more than review is very low.
       | And I would have quit at the first roadblock.
       | 
        | What's a little sad is that NPM should not be operating like a
        | company with 1000+ employees. The "persuade us users want this"
        | approach is only going to stop volunteers. They should be
        | proactively identifying efforts like this and helping you bring
        | them across the finish line.
        
         | coliveira wrote:
         | The problem is that this guy is treating open source as if it
         | was a company where you need to direct a project to completion.
         | Nobody in open source wants to be told what to do. Just release
         | your work, if it is useful, the community will pick it up and
         | everybody will benefit. You cannot force your improvement into
         | the whole group, even if it is beneficial in your opinion.
        
           | liontwist wrote:
           | > where you need to direct a project to completion
           | 
           | Do you want to get a change in, or not?
           | 
           | Is this a project working with the community or not?
           | 
           | > Just release your work
           | 
           | What motivation exists to optimize package formats if nobody
           | uses that package format? There are no benefits unless it's
           | in mainline.
           | 
           | > Nobody in open source wants to be told what to do
           | 
           | He's not telling anybody to do work. He is sharing an
           | optimization with clear tradeoffs - not a new architecture.
           | 
           | > You cannot force your improvement into the whole group
           | 
           | Nope, but communication is key. "put up a PR and we will let
           | you know whether it's something we want to pursue".
           | 
           | Instead they put him through several levels of gatekeepers
           | where each one gave him incorrect feedback.
           | 
           | "Why do we want to optimize bandwidth" is a question they
           | should have the answer to.
           | 
            | If this PR showed up on my project I would say: "I'm worried
            | about X, Y, and Z. We will set up a test for X and Y and get
            | back to you. Could you look into Z?"
        
         | gjsman-1000 wrote:
         | > What's a little sad, is NPM should not be operating like a
         | company with 1000+ employees. The "persuade us users want this"
         | approach is only going to stop volunteers. They should be
         | proactively identifying efforts like this and helping you bring
         | it across the finish line.
         | 
         | Says who?
         | 
         | Says an engineer? Says a product person?
         | 
          | NPM is a company with 14 employees, with a system integrated
          | into countless extremely niche and weird integrations they
          | cannot control. Many of those integrations might make a
          | professional engineer's hair catch fire - "it should never be
          | done this way!" - but in the real world, the wrong way is the
          | majority of the time. There's no guarantee that many of the
          | downloads come from the official client, just as one example.
         | 
         | The last thing they need, or I want, or any of their customers
         | want, or their 14 employees need, is something that might break
         | backwards compatibility in an extremely niche case, anger a
         | major customer, cause countless support tickets, all for a tiny
         | optimization nobody cares about.
         | 
          | This is something I've learned here about HN that, for my own
          | mental health, I now dismiss: engineers are obsessed with 2%
          | optimizations here, 5% optimizations there; unchecked, it will
          | literally turn into an OCD outlet, all for things nobody in the
          | non-tech world even notices, let alone asks about. Just let it
          | go.
        
           | liontwist wrote:
           | Open source needs to operate differently than a company
           | because people don't have time/money/energy to deal with
           | bullshit.
           | 
           | Hell. Even 15 employees larping as a corporation is going to
           | be inefficient.
           | 
            | What you and NPM are telling us is that they are happy to
            | take free labor, but this is not an open source project.
           | 
           | > Engineers are obsessed with 2% optimizations here
           | 
           | Actually in large products these are incredible finds. But
           | ok. They should have the leadership to know which bandwidth
           | tradeoffs they are committed to and tell him immediately it's
           | not what they want, rather than sending him to various
           | gatekeepers.
        
             | gjsman-1000 wrote:
             | Correct; NPM is not an "open source project" in the sense
             | of a volunteer-first development model. Neither is Linux -
             | over 80% of commits are corporate, and have been for a
             | decade. Neither is Blender anymore - the Blender
             | Development Fund raking in $3M a year calls the shots.
             | Every successful "large" open source project has outgrown
             | the volunteer community.
             | 
             | > Actually in large products these are incredible finds.
             | 
             | In large products, incredible finds may be true; but
             | breaking compatibility with just 0.1% of your customers is
             | also an incredible disaster.
        
               | liontwist wrote:
               | > breaking compatibility with just 0.1%
               | 
               | Yes. But in this story nothing like that happened.
        
               | gjsman-1000 wrote:
               | But NPM has no proof their dashboard won't light up full
               | of corporate customers panicking the moment it goes to
               | production; because their hardcoded integration to have
               | AWS download packages and decompress them with a Lambda
               | and send them to an S3 bucket can no longer decompress
               | fast enough while completing other build steps to avoid
               | mandatory timeouts; just as one stupid example of
               | something that could go wrong. IT is also demanding now
               | that NPM fix it rather than modify the build pipeline
               | which would take weeks to validate, so corporate's
               | begging NPM to fix it by Tuesday's marketing blitz.
               | 
               | Just because it's safe in a lab provides no guarantee
               | it's safe in production.
        
               | liontwist wrote:
               | Ok, but why is the burden on him to show that? Are they
               | not interested in improving bandwidth and speed for their
               | users?
               | 
               | The conclusion of this line of reasoning is to never make
               | any change.
               | 
               | If contributions are not welcome, don't pretend they are
               | and waste my time.
               | 
               | > can no longer decompress fast enough
               | 
               | Already discussed this in another thread. It's not an
               | issue.
        
               | maccard wrote:
               | That's an argument against making any change to the
               | packaging system ever. "It might break something
               | somewhere" isn't an argument, it's a paralysis against
               | change. Improving the edge locality of delivery of npm
               | packages could speed up npm installs. But speeding up npm
               | installs might cause the CI system which is reliant on it
               | for concurrency issues to have a race condition. Does
               | that mean that npm can't ever make it faster either?
        
               | gjsman-1000 wrote:
               | It is an argument. An age old argument:
               | 
               | "If it ain't broke, don't fix it."
        
               | liontwist wrote:
               | disable PRs if this is your policy.
        
               | maccard wrote:
                | This attitude is how, in an age of gigabit fiber, 4 GB/s
                | hard drive write speeds, and 8x4 GHz cores with SIMD
                | instructions, it takes 30+ seconds to bundle a handful of
                | JavaScript files.
        
           | stonemetal12 wrote:
           | NPM is a webservice. They could package the top 10-15
           | enhancements call it V2. When 98% of traffic is V2 turn V1
           | off. Repeat every 10 years or so until they work their way
           | into having a good protocol.
        
           | maccard wrote:
           | > Engineers are obsessed with 2% optimizations here, 5%
           | optimizations there; unchecked, it will literally turn into
           | an OCD outlet, all for things nobody in the non-tech world
           | even notices, let alone asks about. Just let it go.
           | 
           | I absolutely disagree with you. If the world took more of
           | those 5% optimisations here and there everything would be
           | faster. I think more people should look at those 5%
           | optimisations. In many cases they unlock knowledge that
           | results in a 20% speed up later down the line. An example
            | from my past - I was tasked with reducing the runtime of a
            | one-shot tool we were using at $JOB. It was taking about
            | 15 minutes to run. I shaved off seconds here and there with
           | some fine grained optimisations, and tens of seconds with
           | some modernisation of some core libraries. Nothing earth
           | shattering but improvements none the less. One day, I noticed
           | a pattern was repeating and I was fixing an issue for the
           | third time in a different place (searching a gigantic array
           | of stuff for a specific entry). I took a step back and
           | realised that if I replaced the mega list with a hash table
           | it might fix every instance of this issue in our app. It was
           | a massive change, touching pretty much every file. And all of
           | a sudden our 15 minute runtime was under 30 seconds.
           | 
           | People used this tool every day, it was developed by a team
           | of engineers wildly smarter than me. But it had grown and
           | nobody really understood the impact of the growth. When it
           | started that array was 30, 50 entries. On our project it was
           | 300,000 and growing every day.
           | 
           | Not paying attention to these things causes decay and rot.
           | Not every change should be taken, but more people should
           | care.
        
           | lyu07282 wrote:
           | > Says who?
           | 
           | > Says an engineer?
           | 
           | I prevent cross-site scripting, I monitor for DDoS attacks,
           | emergency database rollbacks, and faulty transaction
           | handlings. The Internet heard of it? Transfers half a
           | petabyte of data every minute. Do you have any idea how that
           | happens? All those YouPorn ones and zeroes streaming directly
           | to your shitty, little smart phone day after day? Every
           | dipshit who shits his pants if he can't get the new dubstep
           | Skrillex remix in under 12 seconds? It's not magic, it's
           | talent and sweat. People like me, ensuring your packets get
           | delivered, un-sniffed. So what do I do? I make sure that one
           | bad config on one key component doesn't bankrupt the entire
           | fucking company. That's what the fuck I do.
        
         | efitz wrote:
          | I think the reason NPM responded this way is that it was a
          | premature optimization.
         | 
         | If/when NPM has a problem - storage costs are too high, or
         | transfer costs are too high, or user feedback indicates that
         | users are unhappy with transfer sizes - then they will be ready
         | to listen to this kind of proposal.
         | 
         | I think their response was completely rational, especially
         | given a potentially huge impact on compute costs and/or
         | publication latency.
        
           | maccard wrote:
            | I disagree with this being a premature optimisation.
            | Treating everything you haven't personally identified as a
            | problem as premature optimisation is cargo culting in its
            | own way. The attitude of not caring is why npm and various
            | tools are so slow.
           | 
           | That said, I think NPM's response was totally correct -
           | explain the problem and the tradeoffs. And OP decided the
           | tradeoffs weren't worth it, which is totally fair.
        
         | Cthulhu_ wrote:
         | While NPM is open source, it's in the awkward spot of also
         | having... probably hundreds of thousands if not millions of
         | professional applications depend on it; it _should_ be run like
         | a business, because millions depend on it.
         | 
          | ...which makes it all the weirder that security isn't any
         | better, as in, publishing a package can be done without a
         | review step on the npm side, for example. I find it strange
         | that they haven't doubled down on enterprise offerings, e.g.
         | creating hosted versions (corporate proxies), reviewed /
         | validated / LTS / certified versions of packages, etc.
        
       | liontwist wrote:
        | Why would a more complex zip slow down decompression? This
        | comment seems to misunderstand how these formats work. OP is
        | right.
        
       | dvh wrote:
        | Imagine being in the middle of nowhere, in winter, on Saturday
        | night, on some farm, knee deep in cow piss, servicing some 3rd
        | party feed dispenser, only to discover that you have a possible
        | solution but it's in some obscure format instead of .tar.gz.
        | Nearest internet 60 miles away. This is what I always imagine
        | happening when some new obscure format comes into play: imagine
        | the poor fella, alone, cold, screaming. So incredibly close to
        | his goal, but ultimately stopped by some artificial, unnecessary,
        | made-up bullshit.
        
         | tehjoker wrote:
         | I believe zopfli compression is backwards compatible with
         | DEFLATE, it just uses more CPU during the compression phase.
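
          This is easy to check locally, assuming the zopfli CLI is
          installed (which, as I recall, writes <input>.gz next to the
          input and keeps the original): the output is an ordinary
          gzip/DEFLATE stream, so the standard zlib gunzip that every
          npm client already uses can read it.

            import { execFileSync } from "node:child_process";
            import { readFileSync } from "node:fs";
            import { gunzipSync } from "node:zlib";

            // Compress with zopfli (slow, but emits a standard gzip stream).
            execFileSync("zopfli", ["package.tar"]); // writes package.tar.gz

            // Decompress with ordinary zlib; no zopfli needed on this side.
            const original = gunzipSync(readFileSync("package.tar.gz"));
            console.log("round-tripped bytes:", original.length);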
        
       | PaulHoule wrote:
       | Years back I came to the conclusion that conda using bzip2 for
       | compression was a big mistake.
       | 
       | Back then if you wanted to use a particular neural network it was
       | meant for a certain version of Tensorflow which expected you to
       | have a certain version of the CUDA libs.
       | 
        | If you had to work with multiple models, the "normal" way to do
        | things was to use the developer-unfriendly [1][2] installers from
        | NVIDIA to install a single version of the libs at a time.
       | 
        | It turned out you could have many versions of CUDA installed as
        | long as you kept them in different directories and set the
        | library path accordingly, so it made sense to pack them up for
        | conda and install them together with everything else.
       | 
       | But oh boy was it slow to unpack those bzip2 packages! Since
       | conda had good caching, if you build environments often at all
       | you could be paying more in decompress time than you pay in
       | compression time.
       | 
       | If you were building a new system today you'd probably use zstd
       | since it beats gzip on both speed and compression.
       | 
       | [1] click... click... click...
       | 
       | [2] like they're really going to do something useful with my
       | email address
        
         | zahlman wrote:
         | >But oh boy was it slow to unpack those bzip2 packages! Since
         | conda had good caching, if you build environments often at all
         | you could be paying more in decompress time than you pay in
         | compression time.
         | 
         | For Paper, I'm planning to cache both the wheel archives (so
         | that they're available without recompressing on demand) and
         | unpacked versions (installing into new environments will
         | generally use hard links to the unpacked cache, where
         | possible).
         | 
         | > If you were building a new system today you'd probably use
         | zstd since it beats gzip on both speed and compression.
         | 
         | FWIW, in my testing LZMA is a big win (and I'm sure zstd would
         | be as well, but LZMA has standard library support already). But
         | there are serious roadblocks to adopting a change like that in
         | the Python ecosystem. This sort of idea puts them several
         | layers deep in meta-discussion - see for example
         | https://discuss.python.org/t/pep-777-how-to-re-invent-the-wh...
         | . In general, progress on Python packaging gets stuck in a
         | double-bind: try to change too little and you won't get any
         | buy-in that it's worthwhile, but try to change too much and
         | everyone will freak out about backwards compatibility.
        
       | kittikitti wrote:
       | Thank you so much for posting this. The original logic was clear
       | and it had me excited! I believe this is useful because
       | compression is very common and although it might not fit
       | perfectly in this scenario, it could very well be a breakthrough
       | in another. If I come across a framework that could also benefit
       | from this compression algorithm, I'll be sure to give you credit.
        
       | bhouston wrote:
       | What about a different approach - an optional npm proxy that
       | recompresses popular packages with 7z/etc in the background?
       | 
       | Could verify package integrity by hashing contents rather than
       | archives, plus digital signatures for recompressed versions. Only
       | kicks in for frequently downloaded packages once compression is
       | ready.
       | 
       | Benefits: No npm changes needed, opt-in only, potential for big
       | bandwidth savings on popular packages. Main tradeoff is
       | additional verification steps, but they could be optional given a
       | digital signature approach.
       | 
       | Curious if others see major security holes in this approach?
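
        A very rough sketch of such a proxy, using Node built-ins and
        Brotli instead of 7z (since HTTP clients can already negotiate
        br); the upstream URL handling, caching, signing, and error
        handling are all elided and purely illustrative.

          import { createServer } from "node:http";
          import { brotliCompressSync, gunzipSync } from "node:zlib";

          // Proxy a registry tarball: fetch the original .tgz, unpack it
          // to a .tar, and re-compress with Brotli when the client says
          // it can handle it.
          createServer(async (req, res) => {
            const upstream = await fetch(`https://registry.npmjs.org${req.url}`);
            const tgz = Buffer.from(await upstream.arrayBuffer());

            const accepts = String(req.headers["accept-encoding"] ?? "");
            if (accepts.includes("br")) {
              res.writeHead(200, {
                "Content-Encoding": "br",
                "Content-Type": "application/octet-stream",
              });
              res.end(brotliCompressSync(gunzipSync(tgz)));
            } else {
              res.writeHead(200, { "Content-Type": "application/octet-stream" });
              res.end(tgz); // fall back to the original archive untouched
            }
          }).listen(8080);

        The catch, as noted elsewhere in the thread, is that existing
        lockfile integrity hashes cover the original .tgz bytes, so
        clients would have to verify a hash of the contents instead, as
        the comment above suggests.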
        
         | ndriscoll wrote:
         | This felt like the obvious way to do things to me: hash a .tar
         | file, not a .tar.gz file. Use Accept-Encoding to negotiate the
         | compression scheme for transfers. CDN can compress on the fly
         | or optionally cache precompressed files. i.e. just use standard
         | off-the-shelf HTTP features. These days I prefer uncompressed
          | .tar files anyway because ZFS has transparent zstd, so an
          | uncompressed archive generally ends up taking less space on
          | disk than a .gz would.
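
          A rough sketch of that flow in Node (the URL and expected
          digest are illustrative, not npm's actual API): the integrity
          hash is taken over the decompressed .tar bytes, so the wire
          encoding can change freely without breaking it.

            import { createHash } from "node:crypto";

            async function fetchAndVerifyTar(
              url: string, expectedSha512: string
            ): Promise<Buffer> {
              // fetch advertises gzip/br via Accept-Encoding and transparently
              // decompresses the response, handing back the raw .tar bytes.
              const res = await fetch(url);
              const tar = Buffer.from(await res.arrayBuffer());

              // Hash the uncompressed tar, not whatever was on the wire.
              const digest = createHash("sha512").update(tar).digest("base64");
              if (digest !== expectedSha512) throw new Error("integrity mismatch");
              return tar;
            }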
        
           | cesarb wrote:
           | > hash a .tar file, not a .tar.gz file
           | 
           | For security reasons, it's usually better to hash the
           | compressed file, since it reduces the attack surface: the
           | decompressor is not exposed to unverified data. There have
           | already been vulnerabilities in decompressor implementations
           | which can be exploited through malformed compressed data (and
           | this includes IIRC at least one vulnerability in zlib, which
           | is the standard decompressor for .gz).
        
             | bhouston wrote:
             | This suggests one should just upload a tar rather than a
             | compressed file. Makes sense because one can scan the
             | contents for malicious files without risking a decompressor
             | bug.
             | 
              | BTW npm decompresses all packages anyway, since it lets
              | you view their contents on its website these days.
        
           | bhouston wrote:
           | You are correct. They should be uploading and downloading
           | dumb tar files and let the HTTP connection negotiate the
           | compression method. All hashes should be based on the
           | uncompressed raw tar dump. This would be proper separation of
           | concerns.
        
       | hinkley wrote:
       | Does npm even default to gzip -9? Wikipedia claims zopfli is 80
       | times slower under default settings.
        
         | zahlman wrote:
         | My experience has been that past -6 or so, gzip files get only
         | a tiny bit smaller in typical cases. (I think I've even seen
         | them get bigger with -9.)
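
          This is easy to measure with Node's built-in zlib (the input
          file here is just an example); levels above 6 typically shave
          off only a fraction of a percent, consistent with the comment
          above.

            import { readFileSync } from "node:fs";
            import { gzipSync } from "node:zlib";

            const tar = readFileSync("package.tar"); // any large-ish input
            for (const level of [1, 6, 9]) {
              const size = gzipSync(tar, { level }).length;
              console.log(`gzip level ${level}: ${size} bytes`);
            }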
        
       | jefozabuss wrote:
        | I wonder what the average tarball size difference would be if
        | you downloaded everything in one tarball (the full package
        | list) instead of 1-by-1, as gzip compression would work much
        | better in that case (a quick local experiment is sketched
        | after this comment).
       | 
       | Also for bigger companies this is not really a "big" problem as
       | they usually have in-house proxies (as you cannot rely on a 3rd
       | party repository in CI/CD for multiple reasons (security, audit,
       | speed, etc)).
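
        The sketch referenced above: compress a directory of files one
        by one and compare against a single gzip over the concatenation
        (the input path is illustrative; a tar of the files would be the
        fairer comparison, but the effect is the same).

          import { readFileSync, readdirSync } from "node:fs";
          import { join } from "node:path";
          import { gzipSync } from "node:zlib";

          const dir = "node_modules/lodash"; // illustrative input
          const files = readdirSync(dir)
            .filter((name) => name.endsWith(".js"))
            .map((name) => readFileSync(join(dir, name)));

          // Sum of per-file gzips vs. one gzip over everything at once.
          const separate = files.reduce((sum, buf) => sum + gzipSync(buf).length, 0);
          const together = gzipSync(Buffer.concat(files)).length;
          console.log({ separate, together });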
        
       | snizovtsev wrote:
       | Yes, but it was expected. It's like prioritising code readability
       | over performance everywhere but the hot path.
       | 
       | Earlier in my career, I managed to use Zopfli once to compress
       | gigabytes of PNG assets into a fast in-memory database supporting
       | a 50K+ RPS web page. We wanted to keep it simple and avoid the
       | complexity of horizontal scaling, and it was OK to drop some
       | rarely used images. So the more images we could pack into a
       | single server, the more coverage we had. In that sense Zopfli was
       | beneficial.
        
       | 1337shadow wrote:
        | Ok, but why doesn't the npm registry recompress the archives
        | itself? It could even apply that retroactively, and it wouldn't
        | require zopfli in the npm CLI.
        
         | aseipp wrote:
         | Hashes of the tarballs are recorded in the package-lock.json of
         | downstream dependants, so recompressing the files in place will
         | cause the hashes to change and break everyone. It has to be
         | done at upload time.
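
          A small sketch of why in-place recompression breaks things (the
          file name is illustrative): as I understand it, the lockfile's
          integrity value is a sha512 over the compressed tarball bytes
          exactly as served, so a different gzip stream changes the
          digest even when the unpacked contents are identical.

            import { createHash } from "node:crypto";
            import { readFileSync } from "node:fs";

            // package-lock.json records something like:
            //   "integrity": "sha512-<base64 digest of the .tgz bytes>"
            function ssri(tgzPath: string): string {
              const bytes = readFileSync(tgzPath);
              return "sha512-" + createHash("sha512").update(bytes).digest("base64");
            }

            // Recompressing the same files yields a different gzip stream,
            // hence a different digest, even though the unpacked output matches.
            console.log(ssri("typescript-5.7.3.tgz"));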
        
           | notpushkin wrote:
           | But it still can be done on the npm side, right?
        
           | bhouston wrote:
            | Hashes of the uncompressed tarballs would be great. Then
            | the HTTP connection can negotiate a compression format for
            | transfer (which can change over time as HTTP itself evolves)
            | rather than baking it into the NPM package standard (which is
            | incredibly inflexible).
        
       | tonymet wrote:
        | I'm more concerned about the 40 GB of node_modules. Why hasn't
        | Node supported a tgz'd node_modules? That would save 75% of the
        | space or more.
        
         | bhouston wrote:
         | isn't this a file system thing? Why bake it into npm?
        
           | tonymet wrote:
           | efficiency
        
       | nopurpose wrote:
       | It reminds me of an effort to improve docker image format and
       | make it move away from being just a tar file. I can't find links
       | anymore, but it was a pretty clever design, which still couldn't
       | beat dumb tar in efficiency.
        
         | bhouston wrote:
          | Transferring around a dumb tar is actually smart because the
          | HTTPS connection can negotiate a compressed version of it to
          | transfer - e.g. gzip, brotli, etc. No need to bake an
          | unchangeable compression format into the standard.
        
       | woadwarrior01 wrote:
       | From the RFC on github[1].
       | 
       | > Zopfli is written in C, which presents challenges. Unless it
       | was added to Node core, the CLI would need to (1) rewrite Zopfli
       | in JS, possibly impacting performance (2) rely on a native
       | module, impacting reliability (3) rely on a WebAssembly module.
       | All of these options add complexity.
       | 
       | Wow! Who's going to tell them that V8 is written in C++? :)
       | 
       | [1]: https://github.com/npm/rfcs/pull/595
        
         | kmacdough wrote:
          | It's not about C per se so much as that each natively
          | compiled dependency creates additional maintenance concerns.
          | Changes to hardware/OS can require a recompile or even fixes.
          | The npm build system already requires a JavaScript runtime,
          | so that is already handled as part of existing maintenance.
          | The point is that Zopfli either needs to be rewritten for a
          | platform-agnostic abstraction they already support, or it
          | gets added to the list of native modules to maintain.
        
           | woadwarrior01 wrote:
           | > It's not about C per-se, as much as each native compiled
           | dependency creates additional maintenance concerns. Changes
           | to hardware/OS can require a recompile or even fixes.
           | 
           | This is a canard. zopfli is written in portable C and is far
            | more portable than the nodejs runtime. On any hardware/OS
            | combo where one can build the nodejs runtime, one can
            | certainly also build and run zopfli.
        
       | gweinberg wrote:
       | I was under the impression that bzip compresses more than gzip,
       | but gzip is much faster, so gzip is better for things that need
        | to be compressed on the fly, and bzip is better for things that
       | get archived. Is this not true? Wouldn't it have been better to
       | use bzip all along for this purpose?
        
       | bangaladore wrote:
       | I think the main TLDR here [1]:
       | 
       | > For example, I tried recompressing the latest version of the
       | typescript package. GNU tar was able to completely compress the
       | archive in about 1.2 seconds on my machine. Zopfli, with just 1
       | iteration, took 2.5 minutes.
       | 
       | [1] https://github.com/npm/rfcs/pull/595#issuecomment-1200480148
       | 
       | My question of course would be, what about LZ4, or Zstd or
       | Brotli? Or is backwards compatibility strictly necessary? I
       | understand that GZIP is still a good compressor, so those others
       | may not produce meaningful gains. But, as the author suggests,
       | even small gains can produce huge results in bandwidth reduction.
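
        For what it's worth, Brotli already ships in Node's standard
        zlib module, so a quick local size comparison is easy (the
        tarball name is just an example); whether the registry could
        ever serve it is a separate question.

          import { readFileSync } from "node:fs";
          import { brotliCompressSync, constants, gzipSync } from "node:zlib";

          const tar = readFileSync("typescript.tar"); // a decompressed package
          const gz = gzipSync(tar, { level: 9 });
          const br = brotliCompressSync(tar, {
            params: { [constants.BROTLI_PARAM_QUALITY]: 11 },
          });
          console.log("gzip -9:", gz.length, "brotli -q11:", br.length);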
        
       | loeg wrote:
       | It probably makes more sense to save more bytes and compressor
       | time and just switch to zstd (a bigger scoped effort, sure).
        
       | imoverclocked wrote:
        | > But the cons were substantial: ...
        | > This wouldn't retroactively apply to existing packages.
       | 
       | Why is this substantial? My understanding is that packages
       | shouldn't be touched once published. It seems likely for any
       | change to not apply retroactively.
        
       | szundi wrote:
       | This is a nice guy
        
       | omoikane wrote:
       | > Integrating Zopfli into the npm CLI would be difficult.
       | 
       | Is it possible to modify "gzip -9" or zlib to invoke zopfli? This
       | way everyone who wants to compress better will get the extra
       | compression automatically, in addition to npm.
       | 
       | There will be an increase in compression time, but since "gzip
       | -9" is not the default, people preferring compression speed might
       | not be affected.
        
         | bombcar wrote:
         | You'd have more problems here, but you could do it - if you let
          | it take ages and ages to percolate through all environments.
         | 
         | It's been almost 30 years since bzip2 was released and even now
         | not everything can handle tar.bz2
        
           | arccy wrote:
           | probably because bzip2 isn't a very good format
        
       | tehjoker wrote:
       | 5% improvement is basically the minimum I usually consider
       | worthwhile to pursue, but it's still small. Once you get to 10%
       | or 20%, things become much more attractive. I can see how people
       | can go either way on a 5% increase if there are any negative
       | consequences (such as increased build time).
        
       | Alifatisk wrote:
        | I wonder if there is a way to install npm packages without the
        | crap they come bundled with (like docs, tests, readme, etc.).
        
       | rafaelmn wrote:
       | I wonder if you could get better results if you built a
        | dictionary over the entire npm registry. I suspect the most
        | common words could easily be reduced to a 16k-word index. It
        | would be much faster, the dictionary would probably fit in
        | cache, and you could even optimize its in-memory layout for
        | cache prefetch.
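
        zlib's DEFLATE already supports preset dictionaries, so a toy
        version of this idea can be sketched with Node built-ins (the
        dictionary contents here are obviously made up; a trained zstd
        dictionary over real npm data would do much better).

          import { deflateSync, inflateSync } from "node:zlib";

          // A shared dictionary of strings that recur across most packages.
          const dictionary = Buffer.from(
            'module.exportsexport defaultfunctionreturnconst require(' +
            '"dependencies":"devDependencies":"version":"license":"MIT"'
          );

          const source = Buffer.from(
            'module.exports = function add(a, b) { return a + b; };'
          );

          const plain = deflateSync(source);
          const withDict = deflateSync(source, { dictionary });
          console.log("no dict:", plain.length, "with dict:", withDict.length);

          // Both sides must agree on the exact dictionary to decompress.
          console.log(inflateSync(withDict, { dictionary }).toString());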
        
         | zahlman wrote:
         | This seems like a non-starter to me - new packages are added to
         | npm all the time, and will alter the word frequency
         | distribution. If you aren't prepared to re-build constantly and
         | accept that the dictionary isn't optimal, then it's hard to
         | imagine it being significantly better than what you build with
         | a more naive approach. Basically - why try to fine-tune to a
         | moving target?
        
           | arccy wrote:
           | would it really change that quickly? you might get
           | significant savings from just having keywords, common
           | variable names, standard library functions
        
       | hinkley wrote:
       | Pulling on this thread, there are a few people who have looked at
       | the ways zopfli is inefficient. Including this guy who forked it,
       | and tried to contribute a couple improvements back to master:
       | 
       | https://github.com/fhanau/Efficient-Compression-Tool
       | 
       | These days if you're going to iterate on a solution you'd better
       | make it multithreaded. We have laptops where sequential code uses
        | 8% of the available CPU.
        
       | frabjoused wrote:
       | This reminds me of a time I lost an argument with John-David
       | Dalton about cleaning up/minifying lodash as an npm dependency,
       | because when including the readme and license for every sub-
       | library, a lodash import came to ~2.5MB at the time. This also
       | took a lot of seeking time for disks because there were so many
       | individual files.
       | 
       | The conversation started and ended at the word cache.
        
       | zahlman wrote:
        | I'd love to see an effort like this succeed in the Python
       | ecosystem. Right now, PyPI is dependent upon Fastly to serve
       | files, on the order of >1 petabyte per day. That's a truly
       | massive in-kind donation, compared to the PSF's operating budget
       | (only a few million dollars per year - far smaller than Linux or
       | Mozilla).
        
         | cozzyd wrote:
         | No problem, I'm sure if Fastly stopped doing it JiaTanCo would
         | step up
        
       | JoeAltmaier wrote:
       | These days technology moves so fast it's hard to keep up. The
       | slowest link in the system is the human being.
       | 
        | That's a strong argument for "if it isn't broke, don't fix it."
        | 
        | Lots of numbers are being thrown around; add up tiny things
        | enough times and you can get a big number. But is npm package
        | download the thing that's tanking the internet? No? Then this is
        | a second- or third-order optimization.
        
       | luzifer42 wrote:
        | I once created a Maven plugin to recompress Java artefacts with
        | zopfli. I rewrote it in Java so it runs entirely in the JVM.
        | This means the speed is worse and it may contain bugs:
       | 
       | https://luccappellaro.github.io/2015/03/01/ZopfliMaven.html
        
       ___________________________________________________________________
       (page generated 2025-01-27 23:00 UTC)