[HN Gopher] I've compared nearly all Rust crates.io crates to co...
       ___________________________________________________________________
        
       I've compared nearly all Rust crates.io crates to contents of their
       Git repos
        
       Author : robin_reala
       Score  : 117 points
       Date   : 2024-06-16 17:07 UTC (5 hours ago)
        
 (HTM) web link (mastodon.social)
 (TXT) w3m dump (mastodon.social)
        
       | jraph wrote:
       | Good initiative. Now people need to go through this and do the
       | reviews :-)
       | 
       | Next step would be to do reproducible builds (if it's not already
       | the case).
        
         | SubiculumCode wrote:
         | First pass gpt?
        
           | SubiculumCode wrote:
           | Heavily downvoted, which is fair because I didn't really
           | explain what I meant, which was: Would using LLMs to parse
           | the generated diffs, as a first pass, be useful/efficient for
           | spotting and interpreting discrepancies?
        
             | arccy wrote:
             | when your goal is to improve security, the unreliability
             | that comes with LLMs is not the answer.
        
               | chipdart wrote:
               | I don't think this is a relevant take. Your goal is to
               | implement a system to automatically scan countless
               | packages and run a heuristic to determine if a package is
               | suspicious or not. You're complaining about false
               | positives/false negatives while ignoring that not checking
               | packages at all is not an improvement.
        
             | pornel wrote:
             | It could work for classifying honest/innocent differences.
             | 
             | However, LLMs are incredibly naive, so they could be easily
             | fooled by a malicious actor (probably as easy as adding a
             | comment that this is definitely NOT a backdoor).
        
         | jonahx wrote:
         | Not a Rust dev so maybe a dumb question, but is this more
         | involved than just running diffs? If so, what needs to be done?
        
           | johannes1234321 wrote:
           | The key thing is interpretation of the diff. Is there a
           | difference because they ran some code generator and the crate
           | contains generated code not present in the repo, or did they
           | add a backdoor?
        
           | thayne wrote:
           | Most of the diffs are probably innocuous. I suspect the most
           | common diff would be the version line of Cargo.toml, both
           | from CI that automatically updates that line, and people who
           | forgot to update it before making a tag in git.
        
             | mmastrac wrote:
             | As someone with a crate that's in the 50MM plus range, this
             | happens all the time. I really should automate this via a
             | GH action.
        
               | OptionOfT wrote:
               | Interested to see the crate, and maybe I can help?
        
       | mberning wrote:
       | How could you rank them for review priority? Use a combination of
       | repo popularity multiplied by amount of significant differences?
       | Where significant differences are determined by excluding non-
       | code files?
        
         | swiftcoder wrote:
         | Looking through the code, it already ignores the majority of
         | non-code files
        
         | pornel wrote:
         | I'd use popularity (how many people are using the crate
         | indirectly) divided by trust level in the publisher of the
         | crate.
         | 
         | However, publishing a list that basically says "these are the
         | least trustworthy Rust users" would cause quite a stir, so I'm
         | not doing that.
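
The ranking idea in this sub-thread (popularity divided by publisher trust) can be sketched roughly as below. This is only an illustration: the `CrateStats` fields, the trust scale of (0, 1], and the sample crates are all assumptions, not data or code from the project being discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrateStats:
    name: str
    dependents: int          # hypothetical: crates using this one, directly or indirectly
    publisher_trust: float   # hypothetical: trust score in (0, 1]; higher means more trusted


def review_priority(crate: CrateStats) -> float:
    """Popularity divided by publisher trust: higher means review sooner."""
    if not 0 < crate.publisher_trust <= 1:
        raise ValueError(f"trust score out of range for {crate.name}")
    return crate.dependents / crate.publisher_trust


def rank_for_review(crates: list[CrateStats]) -> list[str]:
    """Return crate names, highest review priority first."""
    ordered = sorted(crates, key=review_priority, reverse=True)
    return [crate.name for crate in ordered]


# Made-up sample data, purely for illustration.
crates = [
    CrateStats("widely-used-trusted", dependents=10_000, publisher_trust=1.0),
    CrateStats("widely-used-unknown", dependents=10_000, publisher_trust=0.5),
    CrateStats("niche-unknown", dependents=50, publisher_trust=0.5),
]
print(rank_for_review(crates))
```

How a trust score would actually be assigned is the hard (and contentious) part, and the sketch deliberately leaves it open.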
        
       | grahar64 wrote:
       | Deterministic compilation is the best way to let people validate
       | what they are downloading from repositories is what is in the
       | codebases.
        
         | xpe wrote:
         | > Deterministic compilation is the best way to let people
         | validate what they are downloading from repositories is what is
         | in the codebases.
         | 
         | "the best way"? Please make the argument for why. To do it
         | properly, you must steel-man the alternatives (not shoot down
         | straw-men)
        
           | miki123211 wrote:
           | Because deterministic compilation lets you (or someone else)
           | do this automatically.
           | 
           | If you introduce a backdoor into the compilation step, you
           | run a much greater risk of detection. As long as there are
           | multiple machines compiling packages and verifying whether
           | the checksums match, any single backdoored machine will
           | immediately be caught. This is much more important for
           | package managers that do their own builds and ship their own
           | binaries than for those who just ship whatever they got from
           | the developer.
           | 
           | Without deterministic compilation, two builds of the exact
           | same code might differ. This makes backdoors very hard to
           | detect unless you have prior suspicion that one is present in
           | a particular program.
           | 
           | Deterministic compilation forces people to embed backdoors
           | directly in the source code repository, which creates an
           | audit trail, is very visible in diffs, much easier to catch
           | in reviews and so on. You can still get away with it (see the
           | XZ situation), but it requires far more work.
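
The cross-checking described above (several machines build the same source and compare checksums) might look something like this minimal sketch. It assumes the build is deterministic and that each builder's output is available as bytes; the function name and the builder names are hypothetical.

```python
import hashlib
from collections import Counter


def divergent_builders(artifacts: dict[str, bytes]) -> list[str]:
    """Return builders whose artifact digest differs from the most common one.

    Only meaningful when the build is deterministic; otherwise every
    builder could legitimately produce a different digest.
    """
    if not artifacts:
        return []
    digests = {
        name: hashlib.sha256(data).hexdigest() for name, data in artifacts.items()
    }
    # On a tie, Counter returns the digest that was seen first.
    consensus, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, digest in digests.items() if digest != consensus)


# Made-up example: one of three builders produces a different artifact.
builds = {
    "builder-a": b"\x7fELF...identical output",
    "builder-b": b"\x7fELF...identical output",
    "builder-c": b"\x7fELF...output with something extra",
}
print(divergent_builders(builds))
```

A majority vote like this only flags a minority of compromised builders; it says nothing if the source itself carries the backdoor.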
        
         | progval wrote:
         | crates.io does not host compiled artifacts. If packages on
         | crates.io differ from their Git repository it's because of a
         | custom pre-build step of that particular package, so a
         | deterministic compilation toolchain won't help here.
        
           | estebank wrote:
           | There are other possible reasons for them not matching:
           | 
           | - files not being tracked in the repo
           | 
           | - files that are part of the repo not being part of the
           | published crate
           | 
           | - publishing with --allow-dirty from a local copy of the repo
           | with changes that haven't been committed
           | 
           | - publishing from a commit that hasn't been pushed
           | 
           | I'm sure there are more.
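
The causes listed above show up as different kinds of mismatch, which a first-pass comparison could separate before anyone reads a diff. The sketch below is an assumption about how one might do that, not the method used by the project under discussion; both trees are modelled as plain `path -> bytes` mappings.

```python
def classify_differences(
    crate_files: dict[str, bytes], repo_files: dict[str, bytes]
) -> dict[str, list[str]]:
    """Bucket mismatches between a published crate and its repository checkout."""
    crate_paths = set(crate_files)
    repo_paths = set(repo_files)
    return {
        # e.g. generated or untracked files that were packaged anyway
        "only_in_crate": sorted(crate_paths - repo_paths),
        # e.g. files excluded from the published crate
        "only_in_repo": sorted(repo_paths - crate_paths),
        # e.g. uncommitted or unpushed changes at publish time
        "modified": sorted(
            path
            for path in crate_paths & repo_paths
            if crate_files[path] != repo_files[path]
        ),
    }


# Made-up example trees.
crate = {"src/lib.rs": b"fn a() {}\n", "src/generated.rs": b"// generated\n"}
repo = {"src/lib.rs": b"fn a() { }\n", "tests/big_fixture.bin": b"\x00\x01"}
print(classify_differences(crate, repo))
```

A real comparison would also need to account for files that cargo generates when packaging, such as `.cargo_vcs_info.json` and `Cargo.toml.orig`, and for the rewritten `Cargo.toml`.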
        
       | corytheboyd wrote:
       | How crazy would it be to have a package repository that also
       | builds the artifacts it distributes? You'd need a high barrier to
       | entry to save on costs and time sifting through garbage. Perhaps
       | it's this high barrier that would prevent such a repository from
       | taking off though. Perhaps this is just a really dumb step on a
       | path leading back to simple checksum validations... though with
       | those, you're only validating that whatever was uploaded is what
       | you downloaded, it doesn't ensure that it was built from a known
       | set of source files... hard problems.
        
         | kichimi wrote:
         | Isn't this gentoo?
        
           | GauntletWizard wrote:
           | No, Gentoo does something far from it - it builds everything
           | on the host machine every time, more or less.
        
         | KolmogorovComp wrote:
         | crates.io already builds the artefacts.
         | 
         | But the source code that is sent to crates.io is not
         | necessarily the same as the one in the public repo linked to
         | the crate.
        
           | progval wrote:
           | Do you have a source for crates.io building artefacts? I have
           | a couple of crates on it and never saw any sign it tried to
           | compile them, even when they were broken.
        
           | corytheboyd wrote:
           | Ah yeah, I suppose that's what I really mean, a means of
           | verifying builds link to source that is publicly available.
           | Sounds like the source repository has to be in on it too
        
           | kibwen wrote:
           | It's possible that crates.io might attempt to build a crate
           | when published as a sort of sanity check (I don't know if
           | this is true, but it's certainly feasible), but it doesn't
           | distribute binaries, it distributes source code.
        
         | miki123211 wrote:
         | Distro repositories (like the one you have on Debian / Ubuntu /
         | Red Hat etc.) do this.
         | 
         | They work on a different model, where only packages that are
         | deemed "worthy" are included, and there's a small-ish set of
         | maintainers that are authorized to make changes and/or accept
         | change requests from the community. In contrast, programming
         | language package managers like cargo, pip or npm let anybody
         | upload new packages with little to no prior verification, and
         | place the responsibility of maintaining them solely on their
         | author.
         | 
         | The distribution way of doing things is sometimes necessary, as
         | different distributions have different policies on what they
         | allow in their repositories, might want to change compilation
         | options or installation paths, backport bug and security fixes
         | from newer project versions for compatibility, or even
         | introduce small code changes to make the program work better
         | (or work at all) on that system.
         | 
         | One example of such a repository, for the Alpine Linux
         | distribution, is at https://github.com/alpinelinux/aports
        
         | thayne wrote:
         | Go kind of solves that by making the Git repo the source of
         | truth for a package, and hosting a cache for it.
         | 
         | The problem with it is that you need the full Git URL in every
         | file where you import it, which is a pain if the repo changes
         | locations, or you want to use a fork or a local version.
         | Versioning is also tricky, to the point that Go recommends
         | creating a separate branch for a major/breaking version, which
         | requires updating every import statement.
         | 
         | I think a good middle ground would be to have a central
         | repository and/or package configuration file that maps package
         | names to git repos and versions to commits (possibly via tags).
         | And of course use hashes to lock the version to specific
         | contents.
         | 
         | Bazel kind of does this, but it doesn't have any built in
         | version resolution or transitive dependency resolution
         | (although in some cases there are other tools that help). And
         | it can add a lot of complexity that you may not need.
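
The "middle ground" proposed above (a registry that maps package names to Git repos and versions to commits, with hashes locking the contents) could be modelled along these lines. Every name here (`Package`, `Release`, `resolve`, the example URL and commit) is hypothetical; it is a sketch of the idea, not an existing tool's format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    commit: str   # commit (or tag target) that the version maps to
    sha256: str   # hash of the released contents, used to lock the version


@dataclass(frozen=True)
class Package:
    repo: str                      # Git URL the package name maps to
    versions: dict[str, Release]   # version string -> pinned release


def content_hash(contents: bytes) -> str:
    return hashlib.sha256(contents).hexdigest()


def resolve(
    registry: dict[str, Package], name: str, version: str, fetched: bytes
) -> tuple[str, str]:
    """Return (repo, commit) if the fetched contents match the locked hash."""
    package = registry[name]
    release = package.versions[version]
    if content_hash(fetched) != release.sha256:
        raise ValueError(f"{name} {version}: contents do not match the locked hash")
    return package.repo, release.commit


# Made-up registry entry for illustration.
released = b"pub fn answer() -> u32 { 42 }\n"
registry = {
    "example": Package(
        repo="https://example.invalid/example.git",
        versions={"1.2.3": Release(commit="0" * 40, sha256=content_hash(released))},
    )
}
print(resolve(registry, "example", "1.2.3", released))
```

Because imports would refer to the package name rather than the URL, moving the repo or swapping in a fork would only mean editing the registry entry.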
        
           | dgoldstein0 wrote:
           | bazel has modules now: https://bazel.build/external/module
           | 
           | Not tried them but they look like a reasonable dep handling
           | solution on paper - each module can declare its own
           | dependencies and bazel will figure it out for you like a
           | package manager. Their old workspaces way of doing it was a
           | nightmare, as while patterns emerged where repos would export
           | a function to register their dependencies, the first
           | declaration of any name would win and thus you weren't
           | guaranteed to have a compatible set of workspaces at the end.
        
         | c0balt wrote:
         | That's what nixpkgs does for Nix/NixOS. The package set is
         | continuously built by a CI system and made publicly available:
         | https://github.com/NixOS/nixpkgs#continuous-integration-and-...
        
       | makeworld wrote:
       | This is why I like what Go does, where you're downloading from
       | Git directly (optionally proxied through Google, yes)
        
         | jesprenj wrote:
         | Related: Proxying can be disabled by setting the environment
         | variable GOPROXY=direct [0]. I put it in my bashrc.
         | 
         | [0] https://www.practical-go-lessons.com/chap-18-go-module-proxi...
        
         | kibwen wrote:
         | I'm not sure what is meant by "downloading from Git", I assume
         | you mean downloading from GitHub. And GitHub is far less secure
         | than what crates.io does, because crates.io is immutable (once
         | published, uploaders can't change anything without opening a
         | support ticket which will get rejected if they don't have a
         | good reason), whereas GitHub history is trivially rewritable.
         | This means that if you rely on "v1.2.3" of a library from
         | crates.io, that's always going to give everyone the same code;
         | conversely, relying on a git tag of "v1.2.3" from a random
         | GitHub repo could be anything at any point.
        
           | fsmv wrote:
           | The goproxy server makes it somewhat immutable for go too.
           | Once they have a version cached they will never delete it.
           | You can only supersede it with a new version and mark the old
           | version as bad.
        
             | estebank wrote:
             | TBH, "somewhat immutable" is not "immutable". The Go
             | approach helps limit the effects of an attempted attack
             | where you're misled into building your project with
             | different code than originally intended, but does nothing
             | to guarantee continuity over time of dependencies being
             | available. For that you have to rely on vendoring.
             | 
             | The crates.io approach instead defends you both from a
             | dependency changing silently _and_ from it disappearing
             | from one day to the next, without having to deal with
             | vendoring.
        
               | agwa wrote:
               | As kibwen noted above, crates.io can be changed if
               | there's a "good reason" so it's not truly immutable
               | either.
               | 
               | Go does not delete modules from the module proxy except
               | for copyright/legal reasons (I suspect crates.io would
               | delete for these reasons too), and additionally the
               | checksums of all modules are published in a tamper-proof
               | transparency log (https://sum.golang.org/) so if they did
               | alter or delete a module, it would be detected.
        
           | sosodev wrote:
           | Go modules can be hosted in any Git repository. The Go
           | toolchain also keeps hashes of the selected tag so if you've
           | reviewed it once it will never change without you explicitly
           | giving it the ok.
        
             | moffkalast wrote:
             | > giving it the ok
             | 
             | You mean giving it the Go ahead? ;)
        
             | Yasuraka wrote:
             | Also any svn/hg repository afaik
        
           | leoh wrote:
           | That's true, except that the lockfile records revision as a
           | commit sha.
           | 
           | https://github.com/apex/up-examples/blob/master/oss/golang-g...
        
           | aseipp wrote:
           | > conversely, relying on a git tag of "v1.2.3" from a random
           | GitHub repo could be anything at any point.
           | 
           | I don't know of a single modern build tool that can do this
           | but doesn't require or record this information specifically?
           | Maybe the earlier versions of Go? (I know they've gone
           | through a few changes in module/import strategies.)
        
           | cgh wrote:
           | It's interesting you even have to point this out. Maven
           | solved this and other problems literally decades ago but the
           | repository packaging wheel keeps getting reinvented. For
           | example, here's the page on Maven Central's immutability
           | policy:
           | 
           | https://central.sonatype.org/publish/requirements/immutabili...
        
         | timeon wrote:
         | You can do that with Rust as well if you point a dependency
         | at a Git repo (or a local directory).
        
           | estebank wrote:
           | Be aware that you cannot publish on crates.io if you do that:
           | either you buy into the system (so that you can ensure that
           | you can rebuild in perpetuity) or not at all (so you end up
           | with a crate that can depend partially on crates.io, but
           | must always be consumed directly from a repo or directory).
        
         | agwa wrote:
         | The problem is that if you clone the Git repository, or view it
         | on GitHub, you have no assurance that you're seeing the same
         | code that the go command or the Go module proxy saw. The author
         | of a malicious module could change the Git tag to point to a
         | different, benign, commit after the Go module proxy stores the
         | malicious copy. There are other tricks an attacker can play as
         | well: https://github.com/golang/go/issues/66653
         | 
         | Ultimately, if you're doing a code audit, you have to compute
         | the checksum of the code that you're looking at, and compare it
         | against the entry in go.sum or the checksum database to make
         | sure you're auditing the right copy.
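
The audit step described above (hash the copy you are reading and compare it with the recorded checksum) can be sketched as follows. The hash is modelled on Go's `h1:` directory hash (a SHA-256 over a sorted list of per-file SHA-256 lines, base64-encoded), but treat it as an approximation for illustration: it is not a replacement for the go command's own verification, and the file names and contents below are made up.

```python
import base64
import hashlib


def tree_hash(files: dict[str, bytes]) -> str:
    """Hash a file tree in the style of Go's "h1:" directory hash."""
    summary = hashlib.sha256()
    for path in sorted(files):
        if "\n" in path:
            raise ValueError("file names containing newlines are not supported")
        file_digest = hashlib.sha256(files[path]).hexdigest()
        # One summary line per file: hex digest, two spaces, file name.
        summary.update(f"{file_digest}  {path}\n".encode())
    return "h1:" + base64.b64encode(summary.digest()).decode()


def is_audited_copy(files: dict[str, bytes], recorded: str) -> bool:
    """True if the tree being reviewed matches the recorded checksum."""
    return tree_hash(files) == recorded


# Made-up module contents; in go.sum the names carry a module@version/ prefix.
published = {
    "example.com/mod@v1.2.3/go.mod": b"module example.com/mod\n",
    "example.com/mod@v1.2.3/mod.go": b"package mod\n",
}
recorded = tree_hash(published)

retagged = dict(published)
retagged["example.com/mod@v1.2.3/mod.go"] = b"package mod // benign-looking rewrite\n"

print(is_audited_copy(published, recorded), is_audited_copy(retagged, recorded))
```

The point of the comparison is that a retagged repository produces a different hash, so a reviewer notices they are not looking at the copy the proxy stored.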
        
       | gigatexal wrote:
       | Has anyone analyzed the data?
        
       ___________________________________________________________________
       (page generated 2024-06-16 23:01 UTC)