[HN Gopher] I've compared nearly all Rust crates.io crates to co...
___________________________________________________________________
I've compared nearly all Rust crates.io crates to contents of their
Git repos
Author : robin_reala
Score : 117 points
Date : 2024-06-16 17:07 UTC (5 hours ago)
(HTM) web link (mastodon.social)
(TXT) w3m dump (mastodon.social)
| jraph wrote:
| Good initiative. Now people need to go through this and do the
| reviews :-)
|
| Next step would be to do reproducible builds (if it's not already
| the case).
| SubiculumCode wrote:
| First pass gpt?
| SubiculumCode wrote:
| Heavily downvoted, which is fair because I didn't really
| explain what I meant, which was: would using LLMs to parse
| the generated diffs, as a first pass, be useful/efficient for
| spotting and interpreting discrepancies?
| arccy wrote:
| when your goal is to improve security, the unreliability
| that comes with LLMs is not the answer.
| chipdart wrote:
| I don't think this is a relevant take. Your goal is to
| implement a system to automatically scan countless
| packages and run a heuristic to determine if a package is
| suspicious or not. You're complaining about false
| positives/false negatives while ignoring that not checking
| packages at all is not an improvement.
| pornel wrote:
| It could work for classifying honest/innocent differences.
|
| However, LLMs are incredibly naive, so they could be easily
| fooled by a malicious actor (probably as easy as adding a
| comment that this is definitely NOT a backdoor).
| jonahx wrote:
| Not a Rust dev so maybe a dumb question, but is this more
| involved than just running diffs? If so, what needs to be done?
| johannes1234321 wrote:
| The key thing is interpretation of the diff. Is there a
| difference because they ran some code generator, so the crate
| contains generated code not present in the repo, or did they
| add a backdoor?
| thayne wrote:
| Most of the diffs are probably innocuous. I suspect the most
| common diff would be the version line of Cargo.toml, both
| from CI that automatically updates that line, and people who
| forgot to update it before making a tag in git.
| mmastrac wrote:
| As someone with a crate that's in the 50MM plus range, this
| happens all the time. I really should automate this via a
| GH action.
| OptionOfT wrote:
| Interested to see the crate, and maybe I can help?
| mberning wrote:
| How could you rank them for review priority? Use a combination of
| repo popularity multiplied by amount of significant differences?
| Where significant differences are determined by excluding non-
| code files?
| swiftcoder wrote:
| Looking through the code, it already ignores the majority of
| non-code files
| pornel wrote:
| I'd use popularity (how many people are using the crate
| indirectly) divided by trust level in the publisher of the
| crate.
|
| However, publishing a list that basically says "these are the
| least trustworthy Rust users" would cause quite a stir, so I'm
| not doing that.
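|
| The ranking idea in this subthread (popularity divided by
| publisher trust, counting only code differences) could be
| sketched roughly as below. This is a minimal illustration: the
| field names, the trust scale, and the scoring rule are
| assumptions, not taken from crates.io or from the tool under
| discussion.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CrateReport:
    name: str
    dependents: int          # hypothetical popularity measure (direct + indirect users)
    publisher_trust: float   # hypothetical score in (0, 1]; higher means more trusted
    code_lines_changed: int  # diff size after excluding non-code files


def review_priority(report: CrateReport) -> float:
    """Popularity divided by trust; crates with no code differences score 0."""
    if not 0 < report.publisher_trust <= 1:
        raise ValueError("publisher_trust must be in (0, 1]")
    if report.code_lines_changed == 0:
        return 0.0
    return report.dependents / report.publisher_trust


def rank_for_review(reports: Iterable[CrateReport]) -> List[CrateReport]:
    """Return reports sorted so the highest-priority crates come first."""
    return sorted(reports, key=review_priority, reverse=True)
```

| With this rule a moderately popular crate from a low-trust
| publisher can outrank a very popular crate from a well-known one.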
| grahar64 wrote:
| Deterministic compilation is the best way to let people validate
| what they are downloading from repositories is what is in the
| codebases.
| xpe wrote:
| > Deterministic compilation is the best way to let people
| validate what they are downloading from repositories is what is
| in the codebases.
|
| "the best way"? Please make the argument for why. To do it
| properly, you must steel-man the alternatives (not shoot down
| straw-men)
| miki123211 wrote:
| Because deterministic compilation lets you (or someone else)
| do this automatically.
|
| If you introduce a backdoor into the compilation step, you
| run a much greater risk of detection. As long as there are
| multiple machines compiling packages and verifying whether
| the checksums match, any single backdoored machine will
| immediately be caught. This is much more important for
| package managers that do their own builds and ship their own
| binaries than for those who just ship whatever they got from
| the developer.
|
| Without deterministic compilation, two builds of the exact
| same code might differ. This makes backdoors very hard to
| detect unless you have prior suspicion that one is present in
| a particular program.
|
| Deterministic compilation forces people to embed backdoors
| directly in the source code repository, which creates an
| audit trail, is very visible in diffs, much easier to catch
| in reviews and so on. You can still get away with it (see the
| XZ situation), but it requires far more work.
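|
| The cross-checking described above (several machines build the
| same source and compare checksums) might look like this in
| outline. It is a sketch under assumptions: the builder names and
| the strict-majority rule are hypothetical, and it only makes
| sense when builds are deterministic.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def artifact_digest(artifact: bytes) -> str:
    """SHA-256 of a build artifact, as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def find_outliers(digests: Dict[str, str]) -> Tuple[Optional[str], List[str]]:
    """Return (majority digest, builders that disagree with it).

    If no digest is reported by a strict majority of builders, nothing can
    be trusted, so the result is (None, all builders).
    """
    if not digests:
        return None, []
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(digests) / 2:
        return None, sorted(digests)
    return digest, sorted(b for b, d in digests.items() if d != digest)
```

| Without deterministic compilation every builder would report a
| different digest and this check would flag all of them.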
| progval wrote:
| crates.io does not host compiled artifacts. If packages on
| crates.io differ from their Git repository it's because of a
| custom pre-build step of that particular package, so a
| deterministic compilation toolchain won't help here.
| estebank wrote:
| There are other possible reasons for them not matching:
|
| - files not being tracked in the repo
|
| - files being part of the repo not being part of the
| published crate
|
| - publishing with allow dirty from a local copy of the repo
| with changes that haven't been committed
|
| - publishing from a commit that hasn't been pushed
|
| I'm sure there are more.
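|
| A rough sketch of how the mismatch categories above could be
| separated when comparing an unpacked crate with a checkout of
| its repository. It assumes both are already on disk; the
| function names are hypothetical and this is not the method used
| by the tool under discussion. It relies on cargo saving the
| author's manifest as Cargo.toml.orig and adding
| .cargo_vcs_info.json when packaging.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_hashes(root: Path) -> Dict[str, str]:
    """Map each file's relative path to its SHA-256, skipping .git."""
    hashes = {}
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if path.is_file() and ".git" not in rel.parts:
            hashes[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def compare(crate_dir: Path, repo_dir: Path) -> Dict[str, List[str]]:
    crate = tree_hashes(crate_dir)
    repo = tree_hashes(repo_dir)
    # Compare the author's original manifest, not cargo's normalized copy.
    if "Cargo.toml.orig" in crate:
        crate["Cargo.toml"] = crate.pop("Cargo.toml.orig")
    # Metadata file added by cargo itself; never present in the repo.
    crate.pop(".cargo_vcs_info.json", None)
    return {
        # Untracked or generated files, or uncommitted/unpushed additions.
        "only_in_crate": sorted(crate.keys() - repo.keys()),
        # Files excluded from the published crate; usually harmless.
        "only_in_repo": sorted(repo.keys() - crate.keys()),
        # Same path, different bytes: dirty publish, unpushed commit, or worse.
        "modified": sorted(p for p in crate.keys() & repo.keys() if crate[p] != repo[p]),
    }
```

| The script only sorts differences into buckets; deciding whether
| an entry is innocent still needs a human reviewer.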
| corytheboyd wrote:
| How crazy would it be to have a package repository that also
| builds the artifacts it distributes? You'd need a high barrier to
| entry to save on costs and time sifting through garbage. Perhaps
| it's this high barrier that would prevent such a repository from
| taking off though. Perhaps this is just a really dumb step on a
| path leading back to simple checksum validations... though with
| those, you're only validating that whatever was uploaded is what
| you downloaded, it doesn't ensure that it was built from a known
| set of source files... hard problems.
| kichimi wrote:
| Isn't this gentoo?
| GauntletWizard wrote:
| No, Gentoo does something far from it - it builds everything
| on the host machine every time, more or less.
| KolmogorovComp wrote:
| crates.io already builds the artefacts.
|
| But the code-source that is sent to crates.io is not
| necessarily the same as the one in the public repo linked to
| the crate.
| progval wrote:
| Do you have a source for crates.io building artefacts? I have
| a couple of crates on it and never saw any sign it tried to
| compile them, even when they were broken.
| corytheboyd wrote:
| Ah yeah, I suppose that's what I really mean, a means of
| verifying builds link to source that is publicly available.
| Sounds like the source repository has to be in on it too
| kibwen wrote:
| It's possible that crates.io might attempt to build a crate
| when published as a sort of sanity check (I don't know if
| this is true, but it's certainly feasible), but it doesn't
| distribute binaries, it distributes source code.
| miki123211 wrote:
| Distro repositories (like the one you have on Debian / Ubuntu /
| Redhat etc) do this.
|
| They work on a different model, where only packages that are
| deemed "worthy" are included, and there's a small-ish set of
| maintainers that are authorized to make changes and/or accept
| change requests from the community. In contrast, programming
| language package managers like cargo, pip or npm let anybody
| upload new packages with little to no prior verification, and
| place the responsibility of maintaining them solely on their
| author.
|
| The distribution way of doing things is sometimes necessary, as
| different distributions have different policies on what they
| allow in their repositories, might want to change compilation
| options or installation paths, backport bug and security fixes
| from newer project versions for compatibility, or even
| introduce small code changes to make the program work better
| (or work at all) on that system.
|
| One example of such a repository, for the Alpine Linux
| distribution, is at https://github.com/alpinelinux/aports
| thayne wrote:
| go kind of solves that by making the git repo the source of
| truth for a package, and host a cache for it.
|
| The problem with it is you need the full git URL in every file
| that imports it, which is a pain if the repo changes locations,
| or you want to use a fork or a local version. Versioning is
| also tricky, to the point that go recommends creating a
| separate branch for a major/breaking version, which requires
| updating every import statement.
|
| I think a good middle ground would be to have a central
| repository and/or package configuration file that maps package
| names to git repos and versions to commits (possibly via tags).
| And of course use hashes to lock the version to specific
| contents.
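|
| As a rough illustration of that middle ground, an index could
| map (name, version) to a repo, a commit, and a content hash.
| Everything here (the Pin record, the index shape, the idea of
| hashing a source archive) is a hypothetical sketch, not an
| existing registry format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Pin:
    repo_url: str  # where the source lives
    commit: str    # exact commit the version maps to
    sha256: str    # hash of the fetched source archive


def make_pin(repo_url: str, commit: str, archive: bytes) -> Pin:
    """Record the content hash at the time a version is first resolved."""
    return Pin(repo_url, commit, hashlib.sha256(archive).hexdigest())


def verify(index: Dict[Tuple[str, str], Pin], name: str, version: str,
           archive: bytes) -> bool:
    """True only if this name/version is pinned and the bytes still match."""
    pin = index.get((name, version))
    if pin is None:
        return False
    return hashlib.sha256(archive).hexdigest() == pin.sha256
```

| Because the hash is checked on every fetch, moving a tag or
| relocating the repo cannot silently change what gets built.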
|
| Bazel kind of does this, but it doesn't have any built in
| version resolution or transitive dependency resolution
| (although in some cases there are other tools that help). And
| it can add a lot of complexity that you may not need.
| dgoldstein0 wrote:
| bazel has modules now: https://bazel.build/external/module
|
| Not tried them but they look like a reasonable dep handling
| solution on paper - each module can declare its own
| dependencies and bazel will figure it out for you like a
| package manager. Their old workspaces way of doing it was a
| nightmare, as while patterns emerged where repos would export
| a function to register their dependencies, the first
| declaration of any name would win and thus you weren't
| guaranteed to have a compatible set of workspaces at the end.
| c0balt wrote:
| That's what nixpkgs does for Nix/NixOS. The package set is
| continuously built by a CI system and made publicly available:
| https://github.com/NixOS/nixpkgs#continuous-integration-and-...
| makeworld wrote:
| This is why I like what Go does, where you're downloading from
| Git directly (optionally proxied through Google, yes)
| jesprenj wrote:
| Related: Proxying can be disabled by setting the environment
| variable GOPROXY=direct [0]. I put it in my bashrc.
|
| [0] https://www.practical-go-lessons.com/chap-18-go-module-proxi...
| kibwen wrote:
| I'm not sure what is meant by "downloading from Git", I assume
| you mean downloading from Github. And Github is far less secure
| than what crates.io does, because crates.io is immutable (once
| published, uploaders can't change anything without opening a
| support ticket which will get rejected if they don't have a
| good reason), whereas Github history is trivially rewriteable.
| This means that if you rely on "v1.2.3" of a library from
| crates.io, that's always going to give everyone the same code;
| conversely, relying on a git tag of "v1.2.3" from a random
| Github repo could be anything at any point.
| fsmv wrote:
| The goproxy server makes it somewhat immutable for go too.
| Once they have a version cached they will never delete it.
| You can only supersede it with a new version and mark the old
| version as bad.
| estebank wrote:
| TBH, "somewhat immutable" is not "immutable". The Go
| approach helps limit the effects of an attempted attack
| where you're misled into building your project with
| different code than originally intended, but does nothing
| to guarantee continuity over time of dependencies being
| available. For that you have to rely on vendoring.
|
| The crates.io approach instead defends you both from a
| dependency changing silently _and_ from it disappearing
| from one day to the next, without having to deal with
| vendoring.
| agwa wrote:
| As kibwen noted above, crates.io can be changed if
| there's a "good reason" so it's not truly immutable
| either.
|
| Go does not delete modules from the module proxy except
| for copyright/legal reasons (I suspect crates.io would
| delete for these reasons too), and additionally the
| checksums of all modules are published in a tamper-proof
| transparency log (https://sum.golang.org/) so if they did
| alter or delete a module, it would be detected.
| sosodev wrote:
| Go modules can be hosted in any Git repository. The Go
| toolchain also keeps hashes of the selected tag so if you've
| reviewed it once it will never change without you explicitly
| giving it the ok.
| moffkalast wrote:
| > giving it the ok
|
| You mean giving it the Go ahead? ;)
| Yasuraka wrote:
| Also any svn/hg repository afaik
| leoh wrote:
| That's true, except that the lockfile records revision as a
| commit sha.
|
| https://github.com/apex/up-examples/blob/master/oss/golang-g...
| aseipp wrote:
| > conversely, relying on a git tag of "v1.2.3" from a random
| Github repo could be anything at any point.
|
| I don't know of a single modern build tool that can do this
| but doesn't require or record this information specifically?
| Maybe the earlier versions of Go? (I know they've gone
| through a few changes in module/import strategies.)
| cgh wrote:
| It's interesting you even have to point this out. Maven
| solved this and other problems literally decades ago but the
| repository packaging wheel keeps getting reinvented. For
| example, here's the page on Maven Central's immutability
| policy:
|
| https://central.sonatype.org/publish/requirements/immutabili...
| timeon wrote:
| You can do that with Rust as well if you define path of
| dependency to git repo (or local dir).
| estebank wrote:
| Be aware that you cannot publish on crates.io if you do that:
| either you buy into the system (so that you can ensure that
| you can rebuild in perpetuity) or not at all (so you end up
| with a crate that can depend partially on crates.io, but
| must always be consumed directly from a repo or directory).
| agwa wrote:
| The problem is that if you clone the Git repository, or view it
| on GitHub, you have no assurance that you're seeing the same
| code that the go command or the Go module proxy saw. The author
| of a malicious module could change the Git tag to point to a
| different, benign, commit after the Go module proxy stores the
| malicious copy. There are other tricks an attacker can play as
| well: https://github.com/golang/go/issues/66653
|
| Ultimately, if you're doing a code audit, you have to compute
| the checksum of the code that you're looking at, and compare it
| against the entry in go.sum or the checksum database to make
| sure you're auditing the right copy.
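|
| A minimal sketch of that checksum comparison, following the
| "h1:" directory-hash scheme as I understand it from
| golang.org/x/mod/sumdb/dirhash: a SHA-256 over sorted lines of
| per-file SHA-256 sums. File names are assumed to already be in
| "module@version/path" form, and the helper names are
| hypothetical. Go hashes the module zip, which leaves out some
| files, so a raw checkout may not reproduce the recorded value;
| for a real audit, Go's own tooling is the safer reference.

```python
import base64
import hashlib
from typing import Dict, Optional


def hash1(files: Dict[str, bytes]) -> str:
    """Compute an h1-style hash over {name: content} entries."""
    summary = hashlib.sha256()
    for name in sorted(files):
        if "\n" in name:
            raise ValueError("file names with newlines are not supported")
        file_sum = hashlib.sha256(files[name]).hexdigest()
        # One line per file: hex digest, two spaces, name, newline.
        summary.update(f"{file_sum}  {name}\n".encode("utf-8"))
    return "h1:" + base64.b64encode(summary.digest()).decode("ascii")


def find_in_go_sum(go_sum_text: str, module: str, version: str) -> Optional[str]:
    """Return the recorded hash for module/version, or None if absent."""
    for line in go_sum_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == module and fields[1] == version:
            return fields[2]
    return None


def matches_go_sum(files: Dict[str, bytes], go_sum_text: str,
                   module: str, version: str) -> bool:
    """True if the audited files hash to the value recorded in go.sum."""
    recorded = find_in_go_sum(go_sum_text, module, version)
    return recorded is not None and recorded == hash1(files)
```

| If the check fails, the copy being audited is not the one the go
| command would build.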
| gigatexal wrote:
| Has anyone analyzed the data?
___________________________________________________________________
(page generated 2024-06-16 23:01 UTC)