[HN Gopher] Mozilla sccache: ccache with cloud storage
___________________________________________________________________
Mozilla sccache: ccache with cloud storage
Author : thunderbong
Score : 151 points
Date : 2023-12-22 10:02 UTC (2 days ago)
| Sytten wrote:
| Can you combine that with cross rs? I positively hate the
| official GHA cache action so anything to replace that would be
| nice. But we cross compile with cross rs.
| Scarjit wrote:
| It should work if you modify the cross docker image, so that it
| uses the sccache executable as wrapper
| sgammon wrote:
| Another alternative is Buildless with the Actions setup step.
| This sets up a remote and local endpoint for sccache inside
| actions, with a connection up to remote caching by Dragonfly
|
| https://github.com/buildless/setup
| fifteen1506 wrote:
| IDK what happened at Mozilla but keep going!
| CaptainOfCoit wrote:
| Worth noting that the first commit in the sccache git repository
| was in 2014
| (https://github.com/mozilla/sccache/commit/115016e0a83b290dc2...).
| So I suppose that what "happened" happened waay back.
|
| Then in 2016 it seems like sccache was re-implemented in Rust (
| https://github.com/mozilla/sccache/commit/3da89195ce91a576cc...
| ), from the initial Python implementation.
| altairprime wrote:
| Also around then, taskcluster happened.
| https://github.com/taskcluster/taskcluster/commit/54ffef79db...
| jandeboevrie wrote:
| Regular ccache supports remote storage as well:
| https://ccache.dev/manual/latest.html#_remote_storage_backen...
| CaptainOfCoit wrote:
| I guess the difference is that sccache supports cloud storage
| (S3, R2, Google Cloud Storage, et al.) out of the box while
| ccache doesn't, as far as I can tell.
| slavik81 wrote:
| The sccache description was written before remote storage was
| added to ccache. Remote storage is a relatively new feature
| for ccache, and it didn't exist when sccache was created.
| kinkaid wrote:
| The ability to use the local and remote caches in tandem is
| the most important feature to me. sccache will use one or the
| other exclusively, so successive builds will have to pay the
| latency and bandwidth costs every time an artifact is built
| rather than just unpacking from the local cache. ccache
| stages remote artifacts locally by default, so it only pays
| the network costs once per artifact. In CI builds they are
| more or less the same, but the local build experience for
| ccache is much nicer imo.
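|
| A minimal sketch of the "stage remote artifacts locally" behaviour
| described above. This is not ccache's or sccache's actual code; the
| class name, the dict-backed "remote", and the fetch counter are
| assumptions made purely for illustration.

```python
from typing import Dict, Optional


class TwoTierCache:
    """Local cache in front of a remote one; remote hits are staged locally."""

    def __init__(self, remote: Dict[str, bytes]) -> None:
        self.local: Dict[str, bytes] = {}
        self.remote = remote        # stand-in for a network-backed store
        self.remote_fetches = 0     # counts simulated network round trips

    def get(self, key: str) -> Optional[bytes]:
        # Cheap path: the artifact is already in the local cache.
        if key in self.local:
            return self.local[key]
        # Expensive path: ask the remote, then keep a local copy on a hit.
        self.remote_fetches += 1
        blob = self.remote.get(key)
        if blob is not None:
            self.local[key] = blob
        return blob

    def put(self, key: str, blob: bytes) -> None:
        # Write to both tiers so other machines can reuse the artifact.
        self.local[key] = blob
        self.remote[key] = blob


cache = TwoTierCache(remote={"abc123": b"object code"})
first = cache.get("abc123")    # goes to the remote and stages locally
second = cache.get("abc123")   # served from the local copy
```

| With this layout the network cost is paid once per artifact; a
| remote-only cache would pay it on every lookup.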
| sgammon wrote:
| both are supported by buildless as well https://less.build
| sgammon wrote:
| (ccache is only about caching, not distributing builds,
| also...)
| wongarsu wrote:
| sccache is also delightfully simple to set up if you just want
| local storage. It's my go-to solution for sharing build artifacts
| between rust projects
| IshKebab wrote:
| In my experience the time this saves is generally outweighed by
| the effort of setting it up and the many hours lost when it goes
| wrong and you don't think to try a clean build. It's certainly
| better than ccache in that regard but you really need something
| like Bazel if you're going to be aggressively caching C++ builds
| like this (or impure Rust builds).
|
| (And if you're using Bazel or one of its brethren then they
| generally have native remote caching and execution support.)
| goku12 wrote:
| Setting it up took hardly 5 minutes for me (for Rust). And it
| hasn't caused any issues so far. Forcing a fresh build of
| artifacts is also very easy.
|
| Meanwhile, the only legitimate problem you mentioned is if it
| causes a build error and we don't immediately consider it as
| the source of the issue. But I use the check command so often
| that it is easy to suspect the cache if check succeeds and
| build fails.
| IshKebab wrote:
| When these things go wrong it often doesn't cause a
| compilation error; it can just cause inexplicable runtime
| behaviour.
|
| You may be lucky and be building a pure or mostly pure Rust
| program, in which case it works pretty well. Throw in some
| C/C++ and it starts to degrade (though it's still better than
| with an actual C/C++ program because you aren't actually
| editing the C/C++ code generally).
|
| And are you actually using remote caching/compilation?
| Because there's no way you can set that up in 5 minutes.
| goku12 wrote:
| > And are you actually using remote caching/compilation?
| Because there's no way you can set that up in 5 minutes.
|
| Who said anything about remote caching? You don't need it
| to benefit from it. It's useful if you build a lot of Rust
| code. A lot of packages turn up repeatedly as dependencies
| among several projects.
|
| > You may be lucky and be building a pure or mostly pure
| Rust program, in which case it works pretty well. Throw in
| some C/C++ and it starts to degrade
|
| You're making assumptions again. What is the issue with
| pure Rust code? Rust isn't like Python needing C or C++
| support for process-intensive parts. And much of the C/C++
| dependencies are dynamically-linked, with Rust wrappers. I
| haven't seen many projects that require Rust and C/C++ code
| to be built together and statically linked.
|
| Besides, I haven't heard anyone complain about sccache that
| much. How prevalent is the degradation anyway?
| sgammon wrote:
| After using sccache and Gradle with Buildless for months,
| years, I have literally never seen these tools mix up or use
| the wrong binary objects.
|
| Knowing the internals of some of them, I've found that
| build cache clients are way more likely to miss with a
| cache key misalignment than they are to mix up two objects.
| I'm sure it's possible, I've just never seen it happen in
| the wild, after extensive usage.
|
| Generally speaking these tools are very conservative about
| two inputs matching: an identical file at a different path
| will cause a cache key change in sccache.
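|
| To illustrate why an identical file at a different path changes the
| key, here is a rough sketch of a conservative cache key. It is not
| sccache's real key derivation; the function name and the choice of
| fields are assumptions.

```python
import hashlib
from typing import List


def cache_key(compiler: str, args: List[str], source_path: str, source: bytes) -> str:
    """Hash everything that could influence the output, including the path."""
    h = hashlib.sha256()
    for part in (compiler, *args, source_path):
        h.update(part.encode())
        h.update(b"\0")  # separator so adjacent fields cannot run together
    h.update(source)
    return h.hexdigest()


src = b"int main(void) { return 0; }\n"
key_a = cache_key("gcc", ["-O2"], "/home/a/project/foo.c", src)
key_b = cache_key("gcc", ["-O2"], "/home/b/project/foo.c", src)
key_a_again = cache_key("gcc", ["-O2"], "/home/a/project/foo.c", src)
```

| Here key_a and key_b differ even though the source bytes are the
| same, so the failure mode is a miss rather than a wrong object.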
| saghm wrote:
| > Setting it up took hardly 5 minutes for me (for Rust)
|
| You're underselling it, honestly. For those who haven't
| looked into it, if you want to enable it for all Rust
| projects on your system, literally all you need to do is install
| the binary and then add this to ~/.cargo/config, and then it
| will be enabled whenever you invoke `cargo` (or even rustc
| directly iirc) as the user whose config file you modified:
|
|     [build]
|     rustc-wrapper = "/path/to/sccache"
|
| I'm sure there are people with legitimate reasons for not
| wanting it enabled implicitly or who have multiple users they
| might have to set this up for, but for 90% of people it won't
| take any more time than it took to read this comment.
| sgammon wrote:
| If you're interested in a drop in remote cache for sccache, check
| out Buildless
|
| We just released S3 and Redis support. https://less.build
|
| Buildless also supports Gradle, Maven, Bazel, CCache and Turbo
| sgammon wrote:
| Our beta is open, just shoot me an email at sam@less.build if
| you want to try it out!
| phamilton wrote:
| I sent an email and it was blocked: "Recipient address
| rejected: Access denied."
| xjia wrote:
| What is the benefit of using a remote cache instead of a local
| ~/.cache directory? Is it only for sharing build results among
| team members? How do you make sure the build results are not
| spoofed?
| sgammon wrote:
| Sharing with team members, sharing with CI, and the ability
| to pull from more than just what's on your machine (i.e. a
| larger addressable cache than you are willing to keep on
| disk). Cache objects also compound across projects, so it's
| nice to ship them up somewhere and have them nearby when you
| need them.
|
| Re/spoofing, obviously it's all protected with API keys and
| tokens, and we're working on mechanisms to perform end-to-end
| encryption. In general, build cache objects are usually
| addressed by a content-addressable-hash, so that also helps
| because your build typically knows the content it's looking
| for and can verify.
|
| That isn't true for all tools, though, so we're working to
| understand where the gaps are and fix them.
| sgammon wrote:
| (Fwiw, group conversation encryption tech like MLS is
| somewhat applicable, and that's the sort of pattern we're
| looking at, but it would be cool to know if that's moving
| to you on the problem of safety w.r.t. builds.)
| Thorrez wrote:
| >In general, build cache objects are usually addressed by a
| content-addressable-hash
|
| How does that work? I would think the simplest case of a
| build object that needs to be cached is a .o file created
| from a .c file. The compiler sees the .c file and can
| determine its hash, but how can the compiler determine the
| hash of the .o file to know what to look up in the cache? I
| think the compiler would need to perform the lookup using
| the hash of the .c file, which isn't a hash of the data in
| the cache.
| sgammon wrote:
| In Bazel's case and other cases, build cache objects are
| held in CAS and then referenced from other keys. I
| believe BuildXL from Microsoft also works this way.
|
| Of course one other advantage to build caches is they are
| verifiable: the intent is to produce the exact same
| output as a normal call, and that's easily checked on the
| client side.
|
| No question that build caching poses inherent supply
| chain risks though and that's part of what we want to
| solve. I think people are hesitant to trust build caching
| for good reason until there are safer mechanisms and
| better cryptographic patterns applied.
| krupan wrote:
| When a .o is stored in the cache it is associated with
| the hash of the .c file
| aseipp wrote:
| In the case of the Remote Execution/Cache API used by
| Bazel among others[1] at least, it's a bit more detailed.
| There's an "ActionCache" and an actual content-addressed
| cache that just stores blobs
| ("ContentAddressableStorage"). When you run a `gcc -O2
| foo.c -o foo.o` command (locally or remotely; doesn't
| matter), you upload an "Action" into the action cache,
| which basically says "This command was run. As a result
| it had this stderr, stdout, error code, and these input
| files read and output files written." The input and
| output files are referenced by the hash of their
| contents, in this case, and they get uploaded into the
| CAS system.
|
| Most importantly you can look up an action in the
| ActionCache without actually running it, provided you
| have the inputs at hand. So now when another person comes
| by and runs the same build command, they say "Has this
| Action, with these inputs, been run before?" and the
| server can say "Yes, and the output is a file identified
| by hash XYZ" where XYZ is the hash of foo.o, so you can
| just instantly download it from the CAS.
|
| So there are a few more moving parts to make it all work.
| But the system really is ultimately content-addressed,
| for the most part.
|
| [1] https://github.com/bazelbuild/remote-apis/blob/main/build/ba...
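|
| The two-store model above can be sketched roughly as follows. This
| is a simplified in-memory illustration, not the Remote Execution
| API itself: the dict-based stores, the JSON action description, and
| the function names are all assumptions.

```python
import hashlib
import json
from typing import Dict, List, Optional

cas: Dict[str, bytes] = {}          # content hash -> blob
action_cache: Dict[str, dict] = {}  # action hash -> recorded result


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cas_put(blob: bytes) -> str:
    """Store a blob under the hash of its own contents."""
    key = digest(blob)
    cas[key] = blob
    return key


def action_key(command: List[str], inputs: Dict[str, bytes]) -> str:
    """An action is identified by its command line plus its input hashes."""
    desc = {
        "command": command,
        "inputs": {name: digest(data) for name, data in inputs.items()},
    }
    return digest(json.dumps(desc, sort_keys=True).encode())


def record(command: List[str], inputs: Dict[str, bytes],
           outputs: Dict[str, bytes], exit_code: int = 0) -> None:
    """After running a command, upload its blobs and its result."""
    for data in inputs.values():
        cas_put(data)
    action_cache[action_key(command, inputs)] = {
        "exit_code": exit_code,
        "outputs": {name: cas_put(data) for name, data in outputs.items()},
    }


def lookup(command: List[str], inputs: Dict[str, bytes]) -> Optional[Dict[str, bytes]]:
    """Ask 'has this action run before?' without running it."""
    result = action_cache.get(action_key(command, inputs))
    if result is None:
        return None
    fetched = {}
    for name, blob_hash in result["outputs"].items():
        blob = cas[blob_hash]
        # The client can check that the blob matches the hash it asked for.
        if digest(blob) != blob_hash:
            raise ValueError(f"corrupt blob for {name}")
        fetched[name] = blob
    return fetched


cmd = ["gcc", "-O2", "foo.c", "-o", "foo.o"]
record(cmd, {"foo.c": b"int main(void){return 0;}"}, {"foo.o": b"<fake object bytes>"})
hit = lookup(cmd, {"foo.c": b"int main(void){return 0;}"})
miss = lookup(cmd, {"foo.c": b"int main(void){return 1;}"})
```

| Note that the hash check only proves the blob matches its address;
| it says nothing about whether the recorded action result was honest.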
| sgammon wrote:
| Yep, aseipp, and we support the full gRPC interface for
| remote caching offered by Bazel, including the newer
| APIs.
|
| Explained better than I could for sure. I find it very
| interesting how BuildXL and Bazel ended up at similar
| models for this problem. I don't yet know the history of
| which informed which.
|
| (As compared to, say, Gradle, which works based on input
| hashes instead.)
| xjia wrote:
| IIUC the actual computation (e.g. compiling, linking, ...)
| happens on client (CI or developer) machines and the
| results are written to the server-side cache.
|
| By spoofing I meant to say that an authenticated but
| malicious client (intentionally or not, e.g. a clueless
| intern) may be able to write malicious contents to the
| cache. For example, their build toolchain could be
| contaminated and the resulting build outputs are
| contaminated. The "action" per se and its hash is still
| legit, but the hash is only used as the lookup key -- their
| corresponding value is "spoofed."
|
| The only safe way I can imagine to use such a remote cache
| is for CI to publish its build results so that they could
| be reused by developers. The direction from developers to
| developers or even to CI seems difficult to handle and has
| less value. But I might be missing some important insights
| here so my conclusion could be wrong.
|
| But if that's the case, is the most valuable use case to
| just configure the CI to read from / write to the remote
| cache, and developers to only read from the remote cache?
| And given such an assumption, is it much easier to
| design/implement a remote cache product?
| sgammon wrote:
| All great points but in practice, tools like Bazel and
| sccache are incredibly conservative about hashes
| matching, to include file path on disk and even env var
| state.
|
| One goal of these tools is to guarantee that such
| misconfiguration results in a cache key mismatch, rather
| than a hit and a bug.
|
| There are tons of challenges designing a remote build
| cache product, like anything, but that one has turned out
| to be a reliable truth.
|
| Some other interesting insights:
|
| - transmitting large objects is often not profitable, so
| we found that setting reasonable caps on what's shared
| with the cache can be really effective for keeping
| transmissions small and hits fast
|
| - deferring uploads is important because you can't
| penalize individual devs for contributing to the cache,
| and not everybody has a fast upload link. making this
| part smooth is important so that everyone can benefit
| from every compile.
|
| - build caching is ancient, Make does its own simple form
| of build caching, but the protocols for it vary in
| robustness greatly, from WebDAV in ccache to Bazel's gRPC
| interface
|
| - most GitHub Actions builds occur in a small physical
| area, so accelerating build artifacts is an easier
| problem than, say, full blown CDN serving
|
| The assumptions that definitely help:
|
| - it's a cache, not a database; things can be missing, it
| doesn't need strong consistency
|
| - replication lag is okay because a build cache entry is
| typically not requested multiple times in a short window
| of time; the client that created it has it locally
|
| - it's much better to give a fast miss than a slow hit,
| since the compiler is quite fast
|
| - it's much better to give a fast miss than an error. You
| can NEVER break a build; at worst it should just not be
| accelerated.
|
| It's an interesting problem to work on for sure.
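|
| Two of the points above (cap what gets uploaded, and turn errors
| into misses so a build never breaks) could look roughly like this.
| The class name, the callables standing in for a backend, and the
| size cap value are hypothetical; deferred uploads are not modelled.

```python
from typing import Callable, Dict, Optional

MAX_UPLOAD_BYTES = 8 * 1024 * 1024  # hypothetical cap, not a documented default


class FailOpenCache:
    """Wraps a cache backend so that failures degrade to misses."""

    def __init__(self,
                 fetch: Callable[[str], Optional[bytes]],
                 store: Callable[[str, bytes], None],
                 max_upload_bytes: int = MAX_UPLOAD_BYTES) -> None:
        self._fetch = fetch
        self._store = store
        self._max_upload_bytes = max_upload_bytes

    def get(self, key: str) -> Optional[bytes]:
        try:
            return self._fetch(key)
        except Exception:
            # A broken cache must never break the build: report a miss.
            return None

    def put(self, key: str, blob: bytes) -> bool:
        # Skip large objects; transferring them is often not worth it.
        if len(blob) > self._max_upload_bytes:
            return False
        try:
            self._store(key, blob)
        except Exception:
            return False
        return True


def unreachable_fetch(key: str) -> Optional[bytes]:
    raise ConnectionError("cache unreachable")


def unreachable_store(key: str, blob: bytes) -> None:
    raise ConnectionError("cache unreachable")


stored: Dict[str, bytes] = {}
healthy = FailOpenCache(stored.get, stored.__setitem__, max_upload_bytes=16)
broken = FailOpenCache(unreachable_fetch, unreachable_store)

miss_on_outage = broken.get("k")             # None instead of an exception
small_ok = healthy.put("small", b"tiny")     # under the cap, stored
big_skipped = healthy.put("big", b"x" * 64)  # over the cap, not uploaded
```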
| aseipp wrote:
| Not just team members; if you make your cache publicly
| readable, contributors to e.g. your GitHub/GitLab/Whatever
| project can also use them and get really fast builds, the
| first time they try to contribute. So a remote cache is nice
| to have, if it's seamless.
|
| Nix works this way by default (and much of the community
| operates caches like this) and it can be a massive, massive
| time saver.
|
| > How do you make sure the build results are not spoofed?
|
| What do you mean "spoofed?" As in, someone put an evil
| artifact in the cache? Or overwrote an existing artifact with
| a new one? Or someone just stole your developer's access and
| started shoving shit in there? There's a whole bunch of small
| details here that really matter to understand what
| security/integrity properties you want the cache to uphold.
|
| FWIW, I've been looking into this in Buck2/Bazel land, and my
| understanding is that most large orgs just use some kind of
| terminating auth proxy that the underlying
| connection/flow/build artifacts can be correlated back to. So
| you know this cache artifact was first inserted by build B,
| done by user X, who authenticated with their key K, etc etc.
| sgammon wrote:
| Exactly -- just like Git, everything is ultimately
| identified with a key which can tie back to a stable
| identity thru OIDC or similar mechanisms. At least that's
| how we did it.
| yjftsjthsd-h wrote:
| Nix only caches at the package level, doesn't it?
| sgammon wrote:
| Nix is different, yeah, and it won't wire together a
| build cache for you. Nix is great for many things of
| course, it's just not a replacement for sccache per se
|
| Nix + sccache would probably be pretty great for
| preserving paths and environment, which is really healthy
| for build caching in general.
| mgaunard wrote:
| I built my own build system that does something similar.
|
| I've set it up at work with two S3 buckets: trusted and
| untrusted. CI/CD read/write from trusted only. Developers
| read/write from untrusted, and read-only from trusted.
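|
| The access rules for the two buckets can be written down as a small
| policy sketch. This is not the commenter's build system; the role
| names and the in-memory dicts standing in for S3 buckets are
| assumptions.

```python
from typing import Dict, Optional


class SplitCache:
    """Two stores: CI uses only 'trusted'; developers also use 'untrusted'."""

    def __init__(self) -> None:
        self.trusted: Dict[str, bytes] = {}
        self.untrusted: Dict[str, bytes] = {}

    def get(self, role: str, key: str) -> Optional[bytes]:
        # Everyone may read trusted artifacts, and they take precedence.
        if key in self.trusted:
            return self.trusted[key]
        # Only developers fall back to artifacts from other developers.
        if role == "developer":
            return self.untrusted.get(key)
        return None

    def put(self, role: str, key: str, blob: bytes) -> None:
        if role == "ci":
            self.trusted[key] = blob
        elif role == "developer":
            self.untrusted[key] = blob
        else:
            raise ValueError(f"unknown role: {role}")


cache = SplitCache()
cache.put("developer", "k1", b"dev build")
ci_sees = cache.get("ci", "k1")          # CI never reads developer output
dev_sees = cache.get("developer", "k1")  # developers can share with each other
cache.put("ci", "k1", b"ci build")
dev_sees_after_ci = cache.get("developer", "k1")
```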
| sgammon wrote:
| We decided to back our main cache with in-memory storage
| for spicier performance. I'm curious how well S3 has worked
| for you here? Is it fast enough?
|
| Or, maybe the blobs you're dealing with are on the bigger
| end? That would also make sense
| mgaunard wrote:
| Each object file (.o) has a unique hash and is stored as
| thehash.o.
|
| It's certainly much faster to download the .o than it is
| to build it. Once it's downloaded it stays on the local
| filesystem until it's garbage-collected.
| sgammon wrote:
| Hm, interesting. Our free tier is planned to be this plus
| R2, so I'm happy to hear S3-style data exchange is
| working for people. Thanks for sharing
| mgaunard wrote:
| The whole point of S3 is that it is inexpensive. You
| don't want to pay premium money for terabytes of data
| that are usually invalidated every time someone makes a
| significant change.
| throwawaaarrgh wrote:
| It's for sharing and aggregating. Ccache is useful locally,
| but really shines when combined with Distcc, a distributed
| compiler. Every host contributes a cache object that other
| hosts can use, and every host can use the cache object
| contributed by other hosts. So you don't even have to build
| it once yourself to benefit from the cache of everyone else.
| It therefore speeds up multiple hosts/users builds,
| distributed builds and the dev experience of individuals.
| __float wrote:
| Is this only a remote _cache_ for Bazel, but it does not
| support the remote execution API at all? It's a little
| worrisome to trust all user outputs when you do not also
| control the execution of them. (In the "best" case this could
| mean caching non-reproducible ("works on my machine") build
| results, in the worst case this could be actively dangerous if
| a malicious user poisons the build cache.)
| sgammon wrote:
| It's only a remote cache and that's deliberate. We see it as
| much safer to only offer a cache that the user can control
| and use however they want
|
| We would see taking over execution of your build as much more
| dangerous.
|
| No question though that build caching in shared form, in SaaS
| form, needs extra special attention paid to security. Our
| product doesn't introspect cache blobs and in fact doesn't
| really want to. Once we figure out how to make the crypto
| work, we shouldn't be able to see any of that data at all.
|
| Access can be made public for reads (OSS) but is always
| identified for writes.
| sgammon wrote:
| (Also, speaking as a Bazel user now, the Remote Execution
| APIs have always been a bit brittle and hard to set up, use,
| and maintain; certainly harder than just setting a cache
| endpoint.
|
| I've found that remote execution ends up returning much less
| benefit than remote caching, but that's just me and it's
| entirely possible I Did It Wrong the whole time)
| sgammon wrote:
| Yes!! Glad people are thinking about this. We just added Cache
| Projects which will be launching soon, it should allow this
| style of public cache sharing.
|
| The intent with Buildless is to release a free-first toolchain
| that helps with build caching in earnest and makes the whole
| problem much less error prone. Then the Cloud stuff on top is
| for groups who need more gas. Cloudflare is generously
| supporting our upcoming free tier.
| satvikpendem wrote:
| I use this with Rust, works great. Simply add a
| ~/.cargo/config.toml with
|
|     [build]
|     rustc-wrapper = "/path/to/sccache"
|
| And it will work everywhere with cargo. I also like to combine it
| with the mold linker.
| throwup238 wrote:
| You can also set the RUSTC_WRAPPER environment variable to make
| it system wide.
| pie_flavor wrote:
| The parent comment is also system wide.
| sodality2 wrote:
| It should be system wide already because it's in ~/.cargo
| heads wrote:
| We used this for a while at Speechmatics but our LM researchers
| have a well established workflow based on git working copies on
| NFS /home and we had a lot of instability between sccache and
| NFS.
|
| Is sccache susceptible to cache misses when using full paths as
| cache keys? It would be very helpful if compiling
| "/home/heads/project/foo.c" could use the cached result of
| compiling "/home/thunderbong/project/foo.c".
| phlip9 wrote:
| sccache only caches if builds are run from the same absolute
| path, so indeed different home dirs won't work
| tentacleuno wrote:
| As a bystander, what would the reasoning be for doing this? I
| would have assumed that they'd hash each file and use that as
| a key in a lookup table.
| sgammon wrote:
| In some languages, symbols are provided which evaluate to a
| file's path or directory parent, so program behavior can
| vary even for the same content hash. That's just one way
| paths can bleed in to violate hermeticity/correctness.
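|
| C's __FILE__ macro is one such symbol. As a small analogue, Python's
| own compiler embeds the file name in the compiled code object, so
| identical source compiled under two paths yields different
| artifacts. The paths below are made up, and this shows the general
| effect only, not anything sccache does.

```python
import marshal

SOURCE = "x = 1\n"

# Same source text, compiled as if it lived at two different paths.
code_a = compile(SOURCE, "/home/heads/project/foo.py", "exec")
code_b = compile(SOURCE, "/home/thunderbong/project/foo.py", "exec")

# The path is recorded inside the compiled object...
paths_differ = code_a.co_filename != code_b.co_filename

# ...so the serialized artifacts are not byte-identical either.
artifacts_differ = marshal.dumps(code_a) != marshal.dumps(code_b)
```

| A cache keyed only on file contents would treat these two
| compilations as interchangeable even though their outputs differ.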
| MereInterest wrote:
| It sounds like somebody assuming a docker build, where
| everybody's build will use the same file path. It's still a
| very silly restriction, because not everything occurs
| within docker.
| sgammon wrote:
| It's an unfortunate safety tradeoff to guarantee
| consistency. Better visibility into program behavior
| could fix it.
| slavik81 wrote:
| With ccache that would be a cache miss by default. However,
| that could be made a cache hit by configuring the
| CCACHE_BASEDIR option. There doesn't seem to be an exact
| equivalent in sccache.
| https://github.com/mozilla/sccache/issues/35
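|
| The idea behind a base-directory option can be sketched as
| rewriting absolute paths under a chosen base into relative ones
| before hashing. This is only the general idea, not ccache's actual
| implementation; the function names and key layout are assumptions.

```python
import hashlib
import posixpath
from typing import Optional


def normalize(path: str, base_dir: Optional[str]) -> str:
    """Make paths under base_dir relative; leave everything else alone."""
    if base_dir:
        base = base_dir.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return posixpath.relpath(path, base)
    return path


def cache_key(source_path: str, source: bytes, base_dir: Optional[str] = None) -> str:
    h = hashlib.sha256()
    h.update(normalize(source_path, base_dir).encode())
    h.update(b"\0")
    h.update(source)
    return h.hexdigest()


src = b"int main(void) { return 0; }\n"

# Without a base directory, two checkouts produce different keys.
plain_a = cache_key("/home/heads/project/foo.c", src)
plain_b = cache_key("/home/thunderbong/project/foo.c", src)

# With each user's checkout as the base, both hash the path "foo.c".
based_a = cache_key("/home/heads/project/foo.c", src, "/home/heads/project")
based_b = cache_key("/home/thunderbong/project/foo.c", src, "/home/thunderbong/project")
```

| The tradeoff is the one discussed above: if the compiled output
| really does depend on the absolute path, a shared key can hand back
| an artifact that is subtly wrong for the second checkout.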
| goodpoint wrote:
| Pity they are not using GPL or at least MPL.
| szundi wrote:
| My first thought is how to prevent bad guys injecting rootkit
| binaries into these systems.
| throwawaaarrgh wrote:
| It would be nice if every application in the world didn't have to
| hack on support for X number of different almost-identical
| storage vendors who all decided it would be better to have
| completely incompatible interfaces.
|
| NFS isn't horrible, though it's limited. OTOH object storage is
| limited but has some other advantages. So there needs to either
| be the latter as a standard that can be adopted by all vendors,
| apps and OSes, or a new standard that fills the gaps of what apps
| want.
| sgammon wrote:
| sccache uses OpenDAL... https://opendal.apache.org/
| tedunangst wrote:
| Next step: store the object files in the blockchain to make a
| global cache so everyone can compile the browser at the speed of,
| uh...
| sgammon wrote:
| I know you're joking, but Unison built this without the
| blockchain cruft on top. Very cool project.
|
| When any unique piece of Unison is compiled by anyone, it no
| longer needs to be compiled by everyone.
|
| https://www.unison-lang.org/
___________________________________________________________________
(page generated 2023-12-24 23:00 UTC)