[HN Gopher] Towards Idempotent Rebuilds?
___________________________________________________________________
Towards Idempotent Rebuilds?
Author : JNRowe
Score : 49 points
Date : 2024-07-10 13:01 UTC (9 hours ago)
(HTM) web link (blog.josefsson.org)
(TXT) w3m dump (blog.josefsson.org)
| outsomnia wrote:
| I was expecting to see the word "yocto", did I miss it?
|
| This goes quite far along the path, building all the build tools
| and toolchain to the same version before building the packages.
| ethegwo wrote:
| As I know, LLVM optimizer (maybe also GCC) is not idempotent,
| does it mean we need to pay performance for idempotent buildings?
| bpfrh wrote:
| Shouldn't same system+same version+same flags produce the same
| result?
| jiehong wrote:
| The algorithms used might not produce a deterministic result
| (like order is different), and maybe it also depends on
| timings of the CPU with threads scheduled on different cores,
| or faster/slower cores, etc.
|
| Deterministic builds aren't always that straightforward
| 0x457 wrote:
| I'm using nix and clang/llvm - never have I ever got a
| different binary.
|
| If what you're saying is true - stage 3 bootstrap wouldn't
| be possible.
| malkia wrote:
| bit obvious, but all it takes something that takes
| something that changes everytime to make it non-
| reproducible - like __DATE__ and __TIME__
| zaphar wrote:
| Nix solves this by forcing __DATE__ and __TIME__ to be a
| single immutable value in the sandbox for all builds. If
| you look in the store the file timestamps will be this
| value.
| plorkyeran wrote:
| All C and C++ compilers in common use are single-threaded,
| and parallelism is achieved by compiling multiple files at
| once. An optimizer pass producing nondeterministic results
| due to a bug is not impossible, but would be a bug to fix.
| Multithreaded linkers which are expected to have
| nondeterministic results do exist, and you have to not use
| that for reproduce able builds.
|
| There's a long tail of tricky problems to solve for
| reproducible builds, but a specific build of a compiler
| producing different executable code from run to run hasn't
| really been one of them.
| bobmcnamara wrote:
| Now I'm wondering how differently my multithreaded tcc
| fork produces code. I don't think it should but it always
| wouldn't surprise me if there was some shared state
| remaining.
| kardos wrote:
| Forcing the ordering to be deterministic is a well-
| established part of the process
|
| https://reproducible-builds.org/docs/stable-inputs/
|
| https://reproducible-builds.org/docs/stable-outputs/
| patmorgan23 wrote:
| Compilation performance or end executable performance?
| ethegwo wrote:
| The latter I mean, conservative optimization level is benefit
| to keep idempotency
| vzaliva wrote:
| Coming from maths, I am confused by use of the term "idempotent"
| here. Unless we are talking about bootstrapping a compiler and I
| do not see how it applicable here. Am I missing something?
| algernonramone wrote:
| I feel that "deterministic" is probably a better word here than
| "idempotent".
| craigching wrote:
| Maybe even hermetic builds?
| https://bazel.build/basics/hermeticity
| appplication wrote:
| Not speaking to your comment specifically, but adding context
| to the thread: idempotence would mean having the same result,
| regardless of whether you run something once, twice, or 10
| times, over the same input set. Idempotence requires but goes
| beyond determinism, as it also accounts for any externalities
| and side effects.
|
| For example, let's consider a function that accepts a string
| as an argument and then writes that string to disk. We can
| consider the disk state as a side effect of the function.
|
| The function itself is perfectly deterministic (output string
| is a predictable and consistent function of input string),
| but depending on the implementation of side effects it may
| not be idempotent. If, for example, this function room simply
| added the output to a file "output.txt", this file would grow
| with every incantation, which is not idempotent. If instead
| we overwrote the output file so that it reflects only the
| singular output of the previous run, then the side effects
| would also be deterministic, that would be idempotent.
|
| At a pedantic level you could redefine your scope of
| deterministic to not just include outputs, but also include
| the external state and side effects, but for practical
| purposes the above distinction is generally how deterministic
| and idempotent would be applied in practice in computing. I
| cannot speak to the math-centric view, if there is a
| different definition there.
| eikenberry wrote:
| I don't think the string to disk example can be idempotent
| no matter how you implement it as it is dependent on the
| external state of the disk.
| appplication wrote:
| That may be true on a theoretical level, but if you're
| talking practically to a data engineer that's the
| definition of idempotence you're going to find they
| expect.
| necovek wrote:
| Practical consideration might be that a disk may
| experience a fault in one of the executions: works fine a
| hundred times but fails on 101st (eg hits disk-full error
| or a bad block).
|
| But that simply means it's harder to implement true
| idempotency when it comes to disk usage.
|
| This is why the problem is usually simplified to ignore
| unhappy paths.
| dandragona wrote:
| This captures the mathematical definition too which is just
| that an element x is idempotent if x applied to itself
| gives you back x. I.e, what you said that the function
| applied many times produces no change to the system.
| IshKebab wrote:
| To give a concrete example: some build systems embed a "build
| number" in the output that increments for each non-clean
| build (yeah this is stupid but I have seen it).
|
| This is deterministic (doesn't change randomly), but not
| idempotent.
| benreesman wrote:
| A lot of things in computer science that are conceptually
| functions actually aren't, in this context when one compiles
| e.g. a C program there is a bunch of state, some pretty
| counter-intuitive, that the resulting object files or
| executable artifacts to change even though the source code did
| not. This has serious implication for relying on computer
| systems.
|
| When a more systems-inclined person in computer science uses
| the word idempotent they very often mean something like
| "repeatable" or "deterministic".
|
| AFIAK this usage first gained traction in the theory of
| distributed systems, in which it (roughly) means that one
| application of an operation might change the state of the
| system, but subsequent applications will not change it further.
| Set union is sort of the canonical example.
| jmull wrote:
| They are definitely redefining "idempotent" here, even from a
| software development standpoint.
|
| Since we already have "deterministic build" and "reproducible
| build" for this, there's no need to overload "idempotent"
| (which already causes confusion).
| kreetx wrote:
| Idempotent sounds confusing to me as an idempotent HTTP
| request (not very mathy, I know) I could run many times over,
| the effect being as if it were done once on the first time.
|
| For a software build it sounds like running `make` multiple
| times. But those builds will be idempotent even without
| reproducibility/determinism as they happen on the same
| system.
| NortySpock wrote:
| But isn't that just deterministic behavior?
|
| "Multiple runs of the same compiler on the same code
| produced the same output"
|
| No more revolutionary a concept than a hash function.
| Valuable like a hash function, but not revolutionary.
| jmull wrote:
| Well, a second call to "make" with unmodified source on a
| local system has the same resulting output because make
| knows how to avoid rebuilding output files when the
| source hasn't changed since the last build.
|
| The goal of reproducible/deterministic builds is
| typically to be able to produce the same output files
| _without_ already having that output. (That 's because
| the people interested in reproducible/deterministic
| builds are usually trying to prove something about the
| relationship between the source and the build output.
| Make doesn't really do that -- the only thing it can
| prove is that the mod times of output files are later
| than the mod times of the source files they depend on,
| according to the make file.)
|
| So, yes, make is typically deterministic in a general
| sense (given a good makefile and certain assumptions that
| are pretty reasonable in the context of a developer doing
| development on a local machine). But isn't what people
| are looking for from reproducible/deterministic builds.
| Animats wrote:
| Agree. "Reproducible" is the right term here.
|
| An idempotent build is when you type "make" twice and the
| second one doesn't do anything.
|
| This stuff matters. You can't trust open source supply chains
| any more. There have been too many incidents of someone
| inserting a security hole.
| kaivi wrote:
| We already have isomorphic web apps, some day we might see a
| bundler advertised as "Calabi-Yau with benefits of mirror
| symmetry and vanishing first Chern class".
| lanstin wrote:
| A second countable language basis.
| no-dr-onboard wrote:
| I maintain the reproducible builds effort for my company and,
| please, let me tell you that this is the main pitfall of the
| whole effort.
|
| There is always going to be a degree of un-reproducibility just
| due to the nature of math. If you don't have the same system,
| same compiler version (down to the minor or patch level), same
| dependency versions, same build flags, filesystem ordering, OS
| handling etc. . .you're going to get differences.
|
| The RB project has readily disclosed that there is a degree of
| "significantly reproducible" sussing that each end user is going
| to have to do. The fact that the Debian maintainers chose not to
| display the degree of reproducibility is probably because showing
| low reproducibility scores undermines the efforts to evangelize
| the movement.
|
| I think that's understandable, but also is a bit of a two edged
| sword. If we don't disclose scores, we allow for the
| misrepresentation that "this is safe because it has the word
| reproducible in it". If we disclose scores, we get articles like
| this saying "wow, thats a really low score, wtf" and short lived
| paranoia gives way to ambivalence about the whole thing.
|
| It's difficult to capture the nuance in this in pithy tidbits,
| hence blog post on HN with me explaining this :).
| BonusPlay wrote:
| From my experience, you either go full reproducible builds with
| nix, or none at all.
|
| Sitting in the middle results with additional downsides from
| modifying pipeline without core upsides of reproducible builds.
| kardos wrote:
| The reproducible builds effort means same input (code,
| libraries, compiler, flags, build-environment), should give the
| same output (build artifacts). I don't think RB is trying to
| get reproducible across platforms and across varying
| compilers/dependencies -- that is a much bigger and much harder
| goal.
|
| > There is always going to be a degree of un-reproducibility
| just due to the nature of math.
|
| Fundamentally the math is deterministic (reproducible): if you
| do the same sequence of mathematical operations you get the
| same result. I gather you're getting at the non-associativity
| of floating point (eg additions), which is a fair point, but if
| you arrange to do your floating point operations in the same
| sequence then it will be reproducible.
| bobmcnamara wrote:
| For these sorts of things, please do not assume hardware is
| reliable and compilers are bug free.
| kardos wrote:
| Sounds like a mechanism to test for bad hardware (akin to
| memtest86) and find compiler bugs =)
| bobmcnamara wrote:
| Bingo - even if we do everything right the percentage of
| matching builds isn't expected to be 100%. If it is,
| there just aren't enough samples.
|
| My first Pentium could usually but not always math
| correctly until Intel replaced it during a recall.
| 8organicbits wrote:
| Bug free is less important for reproducible builds, so long
| as the bugs are reproducible. Compiler bugs are of course
| still a problem for other reasons.
| bobmcnamara wrote:
| My point is only that even when the software is correct,
| the reproducible build rate will either be 100% or we've
| got a representative sample set.
|
| Until then, those issues will continue to be lumped
| together with all the software reasons it wasn't
| reproducible.
| orbital-decay wrote:
| Multithreaded code can easily be non-deterministic, with
| things like OS preemption leaking into the sandbox.
| Guaranteed determinism is hard, it needs a deterministic
| emulator.
| Quekid5 wrote:
| Not really for building code, at least. It's a common
| _problem_ , but it's not insurmountable. Use the equivalent
| of 'make -j1' or whatever...
|
| This requires engineering on the part of compiler writers,
| but ultimately is solvable, given enough funding.
|
| (About the compiler writers part: I'm thinking about GHC
| Haskell and the like... the Clangs/GCCs of the world will
| have no problems because of Translation Units. Things may
| change when Modules are a practical choice.)
| alganet wrote:
| > (...) rebuilding packages using a bootsrappable build process,
| both seem orthogonal to the idempotent rebuild problem
|
| You know what would be awesome? If someone could start from,
| let's say, live-bootstrap[1] and build towards matching the
| checksums for some distro kernel+toolchain.
|
| It sounds like the same kind of problem, it all comes down to
| knowing what build conditions affect the resulting binaries, so I
| think you nailed the problem description on this and yes, it all
| feels very orthogonal from that perspective!
|
| Thanks for writing this blog entry!
|
| [1]: http://github.com/fosslinux/live-bootstrap
| simpaticoder wrote:
| Idempotent, deterministic builds are an argument in favor of
| synthetic virtual machines. Synthetic VMs like the JVM or CLR
| should, at least insofar as they don't contain native code,
| execute in a manner largely isolated from the vagaries of minor
| hardware/OS differences. Not an expert, but native VMs do not and
| cannot isolate processes from hardware details (e.g. Xen, Virtual
| Box), or from OS details (e.g. Docker, containerd).
| saagarjha wrote:
| This is most true of software VMs too, which leak information
| via side channels.
| nixosbestos wrote:
| Men will do anything except go to therapy ^H^H^H^H^H^H^H^H^H^H
| learn Nix.
| gaganyaan wrote:
| It's odd that the article didn't mention nix, at least to say
| "We didn't like it for this reason".
| hansvm wrote:
| Nix's definition of "reproducible" is different from what the
| author says they want though. Nix gives you a similar system
| from a given config, when it works, where the author wants
| bitwise identical results (even when compiled at a much later
| point in time and on a different machine, which you can't
| accomplish just from caching (unless the caches are remotely
| hosted)).
|
| The reason they don't like Nix should be obvious; it
| blatantly and vocally doesn't even try to do the thing the
| authors want. It'd be just as relevant to bring up Gentoo or
| something -- another fine project, like Nix, but not
| especially helpful for the stated problem, and so obviously
| so that it shouldn't be worth paying lip service to in the
| article.
| narrowbyte wrote:
| "Doesn't even try" is too strong.
|
| "When compiling from the same source on independent
| infrastructure yields bit-by-bit identical results, this
| gives confidence that the build infrastructure was not
| compromised and the artifact really does correspond to the
| source." - https://reproducible.nixos.org/
| nixosbestos wrote:
| >Nix gives you a similar system from a given config, when
| it works
|
| Not going to lie, I either don't understand what you're
| saying, or don't understand the shade you're trying to
| throw.
|
| Nix is going to do as good, or better a job, than any other
| solution other there. Probably with a lot less duck-tape.
|
| Not to mention that much of Nix packages are reliably bit-
| for-bit reproducible.
|
| Not to not to mention that bit-for-bit reproducability is
| really only beneficial for "trust", not reliability or re-
| playability.
|
| And no, comparing Nix to Gentoo in this case only very much
| confirms you don't understand Nix. Maybe you should check
| out the talk from a past NixCon where the speaker rebuilds
| Firefox from 10 years ago and boots up Flash swfs.
| abathur wrote:
| [delayed]
| xiande04 wrote:
| Nix?
| spyspy wrote:
| I spent some time trying to prove two go binaries were the same
| in the name of reproducible builds but couldn't figure out if it
| was possible, even though I had built both myself and knew they
| were in effect the exact same. Go binaries have some sort of
| randomness (timestamp? Map entry? No idea) that I couldn't pin
| down. Sometimes the hash of the binaries were the same and
| sometimes they weren't. Short of cataloguing and hashing every
| file that went into the build I couldn't figure it out and gave
| up.
| aleksi wrote:
| It is possible to make Go builds reproducible. The first step
| might be to inspect two binaries with the diffoscope
| (https://diffoscope.org).
| compiler-guy wrote:
| For whatever it is worth, Google internal builds using the
| internal version of bazel are deterministic and reproducible. And
| google spends a lot of time and effort keeping them that way. You
| do have to ensure that nothing ever sorts based on pointer value,
| for example.
|
| Clang works fine as a compiler for this--there is nothing in it
| that normally produces different results due to timing or
| whatever. When something does leak in, we fix it upstream. You do
| have to ensure that no one uses __DATE__ or similar macros, or
| that you redefine them to a known value on the command line.
___________________________________________________________________
(page generated 2024-07-10 23:01 UTC)