[HN Gopher] Towards Idempotent Rebuilds?
       ___________________________________________________________________
        
       Towards Idempotent Rebuilds?
        
       Author : JNRowe
       Score  : 49 points
       Date   : 2024-07-10 13:01 UTC (9 hours ago)
        
 (HTM) web link (blog.josefsson.org)
 (TXT) w3m dump (blog.josefsson.org)
        
       | outsomnia wrote:
       | I was expecting to see the word "yocto", did I miss it?
       | 
       | This goes quite far along the path, building all the build tools
       | and toolchain to the same version before building the packages.
        
       | ethegwo wrote:
       | As I know, LLVM optimizer (maybe also GCC) is not idempotent,
       | does it mean we need to pay performance for idempotent buildings?
        
         | bpfrh wrote:
         | Shouldn't same system+same version+same flags produce the same
         | result?
        
           | jiehong wrote:
           | The algorithms used might not produce a deterministic result
           | (like order is different), and maybe it also depends on
           | timings of the CPU with threads scheduled on different cores,
           | or faster/slower cores, etc.
           | 
           | Deterministic builds aren't always that straightforward
        
             | 0x457 wrote:
             | I'm using nix and clang/llvm - never have I ever got a
             | different binary.
             | 
             | If what you're saying is true - stage 3 bootstrap wouldn't
             | be possible.
        
               | malkia wrote:
               | bit obvious, but all it takes something that takes
               | something that changes everytime to make it non-
               | reproducible - like __DATE__ and __TIME__
        
               | zaphar wrote:
               | Nix solves this by forcing __DATE__ and __TIME__ to be a
               | single immutable value in the sandbox for all builds. If
               | you look in the store the file timestamps will be this
               | value.
        
             | plorkyeran wrote:
             | All C and C++ compilers in common use are single-threaded,
             | and parallelism is achieved by compiling multiple files at
             | once. An optimizer pass producing nondeterministic results
             | due to a bug is not impossible, but would be a bug to fix.
             | Multithreaded linkers which are expected to have
             | nondeterministic results do exist, and you have to not use
             | that for reproduce able builds.
             | 
             | There's a long tail of tricky problems to solve for
             | reproducible builds, but a specific build of a compiler
             | producing different executable code from run to run hasn't
             | really been one of them.
        
               | bobmcnamara wrote:
               | Now I'm wondering how differently my multithreaded tcc
               | fork produces code. I don't think it should but it always
               | wouldn't surprise me if there was some shared state
               | remaining.
        
             | kardos wrote:
             | Forcing the ordering to be deterministic is a well-
             | established part of the process
             | 
             | https://reproducible-builds.org/docs/stable-inputs/
             | 
             | https://reproducible-builds.org/docs/stable-outputs/
        
         | patmorgan23 wrote:
         | Compilation performance or end executable performance?
        
           | ethegwo wrote:
           | The latter I mean, conservative optimization level is benefit
           | to keep idempotency
        
       | vzaliva wrote:
       | Coming from maths, I am confused by use of the term "idempotent"
       | here. Unless we are talking about bootstrapping a compiler and I
       | do not see how it applicable here. Am I missing something?
        
         | algernonramone wrote:
         | I feel that "deterministic" is probably a better word here than
         | "idempotent".
        
           | craigching wrote:
           | Maybe even hermetic builds?
           | https://bazel.build/basics/hermeticity
        
           | appplication wrote:
           | Not speaking to your comment specifically, but adding context
           | to the thread: idempotence would mean having the same result,
           | regardless of whether you run something once, twice, or 10
           | times, over the same input set. Idempotence requires but goes
           | beyond determinism, as it also accounts for any externalities
           | and side effects.
           | 
           | For example, let's consider a function that accepts a string
           | as an argument and then writes that string to disk. We can
           | consider the disk state as a side effect of the function.
           | 
           | The function itself is perfectly deterministic (output string
           | is a predictable and consistent function of input string),
           | but depending on the implementation of side effects it may
           | not be idempotent. If, for example, this function room simply
           | added the output to a file "output.txt", this file would grow
           | with every incantation, which is not idempotent. If instead
           | we overwrote the output file so that it reflects only the
           | singular output of the previous run, then the side effects
           | would also be deterministic, that would be idempotent.
           | 
           | At a pedantic level you could redefine your scope of
           | deterministic to not just include outputs, but also include
           | the external state and side effects, but for practical
           | purposes the above distinction is generally how deterministic
           | and idempotent would be applied in practice in computing. I
           | cannot speak to the math-centric view, if there is a
           | different definition there.
        
             | eikenberry wrote:
             | I don't think the string to disk example can be idempotent
             | no matter how you implement it as it is dependent on the
             | external state of the disk.
        
               | appplication wrote:
               | That may be true on a theoretical level, but if you're
               | talking practically to a data engineer that's the
               | definition of idempotence you're going to find they
               | expect.
        
               | necovek wrote:
               | Practical consideration might be that a disk may
               | experience a fault in one of the executions: works fine a
               | hundred times but fails on 101st (eg hits disk-full error
               | or a bad block).
               | 
               | But that simply means it's harder to implement true
               | idempotency when it comes to disk usage.
               | 
               | This is why the problem is usually simplified to ignore
               | unhappy paths.
        
             | dandragona wrote:
             | This captures the mathematical definition too which is just
             | that an element x is idempotent if x applied to itself
             | gives you back x. I.e, what you said that the function
             | applied many times produces no change to the system.
        
           | IshKebab wrote:
           | To give a concrete example: some build systems embed a "build
           | number" in the output that increments for each non-clean
           | build (yeah this is stupid but I have seen it).
           | 
           | This is deterministic (doesn't change randomly), but not
           | idempotent.
        
         | benreesman wrote:
         | A lot of things in computer science that are conceptually
         | functions actually aren't, in this context when one compiles
         | e.g. a C program there is a bunch of state, some pretty
         | counter-intuitive, that the resulting object files or
         | executable artifacts to change even though the source code did
         | not. This has serious implication for relying on computer
         | systems.
         | 
         | When a more systems-inclined person in computer science uses
         | the word idempotent they very often mean something like
         | "repeatable" or "deterministic".
         | 
         | AFIAK this usage first gained traction in the theory of
         | distributed systems, in which it (roughly) means that one
         | application of an operation might change the state of the
         | system, but subsequent applications will not change it further.
         | Set union is sort of the canonical example.
        
         | jmull wrote:
         | They are definitely redefining "idempotent" here, even from a
         | software development standpoint.
         | 
         | Since we already have "deterministic build" and "reproducible
         | build" for this, there's no need to overload "idempotent"
         | (which already causes confusion).
        
           | kreetx wrote:
           | Idempotent sounds confusing to me as an idempotent HTTP
           | request (not very mathy, I know) I could run many times over,
           | the effect being as if it were done once on the first time.
           | 
           | For a software build it sounds like running `make` multiple
           | times. But those builds will be idempotent even without
           | reproducibility/determinism as they happen on the same
           | system.
        
             | NortySpock wrote:
             | But isn't that just deterministic behavior?
             | 
             | "Multiple runs of the same compiler on the same code
             | produced the same output"
             | 
             | No more revolutionary a concept than a hash function.
             | Valuable like a hash function, but not revolutionary.
        
               | jmull wrote:
               | Well, a second call to "make" with unmodified source on a
               | local system has the same resulting output because make
               | knows how to avoid rebuilding output files when the
               | source hasn't changed since the last build.
               | 
               | The goal of reproducible/deterministic builds is
               | typically to be able to produce the same output files
               | _without_ already having that output. (That 's because
               | the people interested in reproducible/deterministic
               | builds are usually trying to prove something about the
               | relationship between the source and the build output.
               | Make doesn't really do that -- the only thing it can
               | prove is that the mod times of output files are later
               | than the mod times of the source files they depend on,
               | according to the make file.)
               | 
               | So, yes, make is typically deterministic in a general
               | sense (given a good makefile and certain assumptions that
               | are pretty reasonable in the context of a developer doing
               | development on a local machine). But isn't what people
               | are looking for from reproducible/deterministic builds.
        
           | Animats wrote:
           | Agree. "Reproducible" is the right term here.
           | 
           | An idempotent build is when you type "make" twice and the
           | second one doesn't do anything.
           | 
           | This stuff matters. You can't trust open source supply chains
           | any more. There have been too many incidents of someone
           | inserting a security hole.
        
         | kaivi wrote:
         | We already have isomorphic web apps, some day we might see a
         | bundler advertised as "Calabi-Yau with benefits of mirror
         | symmetry and vanishing first Chern class".
        
           | lanstin wrote:
           | A second countable language basis.
        
       | no-dr-onboard wrote:
       | I maintain the reproducible builds effort for my company and,
       | please, let me tell you that this is the main pitfall of the
       | whole effort.
       | 
       | There is always going to be a degree of un-reproducibility just
       | due to the nature of math. If you don't have the same system,
       | same compiler version (down to the minor or patch level), same
       | dependency versions, same build flags, filesystem ordering, OS
       | handling etc. . .you're going to get differences.
       | 
       | The RB project has readily disclosed that there is a degree of
       | "significantly reproducible" sussing that each end user is going
       | to have to do. The fact that the Debian maintainers chose not to
       | display the degree of reproducibility is probably because showing
       | low reproducibility scores undermines the efforts to evangelize
       | the movement.
       | 
       | I think that's understandable, but also is a bit of a two edged
       | sword. If we don't disclose scores, we allow for the
       | misrepresentation that "this is safe because it has the word
       | reproducible in it". If we disclose scores, we get articles like
       | this saying "wow, thats a really low score, wtf" and short lived
       | paranoia gives way to ambivalence about the whole thing.
       | 
       | It's difficult to capture the nuance in this in pithy tidbits,
       | hence blog post on HN with me explaining this :).
        
         | BonusPlay wrote:
         | From my experience, you either go full reproducible builds with
         | nix, or none at all.
         | 
         | Sitting in the middle results with additional downsides from
         | modifying pipeline without core upsides of reproducible builds.
        
         | kardos wrote:
         | The reproducible builds effort means same input (code,
         | libraries, compiler, flags, build-environment), should give the
         | same output (build artifacts). I don't think RB is trying to
         | get reproducible across platforms and across varying
         | compilers/dependencies -- that is a much bigger and much harder
         | goal.
         | 
         | > There is always going to be a degree of un-reproducibility
         | just due to the nature of math.
         | 
         | Fundamentally the math is deterministic (reproducible): if you
         | do the same sequence of mathematical operations you get the
         | same result. I gather you're getting at the non-associativity
         | of floating point (eg additions), which is a fair point, but if
         | you arrange to do your floating point operations in the same
         | sequence then it will be reproducible.
        
           | bobmcnamara wrote:
           | For these sorts of things, please do not assume hardware is
           | reliable and compilers are bug free.
        
             | kardos wrote:
             | Sounds like a mechanism to test for bad hardware (akin to
             | memtest86) and find compiler bugs =)
        
               | bobmcnamara wrote:
               | Bingo - even if we do everything right the percentage of
               | matching builds isn't expected to be 100%. If it is,
               | there just aren't enough samples.
               | 
               | My first Pentium could usually but not always math
               | correctly until Intel replaced it during a recall.
        
             | 8organicbits wrote:
             | Bug free is less important for reproducible builds, so long
             | as the bugs are reproducible. Compiler bugs are of course
             | still a problem for other reasons.
        
               | bobmcnamara wrote:
               | My point is only that even when the software is correct,
               | the reproducible build rate will either be 100% or we've
               | got a representative sample set.
               | 
               | Until then, those issues will continue to be lumped
               | together with all the software reasons it wasn't
               | reproducible.
        
           | orbital-decay wrote:
           | Multithreaded code can easily be non-deterministic, with
           | things like OS preemption leaking into the sandbox.
           | Guaranteed determinism is hard, it needs a deterministic
           | emulator.
        
             | Quekid5 wrote:
             | Not really for building code, at least. It's a common
             | _problem_ , but it's not insurmountable. Use the equivalent
             | of 'make -j1' or whatever...
             | 
             | This requires engineering on the part of compiler writers,
             | but ultimately is solvable, given enough funding.
             | 
             | (About the compiler writers part: I'm thinking about GHC
             | Haskell and the like... the Clangs/GCCs of the world will
             | have no problems because of Translation Units. Things may
             | change when Modules are a practical choice.)
        
       | alganet wrote:
       | > (...) rebuilding packages using a bootsrappable build process,
       | both seem orthogonal to the idempotent rebuild problem
       | 
       | You know what would be awesome? If someone could start from,
       | let's say, live-bootstrap[1] and build towards matching the
       | checksums for some distro kernel+toolchain.
       | 
       | It sounds like the same kind of problem, it all comes down to
       | knowing what build conditions affect the resulting binaries, so I
       | think you nailed the problem description on this and yes, it all
       | feels very orthogonal from that perspective!
       | 
       | Thanks for writing this blog entry!
       | 
       | [1]: http://github.com/fosslinux/live-bootstrap
        
       | simpaticoder wrote:
       | Idempotent, deterministic builds are an argument in favor of
       | synthetic virtual machines. Synthetic VMs like the JVM or CLR
       | should, at least insofar as they don't contain native code,
       | execute in a manner largely isolated from the vagaries of minor
       | hardware/OS differences. Not an expert, but native VMs do not and
       | cannot isolate processes from hardware details (e.g. Xen, Virtual
       | Box), or from OS details (e.g. Docker, containerd).
        
         | saagarjha wrote:
         | This is most true of software VMs too, which leak information
         | via side channels.
        
       | nixosbestos wrote:
       | Men will do anything except go to therapy ^H^H^H^H^H^H^H^H^H^H
       | learn Nix.
        
         | gaganyaan wrote:
         | It's odd that the article didn't mention nix, at least to say
         | "We didn't like it for this reason".
        
           | hansvm wrote:
           | Nix's definition of "reproducible" is different from what the
           | author says they want though. Nix gives you a similar system
           | from a given config, when it works, where the author wants
           | bitwise identical results (even when compiled at a much later
           | point in time and on a different machine, which you can't
           | accomplish just from caching (unless the caches are remotely
           | hosted)).
           | 
           | The reason they don't like Nix should be obvious; it
           | blatantly and vocally doesn't even try to do the thing the
           | authors want. It'd be just as relevant to bring up Gentoo or
           | something -- another fine project, like Nix, but not
           | especially helpful for the stated problem, and so obviously
           | so that it shouldn't be worth paying lip service to in the
           | article.
        
             | narrowbyte wrote:
             | "Doesn't even try" is too strong.
             | 
             | "When compiling from the same source on independent
             | infrastructure yields bit-by-bit identical results, this
             | gives confidence that the build infrastructure was not
             | compromised and the artifact really does correspond to the
             | source." - https://reproducible.nixos.org/
        
             | nixosbestos wrote:
             | >Nix gives you a similar system from a given config, when
             | it works
             | 
             | Not going to lie, I either don't understand what you're
             | saying, or don't understand the shade you're trying to
             | throw.
             | 
             | Nix is going to do as good, or better a job, than any other
             | solution other there. Probably with a lot less duck-tape.
             | 
             | Not to mention that much of Nix packages are reliably bit-
             | for-bit reproducible.
             | 
             | Not to not to mention that bit-for-bit reproducability is
             | really only beneficial for "trust", not reliability or re-
             | playability.
             | 
             | And no, comparing Nix to Gentoo in this case only very much
             | confirms you don't understand Nix. Maybe you should check
             | out the talk from a past NixCon where the speaker rebuilds
             | Firefox from 10 years ago and boots up Flash swfs.
        
             | abathur wrote:
             | [delayed]
        
       | xiande04 wrote:
       | Nix?
        
       | spyspy wrote:
       | I spent some time trying to prove two go binaries were the same
       | in the name of reproducible builds but couldn't figure out if it
       | was possible, even though I had built both myself and knew they
       | were in effect the exact same. Go binaries have some sort of
       | randomness (timestamp? Map entry? No idea) that I couldn't pin
       | down. Sometimes the hash of the binaries were the same and
       | sometimes they weren't. Short of cataloguing and hashing every
       | file that went into the build I couldn't figure it out and gave
       | up.
        
         | aleksi wrote:
         | It is possible to make Go builds reproducible. The first step
         | might be to inspect two binaries with the diffoscope
         | (https://diffoscope.org).
        
       | compiler-guy wrote:
       | For whatever it is worth, Google internal builds using the
       | internal version of bazel are deterministic and reproducible. And
       | google spends a lot of time and effort keeping them that way. You
       | do have to ensure that nothing ever sorts based on pointer value,
       | for example.
       | 
       | Clang works fine as a compiler for this--there is nothing in it
       | that normally produces different results due to timing or
       | whatever. When something does leak in, we fix it upstream. You do
       | have to ensure that no one uses __DATE__ or similar macros, or
       | that you redefine them to a known value on the command line.
        
       ___________________________________________________________________
       (page generated 2024-07-10 23:01 UTC)