[HN Gopher] Why does an extraneous build step make my Zig app 10...
       ___________________________________________________________________
        
       Why does an extraneous build step make my Zig app 10x faster?
        
       Author : ojosilva
       Score  : 219 points
       Date   : 2024-03-20 09:18 UTC (13 hours ago)
        
 (HTM) web link (mtlynch.io)
 (TXT) w3m dump (mtlynch.io)
        
       | Rygian wrote:
        | The TL;DR is that the build step masks the wait for input from a
       | shell pipe. With a side dish of "do buffered input" and then a
       | small "avoid memory allocation for fun."
        
       | FreeFull wrote:
       | You can easily hit a similar problem in other languages too. For
       | example, in Rust, std::fs::File isn't buffered, so reading single
       | bytes from it will also be rather slow.
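[Editor's sketch] The same unbuffered-vs-buffered contrast, shown in Python rather than Rust (file size and names are arbitrary): one `os.read` call per byte means one syscall per byte, while the default `open()` inserts a buffer that turns thousands of tiny logical reads into a few large kernel reads.

```python
import os
import tempfile

# Create a small test file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 10_000)
    path = f.name

# Unbuffered: one read(2) syscall per byte, like reading single bytes
# from a raw, unbuffered file handle.
fd = os.open(path, os.O_RDONLY)
unbuffered = bytearray()
while chunk := os.read(fd, 1):  # ~10,000 syscalls
    unbuffered += chunk
os.close(fd)

# Buffered: open() wraps the descriptor in a buffer, so the kernel is
# asked for large chunks no matter how small each logical read is.
with open(path, "rb") as f:  # a handful of syscalls
    buffered = f.read()

os.unlink(path)
assert bytes(unbuffered) == buffered
```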
        
       | aforty wrote:
       | There is general wisdom about bash pipelines here that I think
       | most people will miss simply because of the title. Interesting
       | though, my mental model of bash piping was wrong too.
        
         | alas44 wrote:
          | That makes two of us!
        
         | Joker_vD wrote:
         | There were several reasons why pipes were added to Unix, and
         | the ability to run producer/consumer processes concurrently was
         | one of them. Before that (and for many years after on non-Unix
          | systems) the most prevalent paradigm was indeed to run
          | multi-stage pipelines with the moral equivalent of the following:
          |     stage1.exe /in:input.dat /out:stage1.dat
          |     stage2.exe /in:stage1.dat /out:stage2.dat
          |     del stage1.dat
          |     stage3.exe /in:stage2.dat /out:result.dat
          |     del stage2.dat
        
           | jakogut wrote:
           | Pipes are so useful. I find myself more and more using shell
           | script and pipes for complex multi-stage tasks. This also
           | simplifies any non-shell code I must write, as there are
           | already high quality, performant implementations of hashing
           | and compression algorithms I can just pipe to.
        
             | number6 wrote:
             | "The programmer scoffed at Master Foo and rose to depart.
             | But Master Foo nodded to his student Nubi, who wrote a line
             | of shell script on a nearby whiteboard, and said: "Master
             | programmer, consider this pipeline. Implemented in pure C,
             | would it not span ten thousand lines?""
             | 
             | http://catb.org/~esr/writings/unix-koans/ten-thousand.html
        
               | keybored wrote:
               | Ugh. I don't feel that the spirit of those satirical Zen
               | Koans is to be so self-congratulatory.
        
             | capitol_ wrote:
             | What programming language do you use where there isn't
             | performant hashing/compression algorithms implemented as
             | libraries?
        
               | jvanderbot wrote:
               | Well they all do, but in terms of ease of use, tar and
               | zip are much simpler to implement in a cli pipeline than
               | to write bespoke code. At least that has been my
               | experience.
        
               | jerf wrote:
               | It is hard to compete with "| gzip" in any programming
               | language. Just importing a library and you're already
               | well past that. Just typing "import" and you're tied!
               | Overbudget if I drop the space in "| gzip".
               | 
               | This is one of the reasons why, for all its faults, shell
               | just isn't going anywhere any time soon.
        
               | deathanatos wrote:
               | It _is_ hard to compete with.
               | 
               | You can also (assuming your language supports it),
               | execute gzip, and assuming your language gives you some
               | writable-handle to the pipe, then write data into it. So,
               | you get the concurrency "for free", but you don't have to
               | go all the way to "do all of it in process".
               | 
                | I've also done the "trick" of executing [bash, -c,
                | <stuff>] from a higher-level language, too. I'd rather
                | see the work better suited to the higher-level language
                | done there, but if shell is easier, then so be it.
               | 
               | It's sort of like unsafe blocks: minimize the shell to a
               | reasonable portion, clearly define the inputs/outputs,
               | and make sure you're not vulnerable to shell-isms, as
               | best as you can, at the boundary.
               | 
               | But I still think I see the reverse far more often. Like,
               | `steam` is ... all the time, apparently ... exec'ing a
               | shell to then exec ... xdg-user-dir? (And the error seems
               | to indicate that that's it...) Which seems more like the
               | sort of "you could just exec this yourself?". (But Steam
               | is also mostly a web-app of sorts, so, for all I know
               | there's JS under there, and I think node is one of those
               | "makes exec(2) hard/impossible" langs.)
        
               | jvanderbot wrote:
                |     import os
                |     import os.subprocess  # is that right?
                |     subprocess.execute(f'tar cvzf t.tar.gz {' '.join(list_of_files)}')
               | 
               | Did I do that right?
               | 
               | or was it
               | 
               | `tar cvzf t.tar.gz *`
        
               | Joker_vD wrote:
                |     import subprocess
                |     subprocess.run(['tar', 'cvzf', 't.tar.gz', *list_of_files])
                | 
                | or indeed
                | 
                |     import os, subprocess
                |     subprocess.run(['tar', 'cvzf', 't.tar.gz',
                |                     *(f.path for f in os.scandir('.'))])
                | 
                | if you need files from the current directory
        
             | jvanderbot wrote:
             | My biggest annoyance is when I get some tooling from some
             | other team, and they're like "oh just extend this Python
             | script". It'll operate on local files, using shell
             | commands, in a non-reentrant way, with only customization
             | from commenting out code. Maybe there's some argparse but
             | you end up writing a program using their giant args as
             | primitives.
             | 
             | Guys just write small programs and chain them. The wisdom
             | of the ancients is continuously lost.
        
               | anonymous-panda wrote:
                | I would recommend the python sh module instead of
                | writing bash for more complex code. Python's dev
                | environment and tooling are much more mature and safer.
        
               | hiccuphippo wrote:
                | Python comes with a built-in module called fileinput
                | that makes this very easy. It reads the files named in
                | sys.argv[1:], or stdin if the list is empty or a
                | filename is a dash.
               | 
               | https://docs.python.org/3/library/fileinput.html
        
               | dan_mctree wrote:
                | It's just a preference thing; I loathe the small-program
                | chaining style and cannot work with it at all. Give me a
                | python script and I'm good, though. I can't for the life
                | of me imagine why people would want to do pseudo
                | programming through piping magic when chaining is so
                | limited compared to actual programming.
        
               | dgfitz wrote:
               | Chaining pipes in python is quite obnoxious.
        
               | samatman wrote:
               | This is of course a false dichotomy, there's nothing
               | pseudo about using bash (perhaps you mean sudo?) and bash
               | scripts orchestrate what you call 'actual' programs.
               | 
                | I commonly write little python scripts to filter logs,
                | which I have read from stdin. That means I can filter a
                | log to stdout:
                | 
                |     cat logfile.log | python parse_logs.py
                | 
                | Or filter them as they're generated:
                | 
                |     tail -f logfile.log | python parse_logs.py
                | 
                | Or write the filtered output to a file:
                | 
                |     cat logfile.log | python parse_logs.py > filtered.log
                | 
                | Or both:
                | 
                |     tail -f logfile.log | python parse_logs.py | tee filtered.log
               | 
               | It would be possible, I suppose, to configure a single
               | python script to do all those things, with flags or
               | whatever.
               | 
               | But who on Earth has the time for that?
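[Editor's sketch] A minimal filter in the spirit of the `parse_logs.py` described above. The script name and the "ERROR" needle are hypothetical, not from the thread; the point is only the stdin-in, stdout-out shape that composes with cat, tail -f, tee, and redirects.

```python
import sys

def filter_lines(lines, needle="ERROR"):
    """Yield only the lines that contain the needle."""
    for line in lines:
        if needle in line:
            yield line

if __name__ == "__main__":
    # stdin in, stdout out: every variant in the comment above works
    # unchanged, because the plumbing lives outside the script.
    for line in filter_lines(sys.stdin):
        sys.stdout.write(line)
```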
        
           | fuzztester wrote:
           | Sometimes you want the intermediate files as well, though.
           | For example, if doing some kind of exploratory analysis of
           | the different output stages of the pipeline, or even just for
           | debugging.
           | 
           | Tee can be useful for that. Maybe pv (pipe viewer) too. I
           | have not tried it yet.
        
         | adql wrote:
          | ...how? It's called a pipe, not "infinitely large buffer that
          | will wait indefinitely till the command ends to pass its
          | output further".
        
           | zamfi wrote:
           | Can't speak for OP, but one might reasonably expect later
           | stages to only start execution once _at least some_ data is
           | available--rather than immediately, before any data is
           | available for them to consume.
           | 
            | Of course, there are many reasons you wouldn't want
            | this--processes can take time to start up, for
            | example--but it's not an unreasonable mental model.
        
             | OJFord wrote:
             | Not even that they might be particularly slow to start in
             | absolute terms, but just that they might be slow relative
             | to how fast the previous stage starts cranking out some
             | input for it.
             | 
             | (Since, as GP said, not an infinite buffer.)
        
             | Joker_vD wrote:
              | Well, it could be implemented like this; it's just more
              | cumbersome than "create N-1 anonymous pipes, fork N
              | processes, wait for the last process to finish": at the
              | very least you'll need to select() on the last unattached
              | pipe, and when job control comes into the picture, you'd
              | really like the "setting up the pipeline" and "monitoring
              | the pipeline's execution" parts to be disentangled.
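[Editor's sketch] The "create pipes, start every process, wait on the last one" shape is easy to see in Python's subprocess module. This is an illustrative example, not from the thread; it assumes `echo` and `tr` binaries are on PATH.

```python
import subprocess

# Both stages start immediately and run concurrently, connected by an
# anonymous OS pipe; we then wait only on the last process.
p1 = subprocess.Popen(["echo", "hello world"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tr", "a-z", "A-Z"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let p2 own the read end of the pipe

out, _ = p2.communicate()
assert out == b"HELLO WORLD\n"
```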
        
           | hawski wrote:
            | I have known this about Unix pipes for a very long time.
            | Whenever they are introduced it is always mentioned, but I
            | guess people can miss it.
            | 
            | Though now I will break your mind as mine was broken not
            | long ago. PowerShell, which is often said to be a better
            | shell, works like that: it doesn't run things in parallel.
            | I think the same is true of Windows cmd/batch, but don't
            | cite me on that. That one thing makes PowerShell
            | insufficient to ever be a full replacement for a proper
            | shell.
        
             | MatejKafka wrote:
             | Not exactly. Non-native PowerShell pipelines are executed
             | in a single thread, but the steps are interleaved, not
             | buffered. That is, each object is passed through the whole
             | pipeline before the next object is processed. This is non-
             | ideal for high-performance data processing (e.g. `cat`ing a
             | 10GB file, searching through it and gzipping the output),
             | but for 99% of daily commands, it does not make any
             | difference.
             | 
             | cmd.exe uses standard OS pipes and behaves the same as UNIX
             | shells, same as Powershell invoking native binaries.
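[Editor's sketch] The interleaving MatejKafka describes (each object passes through the whole pipeline before the next is processed) is the same behavior Python generators give you. This is an illustrative analogy, not PowerShell internals:

```python
# Record the order in which each stage touches each item.
events = []

def produce(n):
    for i in range(n):
        events.append(("produce", i))
        yield i

def double(items):
    for i in items:
        events.append(("double", i))
        yield i * 2

result = list(double(produce(3)))
assert result == [0, 2, 4]
# Each item flows through the whole chain before the next is produced:
assert events == [
    ("produce", 0), ("double", 0),
    ("produce", 1), ("double", 1),
    ("produce", 2), ("double", 2),
]
```

Single-threaded, yet streaming: no stage runs to completion before the next one starts consuming.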
        
               | hawski wrote:
               | Oh, that's what I missed! I managed to find out about it
               | while trying to do an equivalent of `curl ... | tar xzf
               | -` in Powershell. I was stumped. I guess the thing is
               | that a Unix shell would do a subshell automatically.
        
             | poizan42 wrote:
              | > Though now I will break your mind as mine was broken
              | not long ago. PowerShell, which is often said to be a
              | better shell, works like that: it doesn't run things in
              | parallel. I think the same is true of Windows cmd/batch,
              | but don't cite me on that. That one thing makes
              | PowerShell insufficient to ever be a full replacement for
              | a proper shell.
             | 
              | A pipeline in PowerShell is definitely streaming unless
              | you accidentally force the output into a list/array at
              | some point; e.g. try this for yourself (somewhere you can
              | interrupt the script, obviously, as it's going to run
              | forever):
              | 
              |     class InfiniteEnumerator : System.Collections.IEnumerator
              |     {
              |         hidden [ulong]$countMod2e64 = 0
              | 
              |         [object] get_Current()
              |         {
              |             return $this.countMod2e64
              |         }
              | 
              |         [bool] MoveNext() {
              |             $this.countMod2e64 += 1
              |             return $true
              |         }
              | 
              |         Reset()
              |         {
              |             $this.countMod2e64 = 0
              |         }
              |     }
              | 
              |     class InfiniteEnumerable : System.Collections.IEnumerable {
              |         InfiniteEnumerable() {}
              |         [System.Collections.IEnumerator] GetEnumerator() {
              |             return [InfiniteEnumerator]::new()
              |         }
              |     }
              | 
              |     [InfiniteEnumerable]::new() | ForEach-Object {
              |         Write-Host "Element number mod 2^64: $_"
              |     }
              | 
              | Whether it runs in parallel depends on the implementation
              | of each side. Interpreted PowerShell code does not run in
              | parallel unless you run it as a job, use ForEach-Object
              | -Parallel, or explicitly put it on another thread. But
              | the data is not collected together before being sent from
              | one step to the next.
        
               | MatejKafka wrote:
                | More compact example (not to scare the POSIX people
                | away :) ):
                | 
                |     0..1000000 | where {$_ % 10 -eq 0} | foreach {"Got Value: $_"}
        
               | poizan42 wrote:
                | The streaming behavior of the range operator is weird
                | though. This is tested on PowerShell 7.4.1:
                | 
                |     > 0..1000000000 | % { $_ }
                |     # Starts printing out numbers immediately
                |     > 0..1000000000
                |     # Hangs longer than I had patience to wait for
                |     > $x = 0..100
                |     > $x.GetType()
                |     # IsPublic IsSerial Name     BaseType
                |     # -------- -------- ----     --------
                |     # True     True     Object[] System.Array
                | 
                | It's an array when I save it in a variable, but it's
                | obviously not an array on the LHS of a pipe.
        
           | lylejantzi3rd wrote:
           | Pipe, |, was also commonly used as an "OR" operator. I wonder
           | if the idea that you could "pipe" data between commands came
           | later.
        
             | hawski wrote:
             | I think the math usage was first. i.e. absolute value: |x|
        
               | adrian_b wrote:
               | The language APL\360 of IBM (August 1968) and the other
               | APL dialects that have followed it have used a single "|"
               | as a monadic prefix operator that computes the absolute
               | value and also as a dyadic infix operator that computes
               | the remainder of the division (but with the operand order
               | reversed in comparison with the language C, which is
               | usually much more convenient, especially in APL, where
               | this order avoids the need for parentheses in most
               | cases).
        
               | samatman wrote:
               | Not to get all semiotic about it, but |x| notation is a
                | pair of vertical lines. I'm sure that _someone_ has
                | written a calculator program where two 0x7C characters
                | bracketing a symbol means absolute value, but if I've
                | ever seen it, I can't recall.
               | 
                | Although 0x7C is overly specific, since if a sibling
                | comment is correct (I have no reason to think
                | otherwise), | for bitwise OR originates in PL/I, where
                | it would have been encoded in EBCDIC, which codes it as
                | 0x4F.
               | 
               | I'm not really disagreeing with you, the |abs| notation
               | is quite a bit older than computers, just musing on what
               | should count as the first use of "|". I'm inclined to say
               | that it should go to the first use of _an_ encoding of
               | "|", not to the similarly-appearing pen and paper
                | notation, and definitely not the first use of ASCII "|"
                | aka 0x7C in a programming language. But I don't think
                | there's a right answer here; it's a matter of taste.
               | 
               | Because one could argue back to the Roman numeral I, if
               | one were determined to do so: when written sans serif,
               | it's just a vertical line, after all. Somehow, abs
               | notation and "first use of an encoded vertical bar" both
               | seem reasonable, while the Roman numeral and
               | specifically-ASCII don't, but I doubt I can unpack that
               | intuition in any detail.
        
             | adrian_b wrote:
              | The character "|" was introduced into programming in the
              | language NPL at IBM in December 1964 as a notation for
              | bitwise OR, replacing ".OR.", which had been used by IBM
              | in its previous programming language, FORTRAN IV (OR was
              | written between dots to distinguish it from identifiers,
              | marking it as an operator).
             | 
              | The next year the experimental NPL (New Programming
              | Language) was rebranded as PL/I and became a commercial
              | product of IBM.
             | 
             | Following PL/I, other programming languages have begun to
             | use "&" and "|" for AND and OR, including the B language,
             | the predecessor of C.
             | 
              | The pipe and its notation were introduced in the Third
              | Edition of UNIX (based on a proposal made by M. D.
              | McIlroy), in 1972, so after the language B had been used
              | for a few years and before the development of C. The
              | oldest documentation about pipes that I have seen is in
              | "UNIX Programmer's Manual, Third Edition" from February
              | 1973.
             | 
             | Before NPL, the vertical bar had already been used in the
             | Backus-Naur notation introduced in the report about ALGOL
             | 60 as a separator between alternatives in the description
             | of the grammar of the language, so with a meaning somewhat
             | similar to OR.
        
               | hollerith wrote:
               | >as a notation for bitwise OR, replacing ".OR.", which
               | had been used by IBM in its previous programming
               | language, "FORTRAN IV".
               | 
                | Untrue: ".OR." in FORTRAN meant ordinary OR, not
                | _bitwise_ OR. I don't remember ever seeing bitwise OR,
                | AND, or XOR in FORTRAN IV.
        
               | adrian_b wrote:
               | That is right, but I did not want to provide too many
               | details that did not belong to the topic.
               | 
               | FORTRAN IV did not have bit strings, it had only Boolean
               | values ("LOGICAL").
               | 
               | Therefore all the logical operators could be applied only
               | to Boolean operands, giving a Boolean result.
               | 
               | The same was true for all earlier high-level programming
               | languages.
               | 
                | The language NPL, renamed PL/I in 1965, was the first
                | high-level programming language to introduce bit-string
                | values, so the AND, OR and NOT operators could operate
                | on bit strings, not only on single Boolean values.
                | 
                | If PL/I had remained restricted to the smaller character
                | set accepted by FORTRAN IV in source texts, it would
                | have retained the FORTRAN IV operators ".NOT.", ".AND."
                | and ".OR.", extending their meaning to bit-string
                | operators.
                | 
                | However, IBM decided to extend the character set, which
                | allowed the use of dedicated symbols for the logical
                | operators and also for other operators that previously
                | had to use keywords, like the relational operators, and
                | for new operators introduced by PL/I, like the
                | concatenation operator.
        
           | arp242 wrote:
           | Usually mental models develop "organically" from when one was
           | a n00b, without much thought, and sometimes it can take a
           | long time for them to be unseated, even though it's kind of
           | obvious in hindsight that the mental model is wrong (e.g. one
           | can see that from "slow-program | less", and things like
           | that).
        
             | maicro wrote:
             | I think a main reason for this is that you can have a "good
             | enough" working mental model of a process, that holds up to
             | your typical use cases and even moderate scrutiny. It's
             | often only once you run into a case where your mental model
             | fails that you even think to challenge the assumptions it
             | was built on - at least, this has been my experience.
        
           | m000 wrote:
            | That is called a sponge!
            | 
            |     SPONGE(1)               moreutils               SPONGE(1)
            | 
            |     NAME
            |            sponge - soak up standard input and write to a file
            | 
            |     SYNOPSIS
            |            sed '...' file | grep '...' | sponge [-a] file
            | 
            |     DESCRIPTION
            |            sponge reads standard input and writes it out to
            |            the specified file. Unlike a shell redirect,
            |            sponge soaks up all its input before writing the
            |            output file. This allows constructing pipelines
            |            that read from and write to the same file.
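[Editor's sketch] The sponge behavior can be sketched in a few lines of Python. This is illustrative, not moreutils' actual C implementation (which also supports appending and atomic replacement):

```python
def sponge(in_stream, path):
    """Soak up ALL of in_stream before opening (and truncating) path."""
    data = in_stream.read()      # absorb everything first
    with open(path, "wb") as f:  # only now overwrite the target
        f.write(data)
```

Because the output file is opened only after the input is exhausted, a pipeline can safely read from and write to the same file, unlike a plain `> file` redirect, which truncates the file before the pipeline reads it.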
        
           | shiomiru wrote:
           | DOS also has a "pipe", which works exactly like that.
           | (Obviously, since DOS can't run multiple programs in
           | parallel.)
        
       | Symmetry wrote:
       | My first guess involved caching but I was thinking about whether
       | the binary itself had to be read from disk or was already cached
       | in RAM. Great linux-fu post.
        
       | OJFord wrote:
       | I was so confused about why this mattered/made _such_ a
       | difference - until I went back and re-read from the top: OP does
       | the benchmark timing in `main`, in the Zig app under test.
       | 
        | If you don't do that, if you use the `time` CLI for example,
        | this wouldn't have been a problem in the first place. Though,
        | sure, then you couldn't have compared compiling fresh and
        | running; and at least on small inputs you'd have wanted to do
        | the input prep first anyway.
       | 
       | But I think if you put the benchmark code inside the DUT you're
       | setting yourself up for all kinds of gotchas like this.
        
       | hawski wrote:
        | I don't want to belittle the author, but I am surprised that
        | people using a low-level language on Linux wouldn't know how
        | Unix pipelines work, or that reading one byte per syscall is
        | quite inefficient. I understand that the author is still
        | learning (aren't we all?), but it just felt like pretty
        | fundamental knowledge. At the same time, the author managed to
        | get better performance than the official implementation. I
        | guess many things feel fundamental in retrospect.
        
         | mtlynch wrote:
         | Author here.
         | 
         | Thanks for reading!
         | 
          | > _I am surprised that people using a low-level language on
          | Linux wouldn't know ... that reading one byte per syscall is
          | quite inefficient._
         | 
         | In my defense, it wasn't that I didn't realize one byte per
         | syscall was inefficient; it was that I didn't realize that I
         | was doing one syscall per byte read.
         | 
         | I'm coming back to low-level programming after 8ish years of
         | Go/Python/JS, so I wasn't really registering that I'd forgotten
         | to layer in a buffered reader on top of stdin's reader.
         | 
         | Alex Kladov (matklad) made an interesting point on the Ziggit
         | thread[0] that the Zig standard library could adjust the API to
         | make this kind of mistake less likely:
         | 
         | > _I'd say readByte is a flawed API to have on a Reader. While
         | you technically can read a byte-at-time from something like TCP
         | socket, it just doesn't make sense. The reader should only
         | allow reading into a slice._
         | 
         | > _Byte-oriented API belongs to a buffered reader._
         | 
         | [0] https://ziggit.dev/t/zig-build-run-is-10x-faster-than-
         | compil...
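[Editor's sketch] matklad's point (byte-oriented APIs belong on a buffered reader, not on a raw one) can be sketched in Python. The class and method names here are illustrative, not Zig's actual API:

```python
class BufferedByteReader:
    """Byte-at-a-time API that lives only on a buffering wrapper."""

    def __init__(self, raw, bufsize=4096):
        self._raw = raw      # anything with a .read(n) method
        self._buf = b""
        self._pos = 0
        self._bufsize = bufsize

    def read_byte(self):
        """Return the next byte as an int, or None at end of input."""
        if self._pos >= len(self._buf):
            self._buf = self._raw.read(self._bufsize)  # one large read
            self._pos = 0
            if not self._buf:
                return None
        b = self._buf[self._pos]
        self._pos += 1
        return b
```

If `read_byte` exists only on the wrapper, "one syscall per byte" becomes hard to write by accident: the raw reader is asked for large chunks no matter how the caller consumes them.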
        
           | joelfried wrote:
           | Thank you for sharing it!
           | 
           | Articles like this are how one learns the nuances of such
           | things, and it's good for people to keep putting them out
           | there.
        
           | hawski wrote:
           | Zig certainly needs more work. That part is more on
           | familiarity with Zig and how intuitive it is or isn't.
           | 
           | In any case I would recommend anyone investigating things
           | like that to run things through strace. It is often my first
           | step in trying to understand what happens with anything -
           | like a cryptic error "No such file or directory" without
           | telling me what a thing tried to access. You would run:
           | 
            | $ strace -f -o strace.log sh -c 'your | pipeline | here'
           | 
           | You could then track things easily and see what is really
           | happening.
           | 
           | Cheers!
        
             | mtlynch wrote:
             | Thanks for the tip! I don't have experience with strace,
             | and I'm wondering if I'm misunderstanding what you're
             | saying.
             | 
              | I tried running your suggested command, and I got 2,800
              | lines of output like this:
              | 
              |     execve("/nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-bash-5.2-p15/bin/sh",
              |         ["sh", "-c", "echo \"60016000526001601ff3\" | xx"...],
              |         0x7fffffffcf38 /* 106 vars */) = 0
              |     brk(NULL)                               = 0x4ea000
              |     arch_prctl(0x3001 /* ARCH_??? */, 0x7fffffffcdd0) = -1 EINVAL (Invalid argument)
              |     access("/etc/ld-nix.so.preload", R_OK) = -1 ENOENT (No such file or directory)
              |     openat(AT_FDCWD, "/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libdl.so.2",
              |         O_RDONLY|O_CLOEXEC) = 3
              |     read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
              |     newfstatat(3, "", {st_mode=S_IFREG|0555, st_size=15688, ...}, AT_EMPTY_PATH) = 0
              |     mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fc2000
              |     mmap(NULL, 16400, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7fbd000
              |     mmap(0x7ffff7fbe000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7ffff7fbe000
             | 
             | Am I doing it wrong? Or is there more training involved
             | before one could usefully integrate this into debugging?
             | Because to me, the output is pretty inscrutable.
        
               | hawski wrote:
               | There is a lot of output here, but you can grep around or
               | filter with strace CLI. If you used -f option you should
               | get PID numbers later on. Then you can look for all
               | execve's to see how PIDs map to parts of the pipeline.
               | For now maybe grep the log file with something like:
               | "grep -e clone -e execve -e write -e read". You can do
               | this with strace CLI, but I never remember the syntax and
               | usually analyze the log extensively.
               | 
               | I think something like this could work:
               | strace -f -e execve,clone,write,read -o strace.log sh -c
               | '...'
               | 
               | Clone is fork, so a creation of a new process, before
               | eventual execve (with echo there will probably be just
               | clone).
        
               | dgoldstein0 wrote:
               | strace tells you every syscall the process under it
               | makes. So very helpful to understanding how a program
               | interacts with the operating system - and I/O as all IO
               | mechanisms are managed by the operating system.
               | 
               | As for how to filter this I'll leave that to the other
               | comments, but I personally would look at the man page or
               | Google around for tips
        
           | amluto wrote:
           | As a very general tip:
           | 
           | > execution time: 438.059us
           | 
            | That's a rather short time. (It's a lot of _cycles_, but
           | there are plenty of things one might do on a computer that
           | take time comparable to this, especially anything involving
           | IO. It's only a small fraction of a disk head seek if you're
           | using spinning disks, and it's only a handful of non-
           | overlapping random accesses even on NVMe.)
           | 
           | So, when you benchmark anything and get a time this short,
           | you should make sure you're benchmarking the right thing.
           | Watch out for fixed costs or correct for them. Run in a loop
           | and see how time varies with iteration count. Consider using
           | a framework that can handle this type of benchmarking.
           | 
           | A result like "this program took half a millisecond to run"
           | doesn't really tell you much about how long any part of it
           | took.
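            | 
            | A minimal shell sketch of the run-in-a-loop idea (using
            | `true` as a stand-in for the binary under test):

```shell
# Time many iterations so that fixed per-invocation costs and timer
# granularity stop dominating the measurement; divide the reported
# total by the iteration count. `true` is a placeholder for the real
# program being benchmarked.
time ( for i in $(seq 1 1000); do true; done )
```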
        
           | vlovich123 wrote:
           | FWIW, the current version of the code (i.e. with the buffered
           | reader), on my machine at least, runs identically fast with
           | and without the tmp file.
           | 
           | Here's a possibly more detailed reason as to why
           | https://news.ycombinator.com/item?id=39764287#39768022.
        
         | nebulous1 wrote:
         | I feel like you are underestimating how many fundamental
         | misunderstandings people can have (including both of us) even
         | though they have deep understanding of adjacent issues.
        
         | 0x000xca0xfe wrote:
         | There are many fundamental things People Should Know(tm), but
         | we are not LLMs that have ingested entire libraries worth of
         | books.
         | 
         | Exploratory programming and being curious about strange effects
         | is a great way to learn the fundamentals. I already knew how
         | pipes and processes work, but I don't know the Ethereum VM. The
         | author now knows both.
        
         | cpuguy83 wrote:
         | It's one of those things that you don't usually need to think
         | about, so you don't.
         | 
         | Not too long ago I hit this same realization with pipes because
         | my "grep ... file | sed > file" (or something of that nature)
          | was racy.
         | 
         | I took the time to think about it and realized "oh I guess
         | that's how pipes would _have_ to be implemented".
        
         | boffinAudio wrote:
         | This deleterious effect is a factor in computing. We deal with
         | it every few years: kids graduate, having ignored the history
         | of their elders in order to focus on the new and cool - and
         | once they hit industry, find that they really, really should
         | have learned POSIX or whatever.
         | 
          | It's not uncommon. As a professional developer I have
          | observed this obfuscation of prior technology countless
          | times, especially with junior devs.
         | 
         | There is a lot to learn. Always. It doesn't ever stop.
        
         | franciscop wrote:
         | This is exactly what surprised me as well. I'm literally now
         | learning in depth WebStreams[1] in JS (vs the traditional Node
         | Streams) and I've seen too many times the comparison of how
         | "pipe() and pipeTo() behave just like Unix's pipes |". Reading
         | this article makes me think this might not be the best
          | comparison, especially since for many webdevs it's the first
          | time approaching some CS concepts. OTOH, the vast majority
         | of webdevs don't really need to learn WebStreams in-depth.
         | 
         | [1] https://exploringjs.com/nodejs-shell-scripting/ch_web-
         | stream...
        
         | LAC-Tech wrote:
          | Most people have gaps somewhere in their knowledge. I learned
          | very early on, as a general superstition, to always try to
          | batch things that deal with the outside world, like file
          | writes, allocations, network requests, etc. But for years I
          | had no idea what a syscall even was.
        
       | 0x000xca0xfe wrote:
       | Also a great reminder to always benchmark with different data set
       | sizes.
        
       | masto wrote:
       | I realize one needs a catchy title and some storytelling to get
       | people to read a blog article, but for a summary of the main
       | points:
       | 
       | * This is not about a build step that makes the app perform
       | better
       | 
       | * The app isn't 10x faster (or faster at all; it's the same
       | binary)
       | 
       | * The author ran a benchmark two ways, one of which inadvertently
       | included the time taken to generate sample input data, because it
       | was coming from a pipe
       | 
       | * Generating the data before starting the program under test
       | fixes the measurement
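        | 
        | A sketch of that fix (assumes xxd is available; wc -c stands
        | in for the actual zig binary under test):

```shell
# Pre-generate the input so the program never blocks on a live pipe;
# the benchmark then measures only the program itself.
INFILE="$(mktemp)"
echo '60016000526001601ff3' | xxd -r -p > "$INFILE"
# ./zig-out/bin/count-bytes < "$INFILE"   # the real program under test
wc -c < "$INFILE"                          # stand-in: input is 10 bytes
rm "$INFILE"
```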
        
         | meowface wrote:
         | Another semi-summary of the core part of the article:
         | 
         | >"echo '60016000526001601ff3' | xxd -r -p | zig build run
         | -Doptimize=ReleaseFast" is much faster than "echo
         | '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes"
         | (compiling + running the program is faster than just running an
         | already-compiled program)
         | 
         | >When you execute the program directly, xxd and count-bytes
         | start at the same time, so the pipe buffer is empty when count-
         | bytes first tries to read from stdin, requiring it to wait
         | until xxd fills it. But when you use zig build run, xxd gets a
         | head start while the program is compiling, so by the time
         | count-bytes reads from stdin, the pipe buffer has been filled.
         | 
         | >Imagine a simple bash pipeline like the following: "./jobA |
         | ./jobB". My mental model was that jobA would start and run to
         | completion and then jobB would start with jobA's output as its
         | input. It turns out that all commands in a bash pipeline start
         | at the same time.
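          | 
          | A minimal demo of that concurrency (not from the article;
          | both subshells announce themselves on stderr):

```shell
# Both sides of a pipeline are spawned immediately: the consumer's
# "started" message appears before the producer has written any data.
( echo "producer started" >&2; sleep 1; echo data ) \
  | ( echo "consumer started" >&2; cat )
```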
        
           | anonymous-panda wrote:
           | That doesn't make sense unless you have only 1 or 2 physical
           | CPUs with contention. In a modern CPU the latter should be
           | faster and I'm left unsatisfied by the correctness of the
           | explanation. Am I just being thick or is there a more
           | plausible explanation?
        
             | DougBTX wrote:
             | It depends on where the timing code is. If the timer starts
             | after all the data has already been loaded, the time
             | recorded will be lower (even if the total time for the
             | whole process is higher).
        
               | anonymous-panda wrote:
               | I'm not following how that would result in a 10x
               | discrepancy. The amount of data we're talking about here
               | is laughably small (it's like 32 bytes or something)
        
               | masklinn wrote:
               | > The amount of data we're talking about here is
               | laughably small
               | 
               | So is the runtime.
        
             | masklinn wrote:
              | The latter is faster in actual CPU time; however, note
              | that in TFA the measurement only starts with the
              | program, it does not start with the start of the
              | pipeline.
             | 
             | Because the compilation time overlaps with the pipes
             | filling up, blocking on the pipe is mostly excluded from
             | the measurement in the former case (by the time the program
             | starts there's enough data in the pipe that the program can
             | slurp a bunch of it, especially reading it byte by byte),
             | but included in the latter.
        
               | anonymous-panda wrote:
               | My hunch is that if you added the buffered reader and
               | kept the original xxd in the pipe you'd see similar
               | timings.
               | 
               | The amount of input data is just laughably small here to
               | result in a huge timing discrepancy.
               | 
               | I wonder if there's an added element where the constant
               | syscalls are reading on a contended mutex and that
               | contention disappears if you delay the start of the
               | program.
        
               | vlovich123 wrote:
               | Good hunch. On my machine (13900k) & zig 0.11, the latest
               | version of the code:
               | 
                | INFILE="$(mktemp)" && echo $INFILE && \
                | echo '60016000526001601ff3' | xxd -r -p > "${INFILE}" && \
                | zig build run -Doptimize=ReleaseFast < "${INFILE}"
                | 
                | > execution time: 27.742us
                | 
                | vs
                | 
                | echo '60016000526001601ff3' | xxd -r -p | \
                | zig build run -Doptimize=ReleaseFast
                | 
                | > execution time: 27.999us
               | 
               | The idea that the overlap of execution here by itself
               | plays a role is nonsensical. The overlap of execution +
               | reading a byte at a time causing kernel mutex contention
               | seems like a more plausible explanation although I would
               | expect someone better knowledgeable (& more motivated)
               | about capturing kernel perf measurements to confirm. If
               | this is the explanation, I'm kind of surprised that there
               | isn't a lock-free path for pipes in the kernel.
        
               | rofrol wrote:
               | This @mtlynch
        
               | vlovich123 wrote:
               | To sanity check myself, I reran this without the buffered
               | reader and still don't see the slow execution time:
               | 
               | > echo '60016000526001601ff3' | xxd -r -p > | zig build
               | run -Doptimize=ReleaseFast
               | 
               | > execution time: 28.889us
               | 
               | So I think my machine config for whatever reason isn't
               | representative of whatever OP is using.
               | 
               | Linux-ck 6.8 CONFIG_NO_HZ=y CONFIG_HZ_1000=y
               | 
               | Intel 13900k
               | 
               | zig 0.11
               | 
               | bash 5.2.26
               | 
               | xxd 2024-02-10
               | 
               | Would be good if someone that can repro it compares the
               | two invocation variants with buffered reader implemented
               | & lists their config.
        
               | mtlynch wrote:
               | Based on what you've shared, the second version can start
               | reading instantly because "INFILE" was populated in the
               | previous test. Did you clear it between tests?
               | 
               | Here are the benchmarks before and after fixing the
               | benchmarking code:
               | 
               | Before: https://output.circle-
               | artifacts.com/output/job/2f6666c1-1165...
               | 
               | After: https://output.circle-
               | artifacts.com/output/job/457cd247-dd7c...
               | 
               | What would explain the drastic performance increase if
               | the pipelining behavior is irrelevant?
        
               | vlovich123 wrote:
                | That was just a typo in the comment. The command run
                | locally was just a straight pipe.
               | 
               | Using both invocation variants, I ran:
               | 
               | 8a5ecac63e44999e14cdf16d5ed689d5770c101f (before buffered
               | changes)
               | 
               | 78188ecbc66af6e5889d14067d4a824081b4f0ad (after buffered
               | changes)
               | 
               | On my machine, they're all equally fast at ~28 us.
               | Clearly the changes only had an impact on machines with a
               | different configuration (kernel version or kernel config
               | or xxd version or hw).
               | 
                | One hypothesis outlined above is that when you
                | pipeline all 3 applications, the single-byte reader
                | version is doing back-to-back syscalls and that's
                | causing contention between your code and xxd on a
                | kernel mutex, leading to things going to sleep extra
                | long.
               | 
                | It's not a strong hypothesis though, just because of
                | how little data there is and the fact that it doesn't
                | repro on my machine. To get a real explanation, I
                | think you have to actually do some profile
                | measurements on a machine that can repro and dig in to
                | obtain a satisfactory explanation of what exactly is
                | causing the problem.
        
         | karmakaze wrote:
          | I would definitely classify the title as _clickbait_ because
          | the app didn't go "10x faster".
        
       | underdeserver wrote:
       | If I were trying to optimize my code, I would start with loading
       | the entire benchmark bytecode to memory, then start the counter.
       | Otherwise I can't be sure how much time is spent reading from a
       | pipe/file to memory, and how much time is spent in my code.
       | 
       | Then I would try to benchmark what happens if it all fits in L1
       | cache, L2, L3, and main memory.
       | 
       | Of course, if the common use case is reading from a file,
       | network, or pipe, maybe you can optimize that, but I would take
       | it step by step.
        
       | jedisct1 wrote:
        | This is exactly the kind of thing that feels obvious once you
        | realize it, but that can be puzzling until you do.
        
       | john-tells-all wrote:
       | This is an excellent writeup, with interesting ideas and clear
       | description of actions taken. My idea of pipelines, also, was
       | flawed. Well done!
       | 
       | Nothing to do with Zig. Just a nice debugging story.
        
       | WalterBright wrote:
       | Back in college, a friend of mine decided to learn how to
       | program. He had never programmed before. He picked up the DEC
       | FORTRAN-10 manual and read it cover to cover.
       | 
       | He then wrote a program that generated some large amount of data
       | and wrote it to a file. Being much smarter than I am, his first
       | program worked the first time.
       | 
       | But it ran terribly slow. Baffled, he showed it to his friend,
       | who exclaimed why are you, in a loop, opening the file, appending
       | one character, and closing the file? That's going to run
       | incredibly slowly. Instead, open the file, write all the data,
       | then close it!
       | 
       | The reply was "the manual didn't say anything about that or how
       | to do I/O efficiently."
        
         | TylerE wrote:
          | I firmly believe that teaching how to idiomatically do both
          | character- and line-oriented file IO should be almost the
          | first thing any language tutorial teaches - just as soon as
          | you've introduced enough syntax.
        
           | PoignardAzur wrote:
            | FWIW, the Epitech curriculum starts your very first C
            | programming
           | lesson by making you write a "my_putchar" function that does
           | a "write" syscall with a single character. Then you spend the
           | next few days learning how to create my_putnbr, my_putstr,
           | etc, using that single building block.
           | 
           | I think that's the right choice, by the way. Baby developers
           | don't need to learn efficient I/O, they need to learn how you
           | make the pile of sand do smart things.
           | 
           | And if you've spent weeks doing I/O one syscall per
           | character, getting to the point you write hundreds of lines
           | that way, the moment some classmate shows you that you can
           | 100x your program's performance by batching I/O gets burned
           | in your memory forever, in a way "I'm doing it because the
           | manual said so" doesn't.
        
             | TylerE wrote:
             | I said idiomatic because that's the form they're going to
             | encounter it in the wild, in library doc, in stack overflow
             | answers.
             | 
                | I don't think the sort of low-level bit banging you
                | propose is a worthwhile use of a student's time, given
                | the vast amount they have to learn that won't be
                | immediately obsolete.
        
               | samatman wrote:
               | I firmly disagree with this. We're talking about learning
               | C, not learning an arbitrary programming language. The
               | course of study the GP comment suggests teaches syscalls,
               | pipes, buffering, and most important, it teaches
               | mechanical sympathy. All of which are things a C
               | programmer needs to understand.
               | 
               | More programming tasks than you might imagine are low-
               | level bit banging, and C remains the language of choice
               | for doing them. It might be Zig one day, and if so, the
               | same sort of deep-end dive into low-level detail will
               | remain a good way to approach such a language.
               | 
               | Far from becoming "rapidly obsolete", learning in this
               | style will prevent this sort of mistake for years into
               | the future: https://news.ycombinator.com/item?id=39766130
        
               | TylerE wrote:
               | "we" are certainly not talking about C. I never mentioned
               | C, nor any language. This was intentional.
        
             | mcguire wrote:
              | This is similar to the presentation in _Software Tools_,
             | IIRC.
        
           | everforward wrote:
           | I would argue that they should be teaching how file IO works
           | at a low level at some point (preferably as one of the first
           | "complicated" bits).
           | 
           | Everybody should, at some early point, interact with basic
           | file descriptors and see how they map to syscalls. Preferably
           | including building their own character and line oriented
           | abstractions on top of that, so they can see how they work.
           | 
           | I'm convinced that IO is in the same category as parallelism;
           | most devs understand it poorly at best, and the ones who do
           | understand it are worth their weight in gold.
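            | 
            | One portable way to see the cost of unbatched I/O is dd's
            | block size (a sketch; the sizes are arbitrary):

```shell
# bs=1 issues one read()/write() pair per byte (~a million syscalls
# for 1 MiB); bs=64k moves the same megabyte in 16 reads. The
# wall-clock gap between the two is almost entirely syscall overhead.
dd if=/dev/zero of=/dev/null bs=1 count=1048576
dd if=/dev/zero of=/dev/null bs=65536 count=16
```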
        
             | TylerE wrote:
             | I do not think the typical language tutorial is well served
             | by trying to teach a condensed version of all of CS.
        
             | WalterBright wrote:
             | I beat benchmark after benchmark in 80s on disk I/O. I was
             | amazed that nobody else figured out what my compiler was
             | doing - using a 16K buffer for the floppy drive rather than
             | 512 bytes.
        
               | pixl97 wrote:
               | Heh, wasn't 16k most of the memory in the machine? Large
               | buffers do have other interesting and fun side effects,
               | though back then you probably didn't have any threads or
               | any/many of the things buffers cause these days.
        
           | WalterBright wrote:
           | The FORTRAN-10 manual was not a tutorial, it was a
           | specification.
        
       | michael1999 wrote:
        | If you want to create something like the pipe behaviour the
        | author expected (buffer all output before sending it to the
        | next command), the sponge command from moreutils can help.
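        | 
        | A sketch of the idea (printf and wc stand in for the two ends
        | of the pipeline; the inline temp-file trick approximates
        | sponge where moreutils isn't installed):

```shell
# sponge soaks up all of its stdin before writing anything out, so the
# downstream command sees no data until the upstream one has finished:
#   ./jobA | sponge | ./jobB
# A crude temp-file stand-in behaves the same way:
printf 'a\nb\n' | { tmp=$(mktemp); cat > "$tmp"; cat "$tmp"; rm "$tmp"; } | wc -l
```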
        
       | styfle wrote:
       | > By adding a benchmarking script to my continuous integration
       | and archiving the results, it was easy for me to identify when my
       | measurements changed.
       | 
       | This assumes CI runs on the same machine with same hardware every
       | time, but most CI doesn't do that.
        
         | boesboes wrote:
         | And that the hardware is not overbooked. I found that my ci/cd
         | runs would vary between 8 and 14 minutes (for a specific task
         | in the pipeline, no cache involved) between reruns.
         | 
         | And it seemed correlated to time of day. So pretty sure they
         | had some contention there.
         | 
          | Edit: and that was with all the same CPUs reported to the
          | OS, at least
        
       | mcguire wrote:
       | There seems to be a small misunderstanding on the behavior of
       | pipes here. All the commands in a bash pipeline do start at the
       | same time, but output goes into the pipeline buffer whenever the
       | writing process writes it. There is no specific point where the
       | "output from jobA is ready".
       | 
       | The author's example code, " _jobA starts, sleeps for three
       | seconds, prints to stdout, sleeps for two more seconds, then
       | exits_ " and " _jobB starts, waits for input on stdin, then
       | prints everything it can read from stdin until stdin closes_ " is
       | measuring 5 seconds not because the input to jobB is not ready
       | until jobA terminates but because jobB is waiting for the pipe to
        | close, which doesn't happen until jobA ends. That explains the
        | timing of the output:
        | 
        |     $ ./jobA | ./jobB
        |     09:11:53.326 jobA is starting
        |     09:11:53.326 jobB is starting
        |     09:11:53.328 jobB is waiting on input
        |     09:11:56.330 jobB read 'result of jobA is...' from input
        |     09:11:58.331 jobA is terminating
        |     09:11:58.331 jobB read '42' from input
        |     09:11:58.333 jobB is done reading input
        |     09:11:58.335 jobB is terminating
       | 
       | The bottom line is that it's important to actually measure what
       | you want to measure.
        
         | mtlynch wrote:
         | Author here.
         | 
         | Thanks for reading!
         | 
         | > _All the commands in a bash pipeline do start at the same
         | time, but output goes into the pipeline buffer whenever the
         | writing process writes it. There is no specific point where the
         | "output from jobA is ready"._
         | 
         | Right, I didn't mean to give the impression that there's a time
         | at which all input from jobA is ready at once. But there is a
         | time when jobB can start reading stdin, and there's a time when
         | jobA closes the handle to its stdout.
         | 
         | The reason I split jobA's output into two commands is to show
         | that jobB starts reading 3 seconds after the command begins,
         | and jobB finishes reading 2 seconds after reading the first
         | output from jobA.
        
       | dsm9000 wrote:
        | This post is another example of why I like Zig so much. It
        | seems to get people talking about performance in a way that
        | helps them learn how things work below today's heavily
        | abstracted veneer.
        
       ___________________________________________________________________
       (page generated 2024-03-20 23:02 UTC)