[HN Gopher] Why does an extraneous build step make my Zig app 10...
___________________________________________________________________
Why does an extraneous build step make my Zig app 10x faster?
Author : ojosilva
Score : 219 points
Date : 2024-03-20 09:18 UTC (13 hours ago)
(HTM) web link (mtlynch.io)
(TXT) w3m dump (mtlynch.io)
| Rygian wrote:
| The TL;DR is that the build step masks the wait for input from a
| shell pipe. With a side dish of "do buffered input" and then a
| small "avoid memory allocation for fun."
| FreeFull wrote:
| You can easily hit a similar problem in other languages too. For
| example, in Rust, std::fs::File isn't buffered, so reading single
| bytes from it will also be rather slow.
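|
| The same trap is easy to reproduce in Python, for what it's worth.
| A rough, self-contained sketch (file path and size are made up;
| exact timings will vary):
|
|     import os
|     import time
|
|     # Create a small test file to read back.
|     path = "/tmp/onebyte-demo.bin"
|     with open(path, "wb") as f:
|         f.write(os.urandom(1_000_000))
|
|     def count_bytes(f):
|         n = 0
|         while f.read(1):  # request one byte per read() call
|             n += 1
|         return n
|
|     # Unbuffered: every read(1) really is a syscall.
|     t0 = time.perf_counter()
|     with open(path, "rb", buffering=0) as f:
|         count_bytes(f)
|     unbuffered = time.perf_counter() - t0
|
|     # Buffered (Python's default): same loop, but syscalls happen in chunks.
|     t0 = time.perf_counter()
|     with open(path, "rb") as f:
|         count_bytes(f)
|     buffered = time.perf_counter() - t0
|
|     print(f"unbuffered: {unbuffered:.3f}s  buffered: {buffered:.3f}s")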
| aforty wrote:
| There is general wisdom about bash pipelines here that I think
| most people will miss simply because of the title. Interesting
| though, my mental model of bash piping was wrong too.
| alas44 wrote:
| That makes two of us!
| Joker_vD wrote:
| There were several reasons why pipes were added to Unix, and
| the ability to run producer/consumer processes concurrently was
| one of them. Before that (and for many years after, on non-Unix
| systems) the most prevalent paradigm was indeed to run multi-
| stage pipelines with the moral equivalent of the following:
|
|     stage1.exe /in:input.dat /out:stage1.dat
|     stage2.exe /in:stage1.dat /out:stage2.dat
|     del stage1.dat
|     stage3.exe /in:stage2.dat /out:result.dat
|     del stage2.dat
| jakogut wrote:
| Pipes are so useful. I find myself more and more using shell
| script and pipes for complex multi-stage tasks. This also
| simplifies any non-shell code I must write, as there are
| already high quality, performant implementations of hashing
| and compression algorithms I can just pipe to.
| number6 wrote:
| "The programmer scoffed at Master Foo and rose to depart.
| But Master Foo nodded to his student Nubi, who wrote a line
| of shell script on a nearby whiteboard, and said: "Master
| programmer, consider this pipeline. Implemented in pure C,
| would it not span ten thousand lines?""
|
| http://catb.org/~esr/writings/unix-koans/ten-thousand.html
| keybored wrote:
| Ugh. I don't feel that the spirit of those satirical Zen
| Koans is to be so self-congratulatory.
| capitol_ wrote:
| What programming language do you use where there isn't
| performant hashing/compression algorithms implemented as
| libraries?
| jvanderbot wrote:
| Well they all do, but in terms of ease of use, tar and
| zip are much simpler to implement in a cli pipeline than
| to write bespoke code. At least that has been my
| experience.
| jerf wrote:
| It is hard to compete with "| gzip" in any programming
| language. Just importing a library and you're already
| well past that. Just typing "import" and you're tied!
| Overbudget if I drop the space in "| gzip".
|
| This is one of the reasons why, for all its faults, shell
| just isn't going anywhere any time soon.
| deathanatos wrote:
| It _is_ hard to compete with.
|
| You can also (assuming your language supports it),
| execute gzip, and assuming your language gives you some
| writable-handle to the pipe, then write data into it. So,
| you get the concurrency "for free", but you don't have to
| go all the way to "do all of it in process".
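|
| A rough sketch of that pattern in Python, assuming gzip is on the
| PATH (the chunk generator is a made-up stand-in for real data):
|
|     import subprocess
|
|     def produce_chunks():
|         # Stand-in for whatever actually generates your data.
|         for i in range(1000):
|             yield f"record {i}\n".encode()
|
|     # gzip runs as a child process; we stream data into its stdin,
|     # and it compresses concurrently while we keep producing.
|     with open("out.gz", "wb") as out:
|         gz = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
|         for chunk in produce_chunks():
|             gz.stdin.write(chunk)
|         gz.stdin.close()  # EOF lets gzip flush and exit
|         gz.wait()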
|
| I've also done the "trick" of executing [bash, -c,
| <stuff>] in a higher-level language, too. I'd personally rather
| see the work better suited to the higher-level language done in
| that language, but if shell is easier, then so be it.
|
| It's sort of like unsafe blocks: minimize the shell to a
| reasonable portion, clearly define the inputs/outputs,
| and make sure you're not vulnerable to shell-isms, as
| best as you can, at the boundary.
|
| But I still think I see the reverse far more often. Like,
| `steam` is ... all the time, apparently ... exec'ing a
| shell to then exec ... xdg-user-dir? (And the error seems
| to indicate that that's it...) Which seems more like the
| sort of "you could just exec this yourself?". (But Steam
| is also mostly a web-app of sorts, so, for all I know
| there's JS under there, and I think node is one of those
| "makes exec(2) hard/impossible" langs.)
| jvanderbot wrote:
|     import os
|     import os.subprocess  # is that right?
|     subprocess.execute(f'tar cvzf t.tar.gz {' '.join(list_of_files)}')
|
| Did I do that right?
|
| or was it
|
|     `tar cvzf t.tar.gz *`
| Joker_vD wrote:
|     import subprocess
|     subprocess.run(['tar', 'cvzf', 't.tar.gz', *list_of_files])
|
| or indeed
|
|     import os, subprocess
|     subprocess.run(['tar', 'cvzf', 't.tar.gz',
|                     *(f.path for f in os.scandir('.'))])
|
| if you need files from the current directory
| jvanderbot wrote:
| My biggest annoyance is when I get some tooling from some
| other team, and they're like "oh just extend this Python
| script". It'll operate on local files, using shell
| commands, in a non-reentrant way, with only customization
| from commenting out code. Maybe there's some argparse but
| you end up writing a program using their giant args as
| primitives.
|
| Guys just write small programs and chain them. The wisdom
| of the ancients is continuously lost.
| anonymous-panda wrote:
| I would recommend the Python sh module instead of writing
| bash for more complex code. Python's dev environment and tooling
| are way more mature and safer.
| hiccuphippo wrote:
| Python comes with a built-in module called fileinput that
| makes this very easy. It reads from the files named in
| sys.argv[1:], or from stdin if the list is empty or a name is a dash.
|
| https://docs.python.org/3/library/fileinput.html
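|
| A minimal sketch of a filter built on it (the upper-casing is just
| a stand-in for real processing):
|
|     import fileinput
|
|     # Reads the files named in sys.argv[1:], or stdin if none are
|     # given (or a name is "-"), so the same script works in a pipeline.
|     for line in fileinput.input():
|         print(line.rstrip("\n").upper())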
| dan_mctree wrote:
| It's just a preference thing; I loathe the small-program
| chaining style and cannot work with it at all. Give me a
| Python script and I'm good, though. I can't for the life
| of me imagine why people would want to do pseudo-programming
| through piping magic when chaining is so limited compared
| to actual programming.
| dgfitz wrote:
| Chaining pipes in python is quite obnoxious.
| samatman wrote:
| This is of course a false dichotomy; there's nothing
| pseudo about using bash (perhaps you mean sudo?), and bash
| scripts orchestrate what you call 'actual' programs.
|
| I commonly write little python scripts to filter logs,
| which I have read from stdin. That means I can filter a
| log to stdout:
|
|     cat logfile.log | python parse_logs.py
|
| Or filter them as they're generated:
|
|     tail -f logfile.log | python parse_logs.py
|
| Or write the filtered output to a file:
|
|     cat logfile.log | python parse_logs.py > filtered.log
|
| Or both:
|
|     tail -f logfile.log | python parse_logs.py | tee filtered.log
|
| It would be possible, I suppose, to configure a single
| python script to do all those things, with flags or
| whatever.
|
| But who on Earth has the time for that?
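|
| (A hypothetical parse_logs.py, to make the shape concrete - just a
| few lines reading stdin and writing matches to stdout; the "ERROR"
| filter is made up:)
|
|     import sys
|
|     # Pass through only lines that look like errors. The same script
|     # works whether stdin is a file, a pipe, or a live `tail -f`.
|     for line in sys.stdin:
|         if "ERROR" in line:
|             sys.stdout.write(line)
|             sys.stdout.flush()  # keep output live when following a log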
| fuzztester wrote:
| Sometimes you want the intermediate files as well, though.
| For example, if doing some kind of exploratory analysis of
| the different output stages of the pipeline, or even just for
| debugging.
|
| Tee can be useful for that. Maybe pv (pipe viewer) too. I
| have not tried it yet.
| adql wrote:
| ...how? It's called a pipe, not "an infinitely large buffer that
| will wait indefinitely till the command ends to pass its output
| further".
| zamfi wrote:
| Can't speak for OP, but one might reasonably expect later
| stages to only start execution once _at least some_ data is
| available--rather than immediately, before any data is
| available for them to consume.
|
| Of course, there are many reasons you wouldn't want this--
| processes can take time to start up, for example--but it's
| not an unreasonable mental model.
| OJFord wrote:
| Not even that they might be particularly slow to start in
| absolute terms, but just that they might be slow relative
| to how fast the previous stage starts cranking out some
| input for it.
|
| (Since, as GP said, not an infinite buffer.)
| Joker_vD wrote:
| Well, it could be implemented like this, it's just more
| cumbersome than "create N-1 anonymous pipes, fork N
| processes, wait for the last process to finish": at the
| very least you'll need to select() on the last unattached
| pipe, and when job control comes into the picture, you'd
| really like the "setting up the pipeline" and
| "monitoring the pipeline's execution" parts to be
| disentangled.
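|
| Roughly the shape of what a shell does, sketched with Python's
| subprocess (an illustrative equivalent of `ls | grep py | wc -l`):
|
|     import subprocess
|
|     # Start all three stages at once; each stage's stdout feeds the
|     # next stage's stdin through an anonymous pipe.
|     p1 = subprocess.Popen(["ls"], stdout=subprocess.PIPE)
|     p2 = subprocess.Popen(["grep", "py"], stdin=p1.stdout, stdout=subprocess.PIPE)
|     p3 = subprocess.Popen(["wc", "-l"], stdin=p2.stdout)
|     p1.stdout.close()  # close the parent's copies of the pipes so that,
|     p2.stdout.close()  # e.g., p1 gets SIGPIPE if p2 exits early
|     p3.wait()          # like a shell, only the last stage is waited on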
| hawski wrote:
| I have known this about Unix pipes for a very long time. Whenever
| they are introduced it is always mentioned, but I guess people can
| miss it.
|
| Though now I will break your mind as mine was broken not
| long ago. PowerShell, which is often said to be a better
| shell, works like that naive model: it doesn't run things in
| parallel. I think the same is true of Windows cmd/batch, but
| don't cite me on that. That one thing makes PowerShell
| insufficient to ever be a full replacement for a proper shell.
| MatejKafka wrote:
| Not exactly. Non-native PowerShell pipelines are executed
| in a single thread, but the steps are interleaved, not
| buffered. That is, each object is passed through the whole
| pipeline before the next object is processed. This is non-
| ideal for high-performance data processing (e.g. `cat`ing a
| 10GB file, searching through it and gzipping the output),
| but for 99% of daily commands, it does not make any
| difference.
|
| cmd.exe uses standard OS pipes and behaves the same as UNIX
| shells, same as Powershell invoking native binaries.
| hawski wrote:
| Oh, that's what I missed! I managed to find out about it
| while trying to do an equivalent of `curl ... | tar xzf
| -` in Powershell. I was stumped. I guess the thing is
| that a Unix shell would do a subshell automatically.
| poizan42 wrote:
| > Though now I will break your mind as my mind was broken
| not a long time ago. Powershell, which is often said to be
| a better shell, works like that. It doesn't run things in
| parallel. I think the same is to be said about Windows
| cmd/batch, but don't cite me on that. That one thing makes
| Powershell insufficient to ever be a full replacement of a
| proper shell.
|
| A pipeline in PowerShell is definitely streaming unless you
| accidentally force the output into a list/array at some
| point. E.g. try this for yourself (somewhere you can
| interrupt the script, obviously, as it's going to run
| forever):
|
|     class InfiniteEnumerator : System.Collections.IEnumerator {
|         hidden [ulong]$countMod2e64 = 0
|         [object] get_Current() {
|             return $this.countMod2e64
|         }
|         [bool] MoveNext() {
|             $this.countMod2e64 += 1
|             return $true
|         }
|         Reset() {
|             $this.countMod2e64 = 0
|         }
|     }
|     class InfiniteEnumerable : System.Collections.IEnumerable {
|         InfiniteEnumerable() {}
|         [System.Collections.IEnumerator] GetEnumerator() {
|             return [InfiniteEnumerator]::new()
|         }
|     }
|     [InfiniteEnumerable]::new() | ForEach-Object { Write-Host "Element number mod 2^64: $_" }
|
| Whether it runs in parallel depends on the implementation
| of each side. Interpreted PowerShell code does not run in
| parallel unless you run it as a job, use ForEach-Object
| -Parallel, or explicitly put it on another thread. But the
| data is not collected together before being sent from one
| step to the next.
| MatejKafka wrote:
| More compact example (not to scare the POSIX people away :) ):
|
|     0..1000000 | where {$_ % 10 -eq 0} | foreach {"Got Value: $_"}
| poizan42 wrote:
| The streaming behavior of the range operator is weird
| though. This is tested on PowerShell 7.4.1:
|
|     > 0..1000000000 | % { $_ }   # Starts printing out numbers immediately
|     > 0..1000000000              # Hangs longer than I had patience to wait for
|     > $x=0..100
|     > $x.GetType()
|     # IsPublic IsSerial Name       BaseType
|     # -------- -------- ----       --------
|     # True     True     Object[]   System.Array
|
| It's an array when I save it in a variable, but it's
| obviously not an array on the LHS of a pipe.
| lylejantzi3rd wrote:
| Pipe, |, was also commonly used as an "OR" operator. I wonder
| if the idea that you could "pipe" data between commands came
| later.
| hawski wrote:
| I think the math usage was first. i.e. absolute value: |x|
| adrian_b wrote:
| The language APL\360 of IBM (August 1968) and the other
| APL dialects that have followed it have used a single "|"
| as a monadic prefix operator that computes the absolute
| value and also as a dyadic infix operator that computes
| the remainder of the division (but with the operand order
| reversed in comparison with the language C, which is
| usually much more convenient, especially in APL, where
| this order avoids the need for parentheses in most
| cases).
| samatman wrote:
| Not to get all semiotic about it, but |x| notation is a
| pair of vertical lines. I'm sure that _someone_ has
| written a calculator program where two 0x7C characters
| bracketing a symbol means absolute value, but if I've
| ever seen it, I can't recall.
|
| Although 0x7C is overly specific, since if a sibling
| comment is correct (I have no reason to think otherwise),
| "|" for bitwise OR originates in PL/I, where it would have
| been encoded in EBCDIC, which codes it as 0x4F.
|
| I'm not really disagreeing with you, the |abs| notation
| is quite a bit older than computers, just musing on what
| should count as the first use of "|". I'm inclined to say
| that it should go to the first use of _an_ encoding of
| "|", not to the similarly-appearing pen and paper
| notation, and definitely not the first use of ASCII "|"
| aka 0x7C in a programming language. But I don't think
| there's a right answer here, it's a matter of taste.
|
| Because one could argue back to the Roman numeral I, if
| one were determined to do so: when written sans serif,
| it's just a vertical line, after all. Somehow, abs
| notation and "first use of an encoded vertical bar" both
| seem reasonable, while the Roman numeral and
| specifically-ASCII don't, but I doubt I can unpack that
| intuition in any detail.
| adrian_b wrote:
| The character "|" has been introduced in computers in the
| language NPL at IBM in December 1964 as a notation for
| bitwise OR, replacing ".OR.", which had been used by IBM in
| its previous programming language, "FORTRAN IV" (OR was
| between dots to distinguish it from identifiers, marking it
| as an operator).
|
| The next year the experimental NPL (New Programming
| Language) has been rebranded as PL/I and it has become a
| commercial product of IBM.
|
| Following PL/I, other programming languages have begun to
| use "&" and "|" for AND and OR, including the B language,
| the predecessor of C.
|
| The pipe and its notation have been introduced in the Third
| Edition of UNIX (based on a proposal made by M. D.
| McIlroy), in 1972, so after the language B had been used
| for a few years and before the development of C. The oldest
| documentation about pipes that I have seen is in "UNIX
| Programmer's Manual Third Edition" from February 1973.
|
| Before NPL, the vertical bar had already been used in the
| Backus-Naur notation introduced in the report about ALGOL
| 60 as a separator between alternatives in the description
| of the grammar of the language, so with a meaning somewhat
| similar to OR.
| hollerith wrote:
| >as a notation for bitwise OR, replacing ".OR.", which
| had been used by IBM in its previous programming
| language, "FORTRAN IV".
|
| Untrue: ".OR." in FORTRAN meant ordinary OR, not
| _bitwise_ OR. I don't remember ever seeing bitwise OR or
| AND or XOR in FORTRAN IV.
| adrian_b wrote:
| That is right, but I did not want to provide too many
| details that did not belong to the topic.
|
| FORTRAN IV did not have bit strings, it had only Boolean
| values ("LOGICAL").
|
| Therefore all the logical operators could be applied only
| to Boolean operands, giving a Boolean result.
|
| The same was true for all earlier high-level programming
| languages.
|
| The language NPL, renamed PL/I in 1965, was the
| first high-level programming language to introduce
| bit string values, so the AND, OR and NOT operators could
| operate on bit strings, not only on single Boolean
| values.
|
| If PL/I had remained restricted to the smaller
| character set accepted by FORTRAN IV in source texts,
| it would have retained the FORTRAN IV operators
| ".NOT.", ".AND.", ".OR.", extending their meaning as bit
| string operators.
|
| However, IBM decided to extend the character set,
| which allowed the use of dedicated symbols for the
| logical operators and also for other operators that
| previously had to use keywords, like the relational
| operators, and also for new operators introduced by PL/I,
| like the concatenation operator.
| arp242 wrote:
| Usually mental models develop "organically" from when one was
| a n00b, without much thought, and sometimes it can take a
| long time for them to be unseated, even though it's kind of
| obvious in hindsight that the mental model is wrong (e.g. one
| can see that from "slow-program | less", and things like
| that).
| maicro wrote:
| I think a main reason for this is that you can have a "good
| enough" working mental model of a process, that holds up to
| your typical use cases and even moderate scrutiny. It's
| often only once you run into a case where your mental model
| fails that you even think to challenge the assumptions it
| was built on - at least, this has been my experience.
| m000 wrote:
| That is called a sponge!
|
|     SPONGE(1)                    moreutils                    SPONGE(1)
|
|     NAME
|         sponge - soak up standard input and write to a file
|
|     SYNOPSIS
|         sed '...' file | grep '...' | sponge [-a] file
|
|     DESCRIPTION
|         sponge reads standard input and writes it out to the specified
|         file. Unlike a shell redirect, sponge soaks up all its input
|         before writing the output file. This allows constructing
|         pipelines that read from and write to the same file.
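|
| If sponge isn't installed, the soak-then-write behaviour is a couple
| of lines of Python (a toy stand-in, not a full replacement - no -a
| flag, no atomic rename):
|
|     import sys
|
|     # Soak up *all* of stdin first, and only open the output file
|     # afterwards, so a pipeline can read from and write to the same file.
|     data = sys.stdin.buffer.read()
|     with open(sys.argv[1], "wb") as f:
|         f.write(data)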
| shiomiru wrote:
| DOS also has a "pipe", which works exactly like that.
| (Obviously, since DOS can't run multiple programs in
| parallel.)
| Symmetry wrote:
| My first guess involved caching but I was thinking about whether
| the binary itself had to be read from disk or was already cached
| in RAM. Great linux-fu post.
| OJFord wrote:
| I was so confused about why this mattered/made _such_ a
| difference - until I went back and re-read from the top: OP does
| the benchmark timing in `main`, in the Zig app under test.
|
| If you don't do that, if you use the `time` CLI for example, this
| wouldn't have been a problem in the first place. Though sure, you
| couldn't have compared that to compiling fresh & running, and at
| least on small inputs you'd have wanted to do the input prep
| first anyway.
|
| But I think if you put the benchmark code inside the DUT you're
| setting yourself up for all kinds of gotchas like this.
| hawski wrote:
| I don't want to belittle the author, but I am surprised that
| people using a low-level language on Linux wouldn't know how Unix
| pipelines work or that reading one byte per syscall is quite
| inefficient. I understand that the author is still learning
| (aren't we all?), but it just feels like pretty fundamental
| knowledge. At the same time, the author managed to get better
| performance than the official thing had. I guess many things feel
| fundamental in retrospect.
| mtlynch wrote:
| Author here.
|
| Thanks for reading!
|
| > _I am surprised, that people using a low-level language on
| Linux wouldn 't know ... that reading one byte per syscall is
| quite inefficient._
|
| In my defense, it wasn't that I didn't realize one byte per
| syscall was inefficient; it was that I didn't realize that I
| was doing one syscall per byte read.
|
| I'm coming back to low-level programming after 8ish years of
| Go/Python/JS, so I wasn't really registering that I'd forgotten
| to layer in a buffered reader on top of stdin's reader.
|
| Alex Kladov (matklad) made an interesting point on the Ziggit
| thread[0] that the Zig standard library could adjust the API to
| make this kind of mistake less likely:
|
| > _I'd say readByte is a flawed API to have on a Reader. While
| you technically can read a byte-at-time from something like TCP
| socket, it just doesn't make sense. The reader should only
| allow reading into a slice._
|
| > _Byte-oriented API belongs to a buffered reader._
|
| [0] https://ziggit.dev/t/zig-build-run-is-10x-faster-than-
| compil...
| joelfried wrote:
| Thank you for sharing it!
|
| Articles like this are how one learns the nuances of such
| things, and it's good for people to keep putting them out
| there.
| hawski wrote:
| Zig certainly needs more work. That part is more on
| familiarity with Zig and how intuitive it is or isn't.
|
| In any case I would recommend anyone investigating things
| like that to run things through strace. It is often my first
| step in trying to understand what happens with anything -
| like a cryptic error "No such file or directory" without
| telling me what a thing tried to access. You would run:
|
| $ strace -f -o strace.log sh -c 'your | pipeline | here'
|
| You could then track things easily and see what is really
| happening.
|
| Cheers!
| mtlynch wrote:
| Thanks for the tip! I don't have experience with strace,
| and I'm wondering if I'm misunderstanding what you're
| saying.
|
| I tried running your suggested command, and I got 2,800
| lines of output like this:
| execve("/nix/store/vqvj60h076bhqj6977caz0pfxs6543nb-
| bash-5.2-p15/bin/sh", ["sh", "-c", "echo
| \"60016000526001601ff3\" | xx"...], 0x7fffffffcf38 /* 106
| vars */) = 0 brk(NULL)
| = 0x4ea000 arch_prctl(0x3001 /* ARCH_??? */,
| 0x7fffffffcdd0) = -1 EINVAL (Invalid argument)
| access("/etc/ld-nix.so.preload", R_OK) = -1 ENOENT (No
| such file or directory) openat(AT_FDCWD, "/nix/stor
| e/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libdl.s
| o.2", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\
| 1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"...,
| 832) = 832 newfstatat(3, "", {st_mode=S_IFREG|0555,
| st_size=15688, ...}, AT_EMPTY_PATH) = 0 mmap(NULL,
| 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
| 0) = 0x7ffff7fc2000 mmap(NULL, 16400, PROT_READ,
| MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7fbd000
| mmap(0x7ffff7fbe000, 4096, PROT_READ|PROT_EXEC,
| MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) =
| 0x7ffff7fbe000
|
| Am I doing it wrong? Or is there more training involved
| before one could usefully integrate this into debugging?
| Because to me, the output is pretty inscrutable.
| hawski wrote:
| There is a lot of output here, but you can grep around or
| filter with strace CLI. If you used -f option you should
| get PID numbers later on. Then you can look for all
| execve's to see how PIDs map to parts of the pipeline.
| For now maybe grep the log file with something like:
| "grep -e clone -e execve -e write -e read". You can do
| this with strace CLI, but I never remember the syntax and
| usually analyze the log extensively.
|
| I think something like this could work:
|
|     strace -f -e execve,clone,write,read -o strace.log sh -c '...'
|
| Clone is fork, so a creation of a new process, before
| eventual execve (with echo there will probably be just
| clone).
| dgoldstein0 wrote:
| strace tells you every syscall the process under it
| makes, so it's very helpful for understanding how a program
| interacts with the operating system - and with I/O, since all
| I/O mechanisms are managed by the operating system.
|
| As for how to filter this, I'll leave that to the other
| comments, but I personally would look at the man page or
| Google around for tips.
| amluto wrote:
| As a very general tip:
|
| > execution time: 438.059us
|
| That's a rather short time. (It's a lot of _cycles_, but
| there are plenty of things one might do on a computer that
| take time comparable to this, especially anything involving
| IO. It's only a small fraction of a disk head seek if you're
| using spinning disks, and it's only a handful of non-
| overlapping random accesses even on NVMe.)
|
| So, when you benchmark anything and get a time this short,
| you should make sure you're benchmarking the right thing.
| Watch out for fixed costs or correct for them. Run in a loop
| and see how time varies with iteration count. Consider using
| a framework that can handle this type of benchmarking.
|
| A result like "this program took half a millisecond to run"
| doesn't really tell you much about how long any part of it
| took.
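|
| One cheap sanity check is to run the work in a loop and watch how
| the per-iteration time behaves as the count grows (a Python sketch;
| work() is a made-up stand-in for the code under test):
|
|     import time
|
|     def work():
|         sum(range(10_000))  # stand-in for the code under test
|
|     for n in (1, 10, 100, 1000):
|         start = time.perf_counter()
|         for _ in range(n):
|             work()
|         elapsed = time.perf_counter() - start
|         # If per-iteration time doesn't stabilize as n grows, fixed
|         # costs (startup, I/O, pipe waits) are dominating the number.
|         print(f"n={n:5d} total={elapsed:.6f}s per-iter={elapsed / n:.9f}s")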
| vlovich123 wrote:
| FWIW, the current version of the code (i.e. with the buffered
| reader), on my machine at least, runs identically fast with
| and without the tmp file.
|
| Here's a possibly more detailed reason as to why
| https://news.ycombinator.com/item?id=39764287#39768022.
| nebulous1 wrote:
| I feel like you are underestimating how many fundamental
| misunderstandings people can have (including both of us) even
| though they have deep understanding of adjacent issues.
| 0x000xca0xfe wrote:
| There are many fundamental things People Should Know(tm), but
| we are not LLMs that have ingested entire libraries worth of
| books.
|
| Exploratory programming and being curious about strange effects
| is a great way to learn the fundamentals. I already knew how
| pipes and processes work, but I don't know the Ethereum VM. The
| author now knows both.
| cpuguy83 wrote:
| It's one of those things that you don't usually need to think
| about, so you don't.
|
| Not too long ago I hit this same realization with pipes because
| my "grep ... file | sed > file" (or something of that nature)
| was racey.
|
| I took the time to think about it and realized "oh I guess
| that's how pipes would _have_ to be implemented".
| boffinAudio wrote:
| This deleterious effect is a factor in computing. We deal with
| it every few years: kids graduate, having ignored the history
| of their elders in order to focus on the new and cool - and
| once they hit industry, find that they really, really should
| have learned POSIX or whatever.
|
| It's not uncommon. As a professional developer I have observed
| this obfuscation of prior technology countless times,
| especially with junior devs.
|
| There is a lot to learn. Always. It doesn't ever stop.
| franciscop wrote:
| This is exactly what surprised me as well. I'm literally now
| learning in depth WebStreams[1] in JS (vs the traditional Node
| Streams) and I've seen too many times the comparison of how
| "pipe() and pipeTo() behave just like Unix's pipes |". Reading
| this article makes me think this might not be the best
| comparison, specially since for many webdevs it's the first
| time for approaching some CS concepts. OTOH, the vast majority
| of webdevs don't really need to learn WebStreams in-depth.
|
| [1] https://exploringjs.com/nodejs-shell-scripting/ch_web-
| stream...
| LAC-Tech wrote:
| Most people have gaps somewhere in their knowledge. I learned
| very early on, as a general superstition, to always try to batch
| things that dealt with the world outside, like file writes,
| allocations, network requests etc. But for years I had no idea
| what a syscall even was.
| 0x000xca0xfe wrote:
| Also a great reminder to always benchmark with different data set
| sizes.
| masto wrote:
| I realize one needs a catchy title and some storytelling to get
| people to read a blog article, but for a summary of the main
| points:
|
| * This is not about a build step that makes the app perform
| better
|
| * The app isn't 10x faster (or faster at all; it's the same
| binary)
|
| * The author ran a benchmark two ways, one of which inadvertently
| included the time taken to generate sample input data, because it
| was coming from a pipe
|
| * Generating the data before starting the program under test
| fixes the measurement
| meowface wrote:
| Another semi-summary of the core part of the article:
|
| >"echo '60016000526001601ff3' | xxd -r -p | zig build run
| -Doptimize=ReleaseFast" is much faster than "echo
| '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes"
| (compiling + running the program is faster than just running an
| already-compiled program)
|
| >When you execute the program directly, xxd and count-bytes
| start at the same time, so the pipe buffer is empty when count-
| bytes first tries to read from stdin, requiring it to wait
| until xxd fills it. But when you use zig build run, xxd gets a
| head start while the program is compiling, so by the time
| count-bytes reads from stdin, the pipe buffer has been filled.
|
| >Imagine a simple bash pipeline like the following: "./jobA |
| ./jobB". My mental model was that jobA would start and run to
| completion and then jobB would start with jobA's output as its
| input. It turns out that all commands in a bash pipeline start
| at the same time.
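|
| That "everything starts at once" behaviour is easy to poke at from
| Python, too (a toy sketch; `sleep 3; echo` stands in for jobA):
|
|     import subprocess, time
|
|     start = time.time()
|     # Both stages are spawned immediately; `cat` just blocks on an
|     # empty pipe until the first stage writes something at ~3s.
|     p = subprocess.Popen("sleep 3; echo from-jobA", shell=True,
|                          stdout=subprocess.PIPE)
|     q = subprocess.Popen(["cat"], stdin=p.stdout)
|     print(f"both running after {time.time() - start:.2f}s")  # ~0.00s
|     p.stdout.close()
|     q.wait()
|     print(f"pipeline done after {time.time() - start:.2f}s")  # ~3s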
| anonymous-panda wrote:
| That doesn't make sense unless you have only 1 or 2 physical
| CPUs with contention. In a modern CPU the latter should be
| faster and I'm left unsatisfied by the correctness of the
| explanation. Am I just being thick or is there a more
| plausible explanation?
| DougBTX wrote:
| It depends on where the timing code is. If the timer starts
| after all the data has already been loaded, the time
| recorded will be lower (even if the total time for the
| whole process is higher).
| anonymous-panda wrote:
| I'm not following how that would result in a 10x
| discrepancy. The amount of data we're talking about here
| is laughably small (it's like 32 bytes or something)
| masklinn wrote:
| > The amount of data we're talking about here is
| laughably small
|
| So is the runtime.
| masklinn wrote:
| The latter is faster in actual CPU time, however note that
| in TFA the measurement only starts with the program, it does
| not start with the start of the pipeline.
|
| Because the compilation time overlaps with the pipes
| filling up, blocking on the pipe is mostly excluded from
| the measurement in the former case (by the time the program
| starts there's enough data in the pipe that the program can
| slurp a bunch of it, especially reading it byte by byte),
| but included in the latter.
| anonymous-panda wrote:
| My hunch is that if you added the buffered reader and
| kept the original xxd in the pipe you'd see similar
| timings.
|
| The amount of input data is just laughably small here to
| result in a huge timing discrepancy.
|
| I wonder if there's an added element where the constant
| syscalls are reading on a contended mutex and that
| contention disappears if you delay the start of the
| program.
| vlovich123 wrote:
| Good hunch. On my machine (13900k) & zig 0.11, the latest
| version of the code:
|
| > INFILE="$(mktemp)" && echo $INFILE && \
| >   echo '60016000526001601ff3' | xxd -r -p > "${INFILE}" && \
| >   zig build run -Doptimize=ReleaseFast < "${INFILE}"
|
| > execution time: 27.742us
|
| vs
|
| > echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast
|
| > execution time: 27.999us
|
| The idea that the overlap of execution here by itself
| plays a role is nonsensical. The overlap of execution +
| reading a byte at a time causing kernel mutex contention
| seems like a more plausible explanation although I would
| expect someone better knowledgeable (& more motivated)
| about capturing kernel perf measurements to confirm. If
| this is the explanation, I'm kind of surprised that there
| isn't a lock-free path for pipes in the kernel.
| rofrol wrote:
| This @mtlynch
| vlovich123 wrote:
| To sanity check myself, I reran this without the buffered
| reader and still don't see the slow execution time:
|
| > echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast
|
| > execution time: 28.889us
|
| So I think my machine config for whatever reason isn't
| representative of whatever OP is using.
|
| Linux-ck 6.8 CONFIG_NO_HZ=y CONFIG_HZ_1000=y
|
| Intel 13900k
|
| zig 0.11
|
| bash 5.2.26
|
| xxd 2024-02-10
|
| Would be good if someone that can repro it compares the
| two invocation variants with buffered reader implemented
| & lists their config.
| mtlynch wrote:
| Based on what you've shared, the second version can start
| reading instantly because "INFILE" was populated in the
| previous test. Did you clear it between tests?
|
| Here are the benchmarks before and after fixing the
| benchmarking code:
|
| Before: https://output.circle-
| artifacts.com/output/job/2f6666c1-1165...
|
| After: https://output.circle-
| artifacts.com/output/job/457cd247-dd7c...
|
| What would explain the drastic performance increase if
| the pipelining behavior is irrelevant?
| vlovich123 wrote:
| That was just a typo in the comment. The command run
| locally was just a straight pipe.
|
| Using both invocation variants, I ran:
|
| 8a5ecac63e44999e14cdf16d5ed689d5770c101f (before buffered
| changes)
|
| 78188ecbc66af6e5889d14067d4a824081b4f0ad (after buffered
| changes)
|
| On my machine, they're all equally fast at ~28 us.
| Clearly the changes only had an impact on machines with a
| different configuration (kernel version or kernel config
| or xxd version or hw).
|
| One hypothesis outlined above is that when you
| pipeline all 3 applications, the single-byte reader
| version is doing back-to-back syscalls, and that's causing
| contention between your code and xxd on a kernel mutex,
| leading to things going to sleep extra long.
|
| It's not a strong hypothesis though just because of how
| little data there is and the fact that it doesn't repro
| on my machine. To get a real explanation, I think you
| have to actually do some profiling measurements on a
| machine that can repro and dig in to obtain a satisfactory
| explanation of what exactly is causing the problem.
| karmakaze wrote:
| I would definitely classify the title as _clickbait_ because
| the app didn't go "10x faster".
| underdeserver wrote:
| If I were trying to optimize my code, I would start with loading
| the entire benchmark bytecode to memory, then start the counter.
| Otherwise I can't be sure how much time is spent reading from a
| pipe/file to memory, and how much time is spent in my code.
|
| Then I would try to benchmark what happens if it all fits in L1
| cache, L2, L3, and main memory.
|
| Of course, if the common use case is reading from a file,
| network, or pipe, maybe you can optimize that, but I would take
| it step by step.
| jedisct1 wrote:
| This is exactly the thing that feels obvious once you realize it,
| but that can be puzzling until you do.
| john-tells-all wrote:
| This is an excellent writeup, with interesting ideas and clear
| description of actions taken. My idea of pipelines, also, was
| flawed. Well done!
|
| Nothing to do with Zig. Just a nice debugging story.
| WalterBright wrote:
| Back in college, a friend of mine decided to learn how to
| program. He had never programmed before. He picked up the DEC
| FORTRAN-10 manual and read it cover to cover.
|
| He then wrote a program that generated some large amount of data
| and wrote it to a file. Being much smarter than I am, his first
| program worked the first time.
|
| But it ran terribly slowly. Baffled, he showed it to his friend,
| who exclaimed, "Why are you, in a loop, opening the file, appending
| one character, and closing the file? That's going to run
| incredibly slowly. Instead, open the file, write all the data,
| then close it!"
|
| The reply was "the manual didn't say anything about that or how
| to do I/O efficiently."
| TylerE wrote:
| I firmly believe that teaching how to idiomatically do both
| character and line oriented file IO should be the first thing
| any language tutorial teaches, almost. Just as soon as you've
| introduced enough syntax.
| PoignardAzur wrote:
| FWIW, the Epitech curriculum starts your very first C programming
| lesson by making you write a "my_putchar" function that does
| a "write" syscall with a single character. Then you spend the
| next few days learning how to create my_putnbr, my_putstr,
| etc, using that single building block.
|
| I think that's the right choice, by the way. Baby developers
| don't need to learn efficient I/O, they need to learn how you
| make the pile of sand do smart things.
|
| And if you've spent weeks doing I/O one syscall per
| character, getting to the point you write hundreds of lines
| that way, the moment some classmate shows you that you can
| 100x your program's performance by batching I/O gets burned
| in your memory forever, in a way "I'm doing it because the
| manual said so" doesn't.
| TylerE wrote:
| I said idiomatic because that's the form they're going to
| encounter it in the wild, in library doc, in stack overflow
| answers.
|
| I don't think the sort of low-level bit banging you propose
| is a worthwhile use of a student's time, given the vast
| amount they have to learn that won't be immediately
| obsolete.
| samatman wrote:
| I firmly disagree with this. We're talking about learning
| C, not learning an arbitrary programming language. The
| course of study the GP comment suggests teaches syscalls,
| pipes, buffering, and most important, it teaches
| mechanical sympathy. All of which are things a C
| programmer needs to understand.
|
| More programming tasks than you might imagine are low-
| level bit banging, and C remains the language of choice
| for doing them. It might be Zig one day, and if so, the
| same sort of deep-end dive into low-level detail will
| remain a good way to approach such a language.
|
| Far from becoming "rapidly obsolete", learning in this
| style will prevent this sort of mistake for years into
| the future: https://news.ycombinator.com/item?id=39766130
| TylerE wrote:
| "we" are certainly not talking about C. I never mentioned
| C, nor any language. This was intentional.
| mcguire wrote:
| This is similar to the presentation in _Software Tools_,
| IIRC.
| everforward wrote:
| I would argue that they should be teaching how file IO works
| at a low level at some point (preferably as one of the first
| "complicated" bits).
|
| Everybody should, at some early point, interact with basic
| file descriptors and see how they map to syscalls. Preferably
| including building their own character and line oriented
| abstractions on top of that, so they can see how they work.
|
| I'm convinced that IO is in the same category as parallelism;
| most devs understand it poorly at best, and the ones who do
| understand it are worth their weight in gold.
| TylerE wrote:
| I do not think the typical language tutorial is well served
| by trying to teach a condensed version of all of CS.
| WalterBright wrote:
| I beat benchmark after benchmark in the '80s on disk I/O. I was
| amazed that nobody else figured out what my compiler was
| doing - using a 16K buffer for the floppy drive rather than
| 512 bytes.
| pixl97 wrote:
| Heh, wasn't 16k most of the memory in the machine? Large
| buffers do have other interesting and fun side effects,
| though back then you probably didn't have any threads or
| any/many of the things buffers cause these days.
| WalterBright wrote:
| The FORTRAN-10 manual was not a tutorial, it was a
| specification.
| michael1999 wrote:
| If you want to create something like the pipe behaviour the author
| expected (buffer all output before sending it to the next command),
| the sponge command from moreutils can help.
| styfle wrote:
| > By adding a benchmarking script to my continuous integration
| and archiving the results, it was easy for me to identify when my
| measurements changed.
|
| This assumes CI runs on the same machine with same hardware every
| time, but most CI doesn't do that.
| boesboes wrote:
| And that the hardware is not overbooked. I found that my ci/cd
| runs would vary between 8 and 14 minutes (for a specific task
| in the pipeline, no cache involved) between reruns.
|
| And it seemed correlated to time of day. So pretty sure they
| had some contention there.
|
| Edit: and that was with all the same CPUs reported to the OS,
| at least.
| mcguire wrote:
| There seems to be a small misunderstanding on the behavior of
| pipes here. All the commands in a bash pipeline do start at the
| same time, but output goes into the pipeline buffer whenever the
| writing process writes it. There is no specific point where the
| "output from jobA is ready".
|
| The author's example code, " _jobA starts, sleeps for three
| seconds, prints to stdout, sleeps for two more seconds, then
| exits_ " and " _jobB starts, waits for input on stdin, then
| prints everything it can read from stdin until stdin closes_ " is
| measuring 5 seconds not because the input to jobB is not ready
| until jobA terminates but because jobB is waiting for the pipe to
| close, which doesn't happen until jobA ends. That explains the
| timing of the output:
|
|     $ ./jobA | ./jobB
|     09:11:53.326 jobA is starting
|     09:11:53.326 jobB is starting
|     09:11:53.328 jobB is waiting on input
|     09:11:56.330 jobB read 'result of jobA is...' from input
|     09:11:58.331 jobA is terminating
|     09:11:58.331 jobB read '42' from input
|     09:11:58.333 jobB is done reading input
|     09:11:58.335 jobB is terminating
|
| The bottom line is that it's important to actually measure what
| you want to measure.
| mtlynch wrote:
| Author here.
|
| Thanks for reading!
|
| > _All the commands in a bash pipeline do start at the same
| time, but output goes into the pipeline buffer whenever the
| writing process writes it. There is no specific point where the
| "output from jobA is ready"._
|
| Right, I didn't mean to give the impression that there's a time
| at which all input from jobA is ready at once. But there is a
| time when jobB can start reading stdin, and there's a time when
| jobA closes the handle to its stdout.
|
| The reason I split jobA's output into two commands is to show
| that jobB starts reading 3 seconds after the command begins,
| and jobB finishes reading 2 seconds after reading the first
| output from jobA.
| dsm9000 wrote:
| This post is another example of why I like Zig so much. It seems
| to get people talking about performance in a way that helps them
| learn how things work below today's heavily abstracted veneer.
___________________________________________________________________
(page generated 2024-03-20 23:02 UTC)