[HN Gopher] Wc2: Investigates optimizing 'wc', the Unix word count program
       ___________________________________________________________________
        
       Wc2: Investigates optimizing 'wc', the Unix word count program
        
       Author : PaulHoule
       Score  : 147 points
       Date   : 2024-06-20 13:54 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lesuorac wrote:
       | State machines are an underrated approach.
       | 
       | Just remember, if you're ever debugging something and some values
       | in a Class/Object/Group of them are set and others are unset the
       | state machine avoids that issue. When you go to transition
       | between states check that the future state will be valid and if-
       | not go to an error state with that information. The amount of
       | time I've seen spent debugging where spurious `null`s came from
       | completely dwarfs the time it would've taken just to write that
       | class/algorithm as a state machine.
       | 
       | Kinda surprised wc wasn't a state machine to begin with. Isn't
       | it effectively a special-character counter where, if you see a
       | character followed by a space, you bump up the word count? I'm
       | judging by the repo's comment of "The real programs spend most
       | of their time in functions like mbrtowc() to parse multi-byte
       | characters and iswspace() to test if they are spaces -- which
       | re-implementations of wc skip." that the big improvement is
       | removing the unnecessary work of those functions. mbrtowc [1]
       | appears to re-create the provided substring, which isn't
       | necessary for counting.
       | 
       | [1]: https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc
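
       For the curious, the table-driven idea can be sketched in a few
       lines of C. This is an illustrative toy (two states, ASCII
       whitespace only), not wc2's actual tables:

```c
#include <stddef.h>

/* Toy table-driven word counter: two states, ASCII whitespace only.
 * An illustration of the state-machine idea, not wc2's actual tables. */
enum { OUTSIDE, INSIDE };

size_t count_words_fsm(const unsigned char *buf, size_t len) {
    /* Character classes: 1 = whitespace, 0 = word byte. */
    unsigned char cls[256] = {0};
    cls[' '] = cls['\t'] = cls['\n'] = cls['\v'] = cls['\f'] = cls['\r'] = 1;

    /* next[state][class]: the whole parser is one table lookup per byte. */
    static const unsigned char next[2][2] = {
        [OUTSIDE] = { INSIDE, OUTSIDE },
        [INSIDE]  = { INSIDE, OUTSIDE },
    };

    size_t words = 0;
    unsigned state = OUTSIDE;
    for (size_t i = 0; i < len; i++) {
        unsigned nstate = next[state][cls[buf[i]]];
        words += (state == OUTSIDE) & (nstate == INSIDE); /* count word starts */
        state = nstate;
    }
    return words;
}
```

       The real program adds more states to handle UTF-8 sequences, but
       the per-byte work stays the same: one class lookup, one
       transition lookup.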
        
         | munchler wrote:
         | > Just remember, if you're ever debugging something and some
         | values in a Class/Object/Group of them are set and others are
         | unset the state machine avoids that issue.
         | 
         | Better yet, switch to a language that supports discriminated
         | unions to avoid invalid state without having to write a low-
         | level state machine. Each case in the union represents one of
         | the valid states of the type. This is one of the many benefits
         | of functional programming.
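
         In C itself, the closest you get is a hand-rolled tagged union.
         It expresses the same shape, but unlike a real discriminated
         union the compiler won't enforce the tag check. All the names
         below are hypothetical:

```c
#include <string.h>

/* Hand-rolled tagged union: each case carries only the fields that are
 * valid in that state, so half-initialized combinations have no
 * representation. (C won't force you to check the tag before reading a
 * member -- languages with checked discriminated unions will.) */
struct conn {
    enum { IDLE, CONNECTING, READY, FAILED } tag;
    union {
        struct { int retries; }   connecting; /* valid only when CONNECTING */
        struct { int socket_fd; } ready;      /* valid only when READY */
        struct { int error; }     failed;     /* valid only when FAILED */
    } u;
};

const char *conn_describe(const struct conn *c) {
    switch (c->tag) {
    case IDLE:       return "idle";
    case CONNECTING: return "connecting";
    case READY:      return "ready";
    case FAILED:     return "failed";
    }
    return "corrupt";
}
```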
        
           | tialaramex wrote:
           | This comes down to "Make Invalid States Unrepresentable"
           | 
           | Notably Go's choice here is instead "Make Representable
           | States Valid". So for example in Go it's not possible for a
           | type not to have a default value - every type must have a
           | value you get by default, you could name this unwanted
           | default value FUCK_OFF or DO_NOT_USE if you like, but in a
           | larger system (where Go is supposed to thrive) you can be
           | certain you'll find FUCK_OFF and DO_NOT_USE values in the
           | wild.
           | 
           | This is one of the few things the C++ type system almost gets
           | right. You really can say for your C++ type no, it doesn't
           | have a default. Unfortunately exception safety means we can
           | easily end up in a place where too bad it's in _some_ state
           | anyway, but if we turn off exceptions or discount them we can
           | live without nonsensical defaults.
        
             | typon wrote:
             | Maybe something like CppFront can allow metaprogramming
             | such that the default constructor is deleted by default on
             | all types and becomes opt-in.
        
             | ekidd wrote:
             | > _This is one of the few things the C++ type system almost
             | gets right. You really can say for your C++ type no, it
             | doesn't have a default._
             | 
             | Rust does even better here, I think, for fairly subtle
             | reasons:
             | 
             | - Exceptions are replaced by Result, with explicitly-marked
             | return points. They're far less magic. ("panic!" is still
             | magic, but it's allowed to be a call to "abort", so you
             | only use it when the world is burning.)
             | 
             | - "Move semantics" by default makes a lot of edge cases
             | cleaner. If you move out of an object by default, it gets
             | consumed, not set to a default value. Similarly, moving
             | values into a newly-constructed object can't fail, because
             | it's just memmove.
             | 
             | - Object construction is atomic from the programmer's
             | perspective.
             | 
             | - Default initialization only works if you implement
             | "Default". Which is actually just a normal trait with no
             | magic.
             | 
             | - If you discover that you have a state machine, you can
             | switch to discriminated unions ("enum") to make your states
             | mutually exclusive and clearly represented.
             | 
             | This whole area is one of the better parts of Rust's
             | design.
        
       | im3w1l wrote:
       | I'm surprised there is no mention of SIMD. Like I'm sure this is
       | "fast enough" but if you want to make a really fast wc for fun
       | wouldn't that be a natural direction?
        
         | thequux wrote:
         | A hand-written SIMD wc is likely to be even faster than a state
         | machine, but at the cost of orders of magnitude more work. The
         | huge advantage of state machines is that they are a relatively
         | generic approach that provides massive speedups nearly every
         | time. A SIMD wc algorithm is a significant effort that can't be
         | generalized, and is only likely to provide a 4x speedup or so:
         | rarely worth it except in cases where the task is a performance
         | hotspot for a _lot_ of applications.
        
           | Joker_vD wrote:
           | SIMD-based reimplementations of wc already exist, e.g. this
           | [0] one, and it's rather straightforward.
           | 
           | [0] https://github.com/expr-fi/fastlwc
        
           | mananaysiempre wrote:
           | An ASCII SIMD wc would likely be much more than 4x faster.
           | The 320 MB/s speed TFA quotes is good for a byte-at-a-time
           | solution, but pitiful as far as the capabilities of modern
           | machines are concerned. A decent SSD is an order of magnitude
           | faster than that. Around 1 GB/s (i.e. ~100 cycles/fetch) is
           | table stakes for a task of this complexity.
           | 
           | (TFA almost nerd-sniped me into writing a SIMD implementation
           | already, and you finished the job. End result: on an aging
           | 2.5 GHz Broadwell, I'm seeing 300 MB/s for TFA's C
           | implementation and 3200 MB/s for my first attempt at an
           | x86-64-v3 one. So more like a 10x speedup, and it's highly
           | likely one can do better. Admittedly I did spend around an
           | hour on these 50 lines, and would need to spend more in
           | reality to check for CPU features at runtime and so on.)
           | 
           | However, TFA calls going ASCII-only "cheating", and I can see
           | its point. I don't know how difficult a Unicode SIMD wc would
           | be or if it's even possible to do well. (Bonus points: do you
           | want to define a word boundary as a space/nonspace boundary
           | the way[1] POSIX tells you to, or the more complex way[2]
           | Unicode tells you to?)
           | 
           | [1] https://pubs.opengroup.org/onlinepubs/9699919799/utilitie
           | s/w...
           | 
           | [2] https://www.unicode.org/reports/tr29/#Word_Boundaries
        
             | mananaysiempre wrote:
             | The (cheating, ASCII-only) SIMD implementation for
             | reference:
             | 
             |     #if 0 /* shell polyglot */
             |     exec ${CC:-cc} $CPPFLAGS -g -O2 -march=x86-64-v3 $CFLAGS -o "${0%.c}" "$0" || exit $?
             |     #endif
             |     /* SPDX-License-Identifier: CC0-1.0 */
             | 
             |     #include <immintrin.h>
             |     #include <stdint.h>
             |     #include <stdio.h>
             |     #include <unistd.h>
             | 
             |     int main(int argc, char **argv) {
             |         const char *const progname = argc ? argv[0] : "<unknown>";
             |         unsigned long long bytes = 0, words = 0, lines = 0;
             |         uint32_t prev = -1;
             |         for (;;) {
             |             static __m256i buf[1024 * 1024];
             |             ssize_t k, n = 0;
             |             do {
             |                 k = read(STDIN_FILENO, (char *)buf + n, sizeof buf - n);
             |             } while (k > 0 && (n += k) < sizeof buf);
             |             if (k < 0) { perror(progname); return 1; }
             |             if (!n) break;
             |             bytes += n;
             |             while (n % sizeof *buf) ((char *)buf)[n++] = ' ';
             |             n /= sizeof *buf;
             | 
             |             for (size_t i = 0; i < n; i++) {
             |                 __m256i x = _mm256_load_si256(&buf[i]);
             |                 __m256i nl = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('\n'));
             |                 lines += _mm_popcnt_u32(_mm256_movemask_epi8(nl));
             |                 const __m256i table = _mm256_broadcastsi128_si256(_mm_setr_epi8(
             |                     /* 0 */ ' ', 0,    0,    0,    0,    0,    0, 0,
             |                     /* 8 */ 0,   '\t', '\n', '\v', '\f', '\r', 0, 0));
             |                 __m256i ws = _mm256_cmpeq_epi8(x, _mm256_shuffle_epi8(table,
             |                     _mm256_and_si256(x, _mm256_set1_epi8(0x0F))));
             |                 uint64_t mask = _mm256_movemask_epi8(ws);
             |                 words += _mm_popcnt_u32(mask ^ (mask << 32 | prev) >> 31);
             |                 prev = mask;
             |             }
             |         }
             |         words += !(prev >> 31);
             |         printf("%llu %llu %llu\n", lines, words / 2, bytes);
             |         return 0;
             |     }
        
         | Galanwe wrote:
         | I think this is but a detail in the context of the article.
         | 
         | Yes, you can implement your state machine on SIMD sized states,
         | but there are also millions of other optimizations you can do
         | like that, aligning your reads, prefetching the file,
         | parallelizing, etc.
         | 
         | Doesn't change the core concept.
        
       | mjamil wrote:
       | "Many programmers think pointer-arithmetic is faster." Don't
       | modern compilers make this statement false (i.e., both approaches
       | are implemented via the same machine instructions)?
        
         | IshKebab wrote:
         | They definitely do in simple cases (e.g. a simple for loop). In
         | more complex cases, especially when the index/pointer escapes a
         | function then the code will be different.
        
         | jimbobthrowawy wrote:
         | I checked using godbolt, and when compiling with -O2 (like his
         | makefile) only the parse_chunk_pp version is in the emitted
         | assembly and is called regardless of the -P option. I think
         | compilers have done this where applicable for a long time.
         | 
         | Compiling without any arguments leads to two lines differing in
         | the assembly for the functions.
        
         | trealira wrote:
         | Something funny is that GCC may compile your code that uses
         | pointer arithmetic into one that uses an index register, and
         | compile your code that uses an index into an array as pointer
         | arithmetic.
         | 
         | https://godbolt.org/z/9KPnnM7Po
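
         A minimal pair to try on godbolt yourself; with -O2, GCC and
         Clang typically emit equivalent loop bodies for both forms
         (worth verifying for your compiler and flags rather than
         assuming):

```c
#include <stddef.h>

/* Index form and pointer-arithmetic form of the same loop; modern
 * optimizers usually canonicalize these to identical machine code. */
long sum_indexed(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];          /* index form */
    return s;
}

long sum_pointer(const int *a, size_t n) {
    long s = 0;
    for (const int *p = a, *end = a + n; p != end; p++)
        s += *p;            /* pointer form */
    return s;
}
```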
        
       | markstos wrote:
       | Was this ever submitted as an update to the `wc` shipped on
       | systems?
        
         | canucker2016 wrote:
         | I doubt it - from the project's readme page:
         | 
         | "...Now the real wc works with a lot more character-sets, and
         | we don't do that. But by implementing UTF-8, we've shown that
         | it's possible, and that the speed for any character-set is the
         | same."
         | 
         | Looking at the command line options for wc2.c and comparing to
         | the FSF wc man page, it looks like there are some
         | missing/unsupported options in wc2.c as well.
        
           | anon1040 wrote:
           | > ...the real wc works with a lot more character-sets...
           | 
           | This should read: the real wc works with a lot more character
           | _encodings_
        
       | YesThatTom2 wrote:
       | I love state machines and every time I use one my workers think I
       | invented it because they've never seen them before.
       | 
       | The data for the state machine in this article might be best
       | prepared by generating it from a program. That generator program
       | doesn't need to care (too much) about performance since it is
       | run during the build process. I like the idea of doing a lot of
       | work now to save time in the future.
        
         | pletnes wrote:
         | Do you have any accessible references on state machines? I've
         | just read about them a bit but never coded one in a project.
         | Seems like a useful concept for testable, reliable code.
        
           | GrantMoyer wrote:
           | The video _Finite State Machines explained_ by Abelardo
           | Pardo[1] seems like a good introduction (I'm not familiar
           | with the author; I just searched Finite State machine on
           | youtube and found the first result which wasn't a jumbled
           | mess of abstract or misused jargon).
           | 
           | It may seem simple, but that's truly all there is to finite
           | state machines: a set of finite states and a set of events
           | which cause transitions between states. Their specific
           | applications to various domains are a much larger topic, but
           | with a basic understanding, any application should be clear
           | whenever it comes up. There's no one way to implement a state
           | machine, and you've probably used them all the time, even if
           | you haven't thought about them formally. For example, a
           | simple but common 2-state machine takes the form:
           | 
           |     running = true
           |     while running:
           |         ...
           |         if <some condition>:
           |             running = false
           | 
           | [1]: https://www.youtube.com/watch?v=hJIST1cEf6A
        
             | d0mine wrote:
             | It could be written as:
             | 
             |     while True:
             |         ...
             |         if <exit condition>:
             |             break
        
           | trealira wrote:
           | A finite state machine has a set of input symbols, a set of
           | states, and a transition function, which maps the current
           | state and current input to the next state (and, when
           | programming it, it doesn't have to be a literal function; for
           | example, it could be a 2D array). It also has a start state,
           | and a set of accept states.
           | 
           | It's also common for state machines to have some sort of
           | output; otherwise, all an FSM can do is recognize whether an
           | input string is "good." So, there's also an output function,
           | so that the FSM actually does something while processing the
           | input string. There are two types of FSMs: a Mealy Machine,
           | and a Moore machine. In a Mealy machine, the output function
           | maps the current state and the current input to an output;
           | so, one state could be associated with multiple outputs. In a
           | Moore machine, the output is _purely_ a function of the
           | current state, no matter the input. Usually, a Moore machine
           | has more states than the equivalent Mealy machine. In
           | practice, Moore machines are a lot rarer than Mealy machines.
           | 
           | You can model FSMs in a variety of ways. You could use the
           | state pattern [0]. You could have a table-based state
           | machine; examples in C here [1] [2]. It's also common to
           | model them using a loop and a switch statement with integers
           | to represent the current state - or just use "goto" to go to
           | the next state, if it's available in your language.
           | 
           | Also, this is tangential, but something nice about coroutines
           | is that they can save you from writing an explicit state
           | machine, as the state is implicit in your code. One example
           | of this is how it's easier to write iterators as coroutines
           | that yield results as they run, rather than as structures
           | that update themselves based on their current state,
           | transition to the next state, and yield a result.
           | 
           | So far, what I've covered is basically a deterministic finite
           | automaton (DFA). There are also non-deterministic finite
           | automata (NDFAs), which are used to implement regular
           | expressions (at least when regular expressions aren't
           | extended to require more powerful recognizers) [3], as well
           | as pushdown automata (PDAs), which are basically finite
           | automata with a stack, and can parse e.g. programming
           | languages. Finally, there are Turing machines.
           | 
           | For theory on that, I'd recommend reading Sipser's
           | _Introduction to the Theory of Computation._ If you look up
           | the book on Google, it's not difficult to find a PDF of it.
           | 
           | [0]: https://refactoring.guru/design-patterns/state
           | 
           | [1]: https://nullprogram.com/blog/2020/12/31
           | 
           | [2]: https://ideone.com/94U6aJ (this one removes C-style
           | comments from source code - I didn't make it)
           | 
           | [3]: https://swtch.com/~rsc/regexp/regexp1.html
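
           A minimal concrete example of the table-based style described
           above (2D-array transition function, start state, accept set):
           a DFA over '0'/'1' that accepts strings with an even number of
           '1's. This is my own toy example, not from the article:

```c
#include <stdbool.h>

/* DFA with states {0 = even number of ones (accepting), 1 = odd} and
 * the transition function stored as a 2D array. */
bool accepts_even_ones(const char *s) {
    static const int next[2][2] = {
        { 0, 1 },  /* from even: '0' stays even, '1' goes odd */
        { 1, 0 },  /* from odd:  '0' stays odd,  '1' goes even */
    };
    int state = 0;                       /* start state */
    for (; *s; s++)
        state = next[state][*s == '1'];  /* one table lookup per symbol */
    return state == 0;                   /* accept set: {0} */
}
```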
        
         | SoftTalker wrote:
         | I agree, and I raised an eyebrow at the "The algorithm is known
         | as an "asynchronous state-machine parser". It's a technique for
         | parsing that you don't learn in college."
         | 
         | I certainly learned about state machines in college. Not sure
         | we ever studied this particular algorithm but the concept was
         | definitely covered.
         | 
         | And I also agree that when you are dealing with a known, finite
         | set of values, pre-computing lookup or index tables for
         | techniques like this can be a big win.
        
         | FroshKiller wrote:
         | I'm a little surprised to hear this. Writing a state machine to
         | simulate an alarm clock used to be an introductory project for
         | first-year computer science undergraduates.
        
       | jmholla wrote:
       | > No, you aren't suppose to be able to see how the word-count
       | works by looking at this code. The complexity happens elsewhere,
       | setting up the state-machine.
       | 
       | This is like most of the reason I opened this article. I wish
       | they'd spend more time talking about this. In fact, most of the
       | README covers details that are important but irrelevant to
       | their algorithmic improvements.
       | 
       | Maybe they expect you to open up the code, but even there the
       | comments are largely describing what things do and not why, so I
       | feel you need to have some understanding of the algorithm in
       | question. For example, the parts where the state table gets
       | constructed have rather unhelpful comments for the uninformed:
       | 
       |     /**
       |      * Build an ASCII row. This configures low-order 7-bits,
       |      * which should be roughly the same for all states
       |      */
       |     static void
       |     build_basic(unsigned char *row, unsigned char default_state,
       |                 unsigned char ubase)
       | 
       |     ...
       | 
       |     /**
       |      * This function compiles a DFA-style state-machine for
       |      * parsing UTF-8 variable-length byte sequences.
       |      */
       |     static void
       |     compile_utf8_statemachine(int is_multibyte)
       | 
       | Even `wc2o.c` doesn't delve into the magic numbers it has chosen.
       | I was hoping this repo would be more educational and explain how
       | the state table works and why it's constructed the way it is.
       | 
       | Does anyone have a good resource for learning more about
       | asynchronous state-machine parsers, that also could hopefully
       | help explain why this is better than just iterating over
       | characters? I'm guessing maybe it's the lack of branching?
        
         | VMG wrote:
         | I think the documentation might be stale, `build_basic` clearly
         | has code for non-ASCII characters
        
         | thechao wrote:
         | I'm with you. It appears to be a fairly bog-standard DFA, but
         | divining that from 4x4 `int`s and a 256 `char` array feels like
         | I'd just be doing all of the heavy lifting, intellectually.
         | However... it does look like it's trying to detect spaces
         | (that's the array of 256 `char`, as `char` is self-indexing). I
         | can't be bothered to remember how `wc` works, but my guess is
         | that the state-machine has transition states for `sp->sp`,
         | `sp->wd`, `wd->wd`, and `wd->sp`. It'd also need error modes. I
         | suppose it doesn't contain states for escapes?
        
         | zadokshi wrote:
         | > why this is better than just iterating over characters?
         | 
         | The explanation it provides regarding Apache vs Nginx seemed to
         | me to imply that Apache requires memory allocation to read
         | headers into buffers and Nginx uses a state machine to avoid
         | memory allocation. Memory allocation is slower than most people
         | realise.
        
           | loeg wrote:
           | Why is this better for wc, though? wc is not doing memory
           | allocation in the hot loop.
        
             | detourdog wrote:
             | I think it keeps wc simple: if an algorithm is introduced
             | that is not directly related to counting words, it becomes
             | more complicated to debug. I see this as an example of the
             | Unix philosophy of doing one thing well.
        
             | jxndnxu wrote:
             | Because classic wc is not iterating over every byte once,
             | but multiple times.
             | 
             | It's especially obvious in the Unicode case where it first
             | takes 1-4 bytes to get a Unicode character and then checks
             | this character with another function to see if it's
             | whitespace
             | 
             | But even with the naive ASCII approach, if you don't hand
             | roll a state machine you are checking multiple conditions
             | on each byte (is it a space and am I leaving a word etc)
             | 
             | Using a DFA gives fixed compute per byte.
        
               | kllrnohj wrote:
               | Those 1-4 bytes are sitting in a register the entire time
               | and thus basically free to read as often as you want,
               | though.
               | 
               | An actual sampled profile showing the two approaches
               | would be interesting. Naively it seems like it's just
               | because it has faster UTF8 handling and nothing to do
               | with being a state machine exactly
        
         | worewood wrote:
         | Writing text and writing code are two distinct abilities. It
         | is very rare to find a professional who possesses both.
        
         | James_K wrote:
         | You can figure it out by just looking at the table and output.
         | It prints the number of lines, then words, then characters.
         | Since the lines count is counts[1], you can conclude that when
         | a newline is encountered, the machine will transition to state
         | 1, hence why the newline is the only 2 entry in the character
         | table and all 2 entries in the state machine point to state 1.
         | From the character table, we can see that characters are split
         | between classes 0, 1, 2 which are word, space, and newline
         | characters.
         | 
         | To get the definitions of all other states, work back from
         | the fact that state 2 is the word count: it must be entered
         | when a word character (class 0) is hit after some
         | space/newline characters. Hence state 0 represents the machine
         | being in the middle of a sequence of whitespace, state 1 is
         | when the machine is in a sequences of newlines, state 2 is the
         | first letter of a word, and state 3 is any subsequent letter of
         | a word. You can then verify that the machine transitions to the
         | correct state from those by checking what it does in case of a
         | word, space, or newline character. But I'm not sure why the
         | table is 4x4. The fourth transition from every state is
         | unreachable since the symbols are all in [0, 2].
         | 
         | As for why the approach is better, I think it's because it
         | avoids branching. If you think about the length of a word, it's
         | quite short compared to a CPU pipeline, so on a branching
         | version of this you're going to be stuck spending most of your
         | execution time on branch misprediction.
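
         A hypothetical branching counter makes the contrast concrete:
         the `in_word` test below flips at every word boundary, which is
         exactly the kind of data-dependent branch a predictor handles
         poorly on typical text (this sketch is mine, not the article's
         code):

```c
#include <stddef.h>

/* Branching word counter (sketch): whether the branch is taken depends
 * on the input data itself, so every word boundary risks a
 * misprediction, unlike the table-lookup version. */
size_t count_words_branchy(const char *buf, size_t len) {
    size_t words = 0;
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        int space = (c == ' ' || c == '\t' || c == '\n');
        if (!space && !in_word) {  /* data-dependent branch */
            words++;
            in_word = 1;
        } else if (space) {
            in_word = 0;
        }
    }
    return words;
}
```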
        
           | Vvector wrote:
           | I've read recently (here on HN) that branch predictors are
           | right 99% of the time. Is that inaccurate?
        
             | kardos wrote:
             | It may be true in some cases, but not in general. I think
             | of it as an observation that for large swathes of code the
             | branch is predictable. Eg, we usually don't go down the
             | error checking codepath. Faced with flow control that
             | depends on random data, such as "is this random number
             | even", the branch predictor will have a hard time doing
             | better than 50%.
        
             | James_K wrote:
             | This is the 1% of cases. As I understand it, most branches
             | take the same path most of the time. Think of a loop for
             | instance. With purely statistical prediction (go whichever
             | direction the branch takes more often), loops will be
             | predicted with extreme accuracy and the result is a huge
             | number of correct branches. Then think about error handling
             | code which isn't triggered most of the time due to input
             | being valid. Between these two easy cases, I think you
             | cover the majority of branches in programs, which is why so
             | many of them are predictable.
        
             | drchickensalad wrote:
             | If you're like me, you need to stop thinking of it as "99%
             | of call sites" and start thinking of it as "99% of
             | iteration loops". I know it's super obvious on the face of
             | it, but the implication is that most call sites can
             | repeatedly fail but are totally overshadowed by insanely
             | high frequency hot loops that it almost always predicts
             | right (it's skewed, just like avg vs median)
        
           | tylerhou wrote:
           | There are fewer branches but there is now a data dependency
           | between loop iterations which makes each iteration slightly
           | slower (maybe 1-2 cycles additional latency per iteration).
           | 
           | Because newlines are relatively common and unpredictable, a
           | state machine is likely better. But on a long file with no
           | new lines and no spaces the branching one should be slightly
           | faster. (All reasoning from first principles; I have not done
           | any benchmarking!)
        
             | anamexis wrote:
             | The README covers the exact case you describe - the
             | `word.txt` benchmark is just a file with 93MB of the char
             | `x`. The state machine is still faster in this case.
        
           | samatman wrote:
           | > _But I'm not sure why the table is 4x4. The fourth
           | transition from every state is unreachable since the symbols
           | are all in [0, 2]._
           | 
           | I think the fourth transition represents illegal Unicode.
           | From there it stays in the illegal state until it hits legal
           | UTF-8, then goes back to counting.
        
             | James_K wrote:
             | That isn't relevant in the ASCII example, and the UTF-8 one
             | uses a 256x256 table.
        
         | sebastianmestre wrote:
         | Yeah, Sean Barrett has a good writeup about them and optimizing
         | them for branch-prediction, etc
         | 
         | It's framed in the context of programming-language
         | tokenization, but the principles are the same.
         | 
         | https://nothings.org/computer/lexing.html
        
         | joe_the_user wrote:
         | Hmm,
         | 
         | A project with all code working with special magic reminds me
         | of another project that got some attention a while back (XZ
         | lib - supposed improvements with packages containing clever
         | system back doors).
         | 
         | Edit: The code is likely harmless, but the more opaqueness
         | there is in a system, the easier it is for malevolent
         | opaqueness to hide.
        
       | rakoo wrote:
       | > This state machine approach always results in the same speed,
       | regardless of input.
       | 
       | This is strange to me, shouldn't it be O(N) with the size of the
       | input ? Or maybe in this benchmark the inputs are small enough
       | that the difference isn't measurable ?
        
         | VMG wrote:
         | he is talking in the context of same-sized files (counted in
         | bytes)
        
         | sebastianmestre wrote:
         | They meant that different characters don't change the
         | performance, unlike the Linux and MacOS versions, where
         | different characters go through different code paths.
         | 
         | But yes, a bigger file does take longer to process.
        
         | roywiggins wrote:
         | Speed as in bytes per second, regardless of which bytes you
         | give it.
        
       | fsckboy wrote:
       | while you are in there,
       | 
       |     wc --unicode
       |     wc --bom
        
       | aargh_aargh wrote:
       | I wanted to take a peek at the state diagram (ASCII version). But I
       | was feeling lazy, so apologies for having ChatGPT do the work for
       | me. I wasn't going for correctness and didn't check it. If you're
       | like me, here you go:
       | 
       | https://dreampuf.github.io/GraphvizOnline/#digraph%20StateMa...
       | 
       | digraph StateMachine {
       |     rankdir=LR;
       |     size="8,5";
       |     node [shape = circle];
       | 
       |     LookingForWord -> InsideWord [label="Non-space"];
       |     LookingForWord -> LookingForWord [label="Space"];
       |     LookingForWord -> Newline [label="Newline"];
       |     LookingForWord -> LookingForWord [label="Other"];
       | 
       |     Newline -> InsideWord [label="Non-space"];
       |     Newline -> LookingForWord [label="Space"];
       |     Newline -> Newline [label="Newline"];
       |     Newline -> LookingForWord [label="Other"];
       | 
       |     InsideWord -> ContinueWord [label="Non-space"];
       |     InsideWord -> LookingForWord [label="Space"];
       |     InsideWord -> Newline [label="Newline"];
       |     InsideWord -> LookingForWord [label="Other"];
       | 
       |     ContinueWord -> ContinueWord [label="Non-space"];
       |     ContinueWord -> LookingForWord [label="Space"];
       |     ContinueWord -> Newline [label="Newline"];
       |     ContinueWord -> LookingForWord [label="Other"];
       | }
       | 
       | EDIT: Under peer pressure, I checked it and it correctly reflects
       | the code apart from being designed for one specific line ending
       | sequence (as it should, being an example optimized for brevity).
       | 
       | As for the replies, as opposed to my approach, I'm sure when
       | you're browsing research papers, you're doing a full
       | reproducibility study for each one. I'm sorry I commented, I
       | should have waited for you to post your results.
        
         | bruce343434 wrote:
         | This is a useless contribution to the discussion, maybe even
         | negative... Why didn't you check it at least before sharing?
        
           | aftbit wrote:
           | Why? Because it's wrong? If it were correct, wouldn't it be
           | useful?
        
             | roywiggins wrote:
             | This is a bit like presenting someone with a raw egg on a
             | plate and saying "ah, but if I _had_ cooked it, might it
             | not be a tasty omelet?" Correctness is the hard part!
        
               | arp242 wrote:
               | More like a covered plate you're expected to eat without
               | looking. "This may be an amazing omelet and you will have
               | a great meal, or you may eat a raw egg, get salmonella,
               | and be sick for a week. It's probably an omelet, but I'm
               | not sure, so you'll probably be fine. Enjoy!"
        
         | roywiggins wrote:
         | Surely correctness is the most important thing? Anyone can
         | produce an _incorrect_ state diagram.
        
       | mananaysiempre wrote:
       | > This is analogous to NFA and DFA regular-expressions. If you
       | use the NFA approach, you need to buffer the entire chunk of
       | data, so that the regex can backtrack. Using the DFA approach,
       | input can be provided as a stream
       | 
       | Sorry, no, this is completely wrong. There are other reasons NFAs
       | or DFAs might be preferable, and there are other performance
       | considerations as far as backtracking is concerned, but NFA
       | implementations do not backtrack. I mean, they could, but they
       | don't, and in common usage a "NFA regex implementation" means a
       | non-backtracking one (that goes through all input once with a bag
       | of states in hand and advances each one of them on each step),
       | whereas a "backtracking regex implementation" means one that
       | doesn't use automata at all (and is usually worst-case
       | exponential in the input).
       | 
       | What this is actually talking about is a streaming parser.
       | Just call it what it is; it's an OK term that sees plenty of
       | usage.
       | 
       | [What you _could_ mention here is PEG parsers, because even
       | linear PEG implementations may need to keep O(input) state and
       | thus are essentially incapable of streaming, unlike LL or LR
       | ones; and GLR et al. can be even worse, for what it's worth. But
       | this is a much more obscure topic.]
        
         | veltas wrote:
         | How do you implement a non-backtracking NFA other than
         | converting it to a DFA?
        
           | burntsushi wrote:
           | At search time, the principal difference between an NFA and a
           | DFA is that, in an NFA, you can be in more than one state at
           | the same time. There's no problem with implementing your
           | transition function to advance all states you are in
           | simultaneously for each character of input. You'll get `O(m *
           | n)` time, where `m` is proportional to the number of states
           | in the NFA and `n` is proportional to the length of the
           | input.
           | 
           | The GP is remarking on a confusable term, "NFA algorithm,"
           | that has come to mean something different from its underlying
           | theory. The PCRE2 documentation itself, for example[1],
           | remarks about how it uses the "NFA algorithm" as its standard
           | approach to matching. But of course, PCRE2 supports _way_
           | more than the theoretical model of non-deterministic finite
           | automata (NFA) actually supports. PCRE2 is much more
           | expressive. Yet, the implementation approach is still called
           | "NFA algorithm." Presumably the reason for this is that an
           | NFA _can_ be simulated by doing backtracking, but if you
           | don't keep track of where you've visited, the worst-case
           | runtime of this is superlinear, possibly catastrophically
           | so[2]. The
           | problem is that the backtracking regex engines have evolved
           | well beyond what an NFA actually supports. But the name
           | stuck.
           | 
           | To make things even more confusing, as I mentioned above, an
           | NFA can be used to execute a search in O(m * n) time. (We
           | typically call that "linear in the size of the input" since
           | the NFA itself can usually, but not always, be regarded as a
           | constant factor. i.e., A literal regex string in your program
           | source code.) So when you say something like, "a finite
           | automata regex engine uses NFAs to guarantee linear time,"
           | you might be _really_ confused if your background is in
           | backtracking engines which... also use an "NFA," but of
           | course cannot guarantee linear search time.
           | 
           | So what we have is ambiguous terminology, not unlike "regex"
           | itself, which could perhaps mean a real theoretical "regular
           | expression" (in the case of Hyperscan, RE2, Go's regexp
           | package and Rust's regex crate) or it could mean "a very
           | expressive DSL for pattern matching that supports more than
           | what regular languages can describe" (in the case of PCRE2,
           | ECMAScript's regex engine, Ruby's Oniguruma, Python's `re`
           | module, and more). I've made my peace with "regex" being
           | ambiguous, but the fact that "NFA algorithm" is itself
           | ambiguous despite "non-deterministic finite automata" being a
           | somewhat baroque and specific technical term, is very
           | unfortunate.
           | 
           | To make things even worse---and I shit you not, I am not
           | joking---the backtracking people seem to have taken to
           | calling a linear time NFA search the "DFA algorithm."[3] (At
           | least the PCRE2 docs mention it's not a "traditional" finite
           | state machine...) But... it's not a DFA. The backtracking
           | folks have effectively elevated the terms "NFA" and "DFA" to
           | mean something very different from their theoretical
           | underpinnings.
           | 
           | So when you have people that are only aware of one meaning of
           | "NFA algorithm" talking to other people only aware of the
           | other meaning of "NFA algorithm," chaos ensues.
           | 
           | [1]: https://www.pcre.org/current/doc/html/pcre2matching.html
           | 
           | [2]: https://github.com/BurntSushi/rebar?tab=readme-ov-
           | file#cloud...
           | 
           | [3]: https://www.pcre.org/current/doc/html/pcre2matching.html
           | #TOC...
        
             | fanf2 wrote:
             | Yeah, it's vexing. This backtracking / NFA / DFA confusion
             | originally came from Jeffrey Friedl's book "mastering
             | regular expressions" (O'Reilly, 1997) -- at least, that's
             | the first place I saw the mistake in print. Philip Hazel
             | relied a lot on Friedl's book when writing PCRE, which is
             | why the PCRE docs get it wrong. (I spoke to Philip about
             | this many years ago when we worked together.)
        
               | burntsushi wrote:
               | Yeah I didn't mention Friedl because I've dumped on him
               | in the past, and didn't want to belabor it. But Friedl
               | definitely didn't start this. According to him, he was
               | just using what was common vernacular even back then:
               | http://regex.info/blog/2006-09-15/248 (Note that blog is
               | from 2006, but he's actually quoting himself from ~10
               | years prior to that in 1997.)
               | 
               | But... he is the one who ultimately installed this
               | ambiguity into a classic book that is still read to this
               | day. So I think he can at least be blamed for
               | popularizing the confusion. And especially since the
               | entire book is framed around "NFA" and "DFA" regex
               | engines. It's not like it was just some one-off mention.
               | The mix-up is baked into the conceptual fabric of the
               | book. The landscape also looked different back then. It
               | predated RE2 for example, and RE2 is, I think,
               | principally responsible for bringing "backtracking
               | semantics" to finite automata oriented engines. So
               | Friedl's book is forever stuck in a false dichotomy that
               | only _happened_ to exist back when he wrote it.
        
           | loeg wrote:
           | As GP said, you can keep a bag of current states.
        
           | mananaysiempre wrote:
           | Assuming each state transition consumes a character
           | ("epsilon-free NFA"), just follow the definition. Before each
           | step, you have a list of NFA states you might be in,
           | initially containing only the start state. During each step,
           | go through each state on that list and enter any acceptable
           | successors into a new one. Deduplicate and repeat.
           | 
           | This is (potentially super)linear in the number of states,
           | but that doesn't preclude it from being linear in the input
           | length. In particular, even though it's essentially a bad
           | lazy implementation of standard NFA-to-DFA conversion, you're
           | avoiding the exponential state blowup you get in the eager
           | version. (I say it's a bad lazy implementation because a
           | proper one would memoize; that, however, would bring the
           | worst-case blowup back.) I think that, in addition to the
           | fact that people do actually do this kind of thing in
           | production implementations, qualifies it as a separate
           | approach.
           | 
           | If your NFA does have epsilon-transitions--like, say, the
           | NFAs obtained from the textbook Thompson regexp-to-NFA
           | construction--you can either use a different (Glushkov)
           | regexp-to-NFA construction[1], eliminate those transitions as
           | a separate pass[2], or adjust the algorithm above to
           | eliminate duplicate NFA states on the fly[3].
           | 
           | [1]: https://en.wikipedia.org/wiki/Glushkov%27s_construction_
           | algo...
           | 
           | [2]: https://news.ycombinator.com/item?id=19272990
           | 
           | [3]: https://swtch.com/~rsc/regexp/regexp2.html
        
       | bruce343434 wrote:
       | What about this is "asynchronous"?
        
         | roelschroeven wrote:
         | I don't get that either.
         | 
         | It feels like they compare the asynchronous-ness of their
         | algorithm to Nginx being asynchronous ("asynchronous web-
         | servers like Nginx use a state-machine parser. They parse the
         | bytes as they arrive, and discard them."), but I don't see how
         | that relates. The way web-servers handle requests (multiple
         | threads vs multiple processes vs asynchronous event-driven in
         | one thread) is completely orthogonal to how they parse headers.
         | 
         | The impression I have is that their algorithm works in a
         | streaming way, without having to allocate memory buffers, and
         | that they call that asynchronous (wrongly, as far as I can
         | see).
         | 
         | I also don't really see what they mean by their algorithm being
         | more scalable.
        
           | laurent_du wrote:
           | My understanding was that scalability came from the removal
           | of memory buffers.
        
       | smaudet wrote:
       | I was a bit surprised by the macos/linux speed differential here:
       | 
       | > The difference in macOS and Linux speed is actually the
       | difference in clang and gcc speed. The LLVM clang compiler is
       | doing better optimizations for x86 processors here.
       | 
       | I know GCC has been getting much, much more aggressive as of
       | late, but there have been some complaints that it is now much,
       | much less safe (checks intended for security getting optimized
       | away, e.g.).
       | 
       | I wonder if you were to go back to 5 years ago, if you'd see the
       | linux code was generally safer than the speedy osx llvm code...
        
         | kstrauser wrote:
         | I'd assume the difference between Linux's GNU and macOS's BSD
         | implementations would be much more significant.
        
           | pjdesno wrote:
           | Exactly. MacOS (really BSD) grep sucks, 5-10x slower than GNU
           | Grep for simple searches on the same hardware, but MacOS/BSD
           | wc is fast.
        
       | aftbit wrote:
       | It would have been nice for the author to label the states and
       | explain how and why they chose the states that they did.
       | Otherwise, it's hard to understand how to apply this sort of
       | thing to larger problems.
        
         | jxndnxu wrote:
         | How would it have helped? You either design a DFA by hand or
         | use one compiled from a regular language
        
       | cryptos wrote:
       | Now we need another round of the language battle, because now
       | this approach needs to be implemented in Rust, Go, Zig, Nim ...
       | you name it!
        
       | rep_movsd wrote:
       | Remember "The Zen of Code Optimization" by Michael Abrash in the
       | 90s?
       | 
       | This word count challenge was the first one he described. The
       | table-driven approach was discussed, but was not necessarily
       | the fastest, because the large table needed would overflow the
       | cache on 90s processors.
       | 
       | I used the same example in a chapter I wrote for a C++ book
        
       | moody__ wrote:
       | The "asynchronous state machine" name here is a bit strange, when
       | searching for this term used elsewhere I couldn't find any formal
       | definition what it is. Reading further in the README it looks
       | like the author implies that it really just means a DFA? Not
       | entirely sure.
       | 
       | I'd also like to add the Plan 9 implementation[0], which also
       | uses the properties of utf8 as part of its state machine and
       | anecdotally has been quite performant when I've used it.
       | 
       | [0]
       | http://git.9front.org/plan9front/plan9front/107a7ba9717429ae...
        
         | rhelz wrote:
         | "Asynchronous" isn't part of the name of some really cool state
         | machine :-) It's just an adjective and means the same as when
         | you put it in front of any other noun.
         | 
         | A _synchronous_ state machine is one where the incoming stream
         | of events is always  "in sync" with the state transitions, in
         | the following sense:
         | 
         | 1. When an event happens, the state machine can transition to
         | the next state and perform any necessary actions _before_ the
         | next event happens
         | 
         | 2. After the state machine has transitioned and performed any
         | necessary actions, the program must wait for the next event
         | to happen. It can't do anything else until it does.
         | 
         | An _asynchronous_ state machine doesn't make the main program
         | wait until the next event happens. It can go on and do other
         | things in the meantime. It doesn't have to wait for the next
         | event to arrive.
         | 
         | When the next event _does_ arrive, the program pauses
         | whatever else it is doing, hands control to the state
         | machine, which processes the event, and then hands control
         | back to the main program.
        
           | PaulHoule wrote:
           | As I see it, state machines are particularly good for
           | expressing logic in asynchronous systems. For instance in the
           | late 1980s I wrote assembly language XMODEM implementations
           | for the 6809 and the 80286, and since that kind of code is
           | interrupt driven, it is efficient to make a state machine that
           | processes one character at a time. Today when you use
           | async/await the compiler converts your code, loops and all,
           | into a state machine.
        
             | rhelz wrote:
             | // 6809
             | 
             | The good old days :-)
        
           | moody__ wrote:
           | I was not treating "asynchronous state machine" as a noun;
           | even taken as a generic adjective, it doesn't make sense in
           | this context. What "other things" is this wc2.c doing while
           | the state machine is churning? There is no multi threading
           | or multi processing going on here. So I find it hard to see
           | how this use of "asynchronous" fits the way the word is
           | generally used. As such I thought perhaps it
           | referred to a specific methodology for designing the code,
           | something akin to how the "lock free" adjective implies a
           | certain design sensibility.
        
             | rhelz wrote:
             | // What "other things" is this wc2.c doing...
             | 
             | AFAICT, wc2.c isn't written to be an asynchronous state
             | machine. It doesn't ever seem to transfer the control to
             | any other place.
             | 
             | // So I find it hard to believe that this use of
             | "asynchronous" is inside of what I would generally see it
             | used as
             | 
             | Yeah, you are legitimately confused. The post talks about
             | asynchronous state machines, but w2c.c isn't an example of
             | that. I'm sure this gave you a severe case of WTF?!??
             | 
             | // thought perhaps it referred to a specific methodology
             | for designing the code
             | 
             | It does---that's exactly what it is, a programming
             | methodology, or perhaps better put, a design pattern. But
             | w2c.c isn't an example of code written using that
             | methodology. Again, you are legitimately confused here,
             | because the post talks about something and w2c.c isn't
             | that.
             | 
             | Do you know python? If you google for "asynchronous
             | programming in python" you'll get all kinds of blog posts
             | and youtube videos which explain the technique.
        
       | JonChesterfield wrote:
       | Recognising that wc is a lexer problem and implementing it using
       | re2c would be my first choice here. That'll be building a similar
       | state machine to what is done by hand here, especially since the
       | hand rolled one is still byte at a time (the main drawback to
       | re2c's generated output).
        
       | GrantMoyer wrote:
       | > The algorithm is known as an "asynchronous state-machine
       | parser". It's a technique for parsing that you don't learn in
       | college.
       | 
       | The mapping from regular and context free languages to state
       | machines and their table-based implementations was covered in
       | depth in the Compilers course I took. A table based state machine
       | is literally the textbook algorithm for parsing these classes of
       | languages, and implementing them has been almost entirely
       | automated by tools like Flex and Bison since the 80s.
        
         | mcguire wrote:
         | For example, see Chapter 3 of the Dragon Book, _Compilers:
         | Principles, Techniques, and Tools_ by Aho, Ullman, et al. Or
         | probably any other compilers text book.
        
         | mywittyname wrote:
         | Compilers was a graduate course at my school.
         | 
         | CS as a field has grown quite a lot over the years, and it
         | seems that colleges are responding by pushing the more
         | theoretical classes into grad school to make room for the more
         | applied classes.
        
       | throwaway55533 wrote:
       | From reading the article, how is this more efficient? Doesn't any
       | word counting algorithm have to iterate through all the
       | characters and count spaces?
       | 
       | What makes this better than the standard algorithm of:
       | 
       |     wc = 0
       |     prev = null
       |     for ch in charStream:
       |         if !isSpace(ch) and isSpace(prev):
       |             wc += 1
       |         prev = ch
        
         | GrantMoyer wrote:
         | Basically, it's just a lookup table for isSpace, rather than
         | whatever logic is in the original function (which probably has
         | conditional branches).
         | 
         | There's a little bit of a complication in the fact that a state
         | machine can implicitly share parts of the computation for utf-8
         | encoded codepoints with shared prefixes, so instead of a
         | 2^32-element lookup table, you only need 4 2^8-element lookup
         | tables (and instead of 1 state, parsing codepoint, you have 4,
         | parsing first byte, parsing second byte, etc.).
        
           | throwaway55533 wrote:
           | Ah. Is 8 bits really optimal? I don't know how many UTF space
           | characters there are, I thought there were only a few. Why
           | not a 16-bit lookup?
        
             | GrantMoyer wrote:
             | UTF-8 encoded codepoints can have an odd number of bytes,
             | so processing a file 2 bytes at a time would be a little
             | more complicated. Processing the file one codepoint at a
             | time works because you can decode the UTF-8 stream into a
             | stream of 32-bit codepoints before passing the codepoints
             | to the lookup table. I suppose you could also transform
             | from UTF-8 to UTF-16 before passing values to the lookup
             | table.
             | 
             | Processing byte by byte isn't necessarily faster than
             | processing codepoint by codepoint, or any other size. You'd
             | need to measure performance empirically, and it probably
             | depends on cache sizes and other factors. In theory, you
             | could also process bit by bit -- then you'd only need 32
             | 2-element lookup tables -- but that's unlikely to be
             | efficient, since you'd need to do a lot of bit
             | manipulation.
             | 
             | Edit: Upon inspection, the method I described doesn't
             | appear to be the method used by the featured program. It
             | still basically uses a lookup table for detecting space
             | characters, byte by byte, but the states are not how I
             | described. Instead of the states representing which byte of
             | a UTF-8 encoded codepoint is being processed, and the word
             | count being incremented upon certain transitions -- a Mealy
             | machine -- the state represents the class of the codepoint
             | last fully processed, and the count is always increased
             | based on the current state (often by zero) -- a Moore
             | machine.
        
         | jxndnxu wrote:
         | You have answered your question yourself: your algorithm looks
         | at each byte twice, not once
         | 
         | It's even more obvious in the UTF case where the classic
         | implementation first looks at 1-4 bytes to parse a character and
         | only then checks if it's a space
        
       | JackSlateur wrote:
       | Reading the code to help me understand how things are
       | 
       | I got the wc_lines from coreutils:
       | https://github.com/coreutils/coreutils/blob/master/src/wc.c#...
       | 
       | And I thought "damn, I understand nothing about the
       | state-machine stuff, how did they make this faster?"
       | 
       | Truth is: they did not. Of course, this is just the "count
       | lines" part. Other parts are indeed faster.
       | 
       | Coreutils 9.4:
       | 
       |     0.67 [jack:/tmp] /usr/bin/time -v wc -l debian-12.2.0-amd64-netinst.iso
       |     2524855 debian-12.2.0-amd64-netinst.iso
       |     Command being timed: "wc -l debian-12.2.0-amd64-netinst.iso"
       |     User time (seconds): 0.00
       |     System time (seconds): 0.07
       |     Percent of CPU this job got: 98%
       |     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
       |     Average shared text size (kbytes): 0
       |     Average unshared data size (kbytes): 0
       |     Average stack size (kbytes): 0
       |     Average total size (kbytes): 0
       |     Maximum resident set size (kbytes): 2176
       |     Average resident set size (kbytes): 0
       |     Major (requiring I/O) page faults: 0
       |     Minor (reclaiming a frame) page faults: 99
       |     Voluntary context switches: 1
       |     Involuntary context switches: 0
       |     Swaps: 0
       |     File system inputs: 0
       |     File system outputs: 0
       |     Socket messages sent: 0
       |     Socket messages received: 0
       |     Signals delivered: 0
       |     Page size (bytes): 4096
       |     Exit status: 0
       | 
       | And wc2 (from gcc -O3 -o wc2 wc2.c):
       | 
       |     0.47 [jack:/tmp] /usr/bin/time -v ./wc2 -l debian-12.2.0-amd64-netinst.iso
       |     2524855 debian-12.2.0-amd64-netinst.iso
       |     Command being timed: "./wc2 -l debian-12.2.0-amd64-netinst.iso"
       |     User time (seconds): 0.97
       |     System time (seconds): 0.05
       |     Percent of CPU this job got: 99%
       |     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
       |     Average shared text size (kbytes): 0
       |     Average unshared data size (kbytes): 0
       |     Average stack size (kbytes): 0
       |     Average total size (kbytes): 0
       |     Maximum resident set size (kbytes): 2176
       |     Average resident set size (kbytes): 0
       |     Major (requiring I/O) page faults: 0
       |     Minor (reclaiming a frame) page faults: 146
       |     Voluntary context switches: 1
       |     Involuntary context switches: 1
       |     Swaps: 0
       |     File system inputs: 0
       |     File system outputs: 0
       |     Socket messages sent: 0
       |     Socket messages received: 0
       |     Signals delivered: 0
       |     Page size (bytes): 4096
       |     Exit status: 0
       | 
       | Gotta keep reading
        
         | orf wrote:
         | Any notable difference if you pipe the file in, rather than
         | read from disk?
         | 
         | IO caches are a thing as well, don't use the first measurement.
        
           | JackSlateur wrote:
           | I've accounted for IO cache, the test is running from an in-
           | memory dataset
           | 
           | After checking a bit more, wc2 is indeed faster than wc in a
           | specific use case: counting characters. That is, transforming
           | a sequence of one or multiple bytes into meaningful characters
           | (wc --chars somefile).
           | 
           | In wc, the related code is here: https://github.com/coreutils
           | /coreutils/blob/master/src/wc.c#...
           | 
           | So in the end, wc uses the function while wc2 does not:
           | https://en.cppreference.com/w/c/string/multibyte/mbrtoc32
           | 
           | Now, what I cannot tell : is wc2's code really comparable
           | versus mbrtoc32 ? From a couple of tests, it works against
           | ascii and utf8 dataset, but is that all ? Or are there
           | edgecases cleverly handled by mbrtoc32 (with a performance
           | cost) ?
        
         | roetlich wrote:
         | From looking at the code, the clever bit is how this is doing
         | word count (and word & line count at the same time). So yeah,
         | it makes sense that -l isn't better.
        
       | NunoSempere wrote:
       | From a previous project:
       | https://nunosempere.com/blog/2023/09/15/wc/ that was looking not
       | at speed, but at simplicity, here is a comparison of different
       | historical versions of wc:
       | 
       | ---
       | 
       | The [version of wc.c](https://git.nunosempere.com/personal/wc/src
       | /branch/master/sr...) in this repository sits at 44 lines. It
       | decides to read from stdin if the number of arguments fed to it
       | is otherwise zero, and uses the standard C function getc to read
       | character by character. It doesn't have flags, instead, there are
       | further utilities in the src/extra/ folder for counting
       | characters and lines, sitting at 32 and 35 lines of code,
       | respectively. This version also has little error checking.
       | 
       | [Here](https://github.com/dspinellis/unix-history-
       | repo/blob/Researc...) is a version of wc from UNIX V7, at 86
       | lines. It allows for counting characters, words and lines. I
       | couldn't find a version in UNIX V6, so I'm guessing this is one
       | of the earliest versions of this program. It decides to read from
       | stdin if the number of arguments fed to it is zero, and reads
       | character by character using the standard C getc function.
       | 
       | The busybox version ([git.busybox.net](https://git.busybox.net/bu
       | sybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with
       | comments stripped), while striving to be [POSIX-
       | compliant](https://pubs.opengroup.org/onlinepubs/9699919799/),
       | meaning it has a fair number of flags and a bit of complexity. It
       | reads character by character by using the standard getc function,
       | and decides to read from stdin or not using its own
       | fopen_or_warn_stdin function. It uses two GOTOs to get around,
       | and has some incomplete Unicode support.
       | 
       | The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version
       | implements some sort of table method in 331 lines. It uses plan9
       | rather than Unix libraries and methods, and seems to read from
       | stdin if the number of args is 0.
       | 
       | The plan9port version of wc ([github](https://github.com/9fans/pl
       | an9port/blob/master/src/cmd/wc.c)) also implements some sort of
       | table method, in 352 lines. It reads from stdin if the number of
       | args is 0, and uses the read system call to read character by
       | character.
       | 
       | The [OpenBSD](https://github.com/openbsd/src/blob/master/usr.bin/
       | wc/wc.c) version is just *nice*. It reads from stdin by default,
       | and uses a bit of buffering using read to speed things up. It
       | defaults to using fstat when counting characters. It is generally
       | understandable and pleasant to read; I'm actually surprised at
       | how readable it is.
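       | 
       | A hedged sketch of that buffered fast path (simplified, and the
       | function and buffer names are mine, not OpenBSD's): read large
       | blocks with read(2) and let memchr scan them, rather than paying
       | a function call per character:

```c
#include <unistd.h>
#include <string.h>

/* Count newlines by reading big blocks and scanning them with
 * memchr, in the spirit of a buffered read(2) fast path. */
static long count_lines_fd(int fd)
{
    char buf[65536];
    long lines = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        const char *p = buf, *end = buf + n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;   /* resume scanning after the newline just found */
        }
    }
    return lines;
}
```

       | The win comes from the libc memchr, which is typically vectorized,
       | doing the per-byte work instead of an explicit loop.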
       | 
       | The [FreeBSD
       | version](https://cgit.freebsd.org/src/tree/usr.bin/wc/wc.c) sits
       | at 367 lines. It has enough new things that I can't parse all
       | that it's doing (in lines 137-143, what is capabilities mode?
       | what is Casper?), but otherwise it decides whether to read from
       | stdin by the number of arguments, in line 157. It uses a
       | combination of fstat and read, depending on the type of file.
       | 
       | Finally, the GNU utils version ([github](https://github.com/coreu
       | tils/coreutils/tree/master/src/wc.c), [savannah](http://git.savan
       | nah.gnu.org/gitweb/?p=coreutils.git;a=blob;f...)) is a bit over
       | 1K lines of C. It does many things and checks many possible
       | failure modes. I think it detects whether it should be reading
       | from stdin using some very wrapped fstat, and it reads character
       | by character using its own custom function.
       | 
       | So this utility started out reasonably small, then started
       | getting more and more complex. [The POSIX
       | committee](https://pubs.opengroup.org/onlinepubs/9699919799/)
       | ended up codifying that complexity, and now we are stuck with it
       | because even implementations like busybox, which strive to be
       | quite small, try to keep to POSIX.
        
         | trealira wrote:
         | There's another version from Unix V10: http://tuhs.org/cgi-
         | bin/utree.pl?file=V10/cmd/wc.c
         | 
         | Also, from V8: https://www.tuhs.org/cgi-
         | bin/utree.pl?file=V8/usr/src/cmd/wc...
        
       | btown wrote:
       | State machines are great for complex situations, but when it
       | comes to performance, it's not at all clear to me that they're
       | the most scalable approach with modern systems.
       | 
       | The data dependency between a loop iteration for each character
       | _might_ be pipelined really well when executed, and we can assume
       | large enough L1/L2 caches for our lookup tables. But we're still
       | using at least one lookup per character.
       | 
       | Projects like https://github.com/simdjson/simdjson?tab=readme-ov-
       | file#abou... are truly fascinating, because they're based on SIMD
       | instructions that can process 64 or more bytes with a single
       | instruction. Very much worth checking out the papers at that
       | link.
        
         | jxndnxu wrote:
         | My understanding of the article's use of scalable was "fixed
         | overhead more or less regardless of the complexity of the state
          | machine and input", not "fastest implementation available".
        
       | apitman wrote:
       | I didn't know you were allowed to use a language other than Rust
       | for something like this.
        
       | nitwit005 wrote:
       | The primary reason their code is faster is that they've made
       | incorrect code that assumes you always have a UTF-8 locale. The
       | normal isspace does not assume this:
       | 
       | > The real programs spend most of their time in functions like
       | mbrtowc() to parse multi-byte characters and iswspace() to test
       | if they are spaces
       | 
       | This will produce bad results in the real world. People have
       | previously posted about bugs related to this on Hacker News:
       | https://news.ycombinator.com/item?id=36216389
        
         | scottlamb wrote:
         | Are we looking at the same thread? Folks there seem to be
         | complaining the old interfaces are anything from outdated to
         | confusing to simply wrong, and I agree.
         | 
         | I think it's totally reasonable for a program designed in 2024
         | to say it only supports ASCII and UTF-8 encodings. Whether/how
         | it should support the full spectrum of Unicode definitions of
         | characters/graphemes/... is more interesting. For a lot of
         | backend code (processing text-based network protocols, for
         | example), focusing on ASCII is arguably best (e.g. functions
         | like isspace and isdigit only returning true for ASCII
         | characters). For more user-focused stuff, most of the world
         | would say they'd like their native language supported well.
         | Programs written with both use cases in mind should probably
         | have a switch. (Or parallel APIs: e.g. Rust has
         | {u8,char}::is_ascii_digit vs. char::is_numeric, which makes
         | much more sense there than having one that switches behaviors
         | based on an environment variable, as the correct behavior
         | really depends on the call site.)
         | 
         | Of course "say it only supports ASCII and UTF-8 encodings" is
         | mutually exclusive with claiming to be a drop-in replacement
         | for a utility that is specified as having POSIX locale support.
         | This project does not make that claim, or even that it's
         | intended to be actually used rather than illustrative of a
         | performance technique.
        
         | leni536 wrote:
         | It would be just as fast if it detected the locale correctly,
         | and used the UTF8 state machine for the UTF8 locale only. Other
         | locales could have their own state machines or just fall back
         | to a generic implementation.
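          | 
          | A hedged, ASCII-only sketch of that table approach (all the
          | names here are mine, and the real wc2 tables also fold UTF-8
          | decoding into the states): the inner loop does nothing but
          | one table lookup per byte, and words are counted on
          | transitions into the in-word state:

```c
#include <stddef.h>

enum { OUTSIDE = 0, INSIDE = 1 };   /* outside/inside a word */

/* next[state][byte]: a 2-state DFA over raw bytes. ASCII-only
 * illustration; a full version would add states for UTF-8 sequences,
 * and a different locale would simply get a different table. */
static long count_words_dfa(const unsigned char *buf, size_t len)
{
    unsigned char next[2][256];
    for (int b = 0; b < 256; b++) {
        int sp = (b == ' ' || b == '\t' || b == '\n' ||
                  b == '\r' || b == '\v' || b == '\f');
        next[OUTSIDE][b] = sp ? OUTSIDE : INSIDE;
        next[INSIDE][b]  = sp ? OUTSIDE : INSIDE;
    }

    long words = 0;
    int state = OUTSIDE;
    for (size_t i = 0; i < len; i++) {
        int prev = state;
        state = next[state][buf[i]];   /* one lookup per byte */
        words += (prev == OUTSIDE && state == INSIDE);
    }
    return words;
}
```

          | Selecting the table once, at startup, from the detected locale
          | would keep the per-byte loop identical regardless of encoding.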
        
       | 1vuio0pswjnm7 wrote:
       | "The wc program included with macOS and Linux are completely
       | different. Therefore, the following table shows them benchmarked
       | against each other on the same hardware."
       | 
       | Appears BSD wc ("the macOS program") is much faster than the GNU
       | coreutils wc ("the Linux program").
       | 
       | It certainly illustrates a point about people rewriting wc in
       | newer languages to demonstrate alleged speed against C.
       | 
       | Which wc are they using for comparison?
       | 
       | Perhaps they should use wc2 or wc2o.
        
       ___________________________________________________________________
       (page generated 2024-06-20 23:00 UTC)