[HN Gopher] Wc2: Investigates optimizing 'wc', the Unix word cou...
___________________________________________________________________
Wc2: Investigates optimizing 'wc', the Unix word count program
Author : PaulHoule
Score : 147 points
Date : 2024-06-20 13:54 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lesuorac wrote:
| State machines are an underrated approach.
|
| Just remember, if you're ever debugging something and some values
| in a Class/Object/Group of them are set and others are unset, the
| state machine avoids that issue. When you go to transition
| between states, check that the future state will be valid and if
| not, go to an error state with that information. The amount of
| time I've seen spent debugging where spurious `null`s came from
| completely dwarfs the time it would've taken just to write that
| class/algorithm as a state machine.
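|
| A minimal sketch of that validate-before-transition idea in C
| (the states and tables here are made up for illustration):
|
|       #include <stdio.h>
|
|       enum state { IDLE, RUNNING, DONE, ERR, NSTATES };
|
|       /* valid[from][to] != 0 means the transition is allowed */
|       static const unsigned char valid[NSTATES][NSTATES] = {
|           /*            IDLE RUN  DONE ERR */
|           /* IDLE    */ { 0,  1,   0,   1 },
|           /* RUNNING */ { 0,  0,   1,   1 },
|           /* DONE    */ { 1,  0,   0,   1 },
|           /* ERR     */ { 0,  0,   0,   1 },
|       };
|
|       /* Divert to ERR (reporting the bad from/to pair for
|          debugging) instead of leaving the object half-updated. */
|       static enum state transition(enum state from, enum state to)
|       {
|           if (!valid[from][to]) {
|               fprintf(stderr, "invalid transition %d -> %d\n", from, to);
|               return ERR;
|           }
|           return to;
|       }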
|
| Kinda surprised WC wasn't a state machine to begin with.
| Isn't it effectively a special-character counter where if you
| see a character followed by a space you bump up the word count?
| I'm judging by the repo's comment of "The real programs spend
| most of their time in functions like mbrtowc() to parse
| multi-byte characters and iswspace() to test if they are spaces
| -- which re-implementations of wc skip." that the big
| improvement is removing the unnecessary work of those methods.
| mbrtowc [1] appears to re-create the provided substring, which
| isn't necessary for counting.
|
| [1]: https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc
| munchler wrote:
| > Just remember, if you're ever debugging something and some
| values in a Class/Object/Group of them are set and others are
| unset, the state machine avoids that issue.
|
| Better yet, switch to a language that supports discriminated
| unions to avoid invalid state without having to write a low-
| level state machine. Each case in the union represents one of
| the valid states of the type. This is one of the many benefits
| of functional programming.
| tialaramex wrote:
| This comes down to "Make Invalid States Unrepresentable"
|
| Notably Go's choice here is instead "Make Representable
| States Valid". So for example in Go it's not possible for a
| type not to have a default value - every type must have a
| value you get by default, you could name this unwanted
| default value FUCK_OFF or DO_NOT_USE if you like, but in a
| larger system (where Go is supposed to thrive) you can be
| certain you'll find FUCK_OFF and DO_NOT_USE values in the
| wild.
|
| This is one of the few things the C++ type system almost gets
| right. You really can say for your C++ type no, it doesn't
| have a default. Unfortunately exception safety means we can
| easily end up in a place where too bad it's in _some_ state
| anyway, but if we turn off exceptions or discount them we can
| live without nonsensical defaults.
| typon wrote:
| Maybe something like CppFront can allow metaprogramming
| such that the default constructor is deleted by default on
| all types and becomes opt-in.
| ekidd wrote:
| > _This is one of the few things the C++ type system almost
| gets right. You really can say for your C++ type no, it
| doesn't have a default._
|
| Rust does even better here, I think, for fairly subtle
| reasons:
|
| - Exceptions are replaced by Result, with explicitly-marked
| return points. They're far less magic. ("panic!" is still
| magic, but it's allowed to be a call to "abort", so you
| only use it when the world is burning.)
|
| - "Move semantics" by default makes a lot of edge cases
| cleaner. If you move out of an object by default, it gets
| consumed, not set to a default value. Similarly, moving
| values into a newly-constructed object can't fail, because
| it's just memmove.
|
| - Object construction is atomic from the programmer's
| perspective.
|
| - Default initialization only works if you implement
| "Default". Which is actually just a normal trait with no
| magic.
|
| - If you discover that you have a state machine, you can
| switch to discriminated unions ("enum") to make your states
| mutually exclusive and clearly represented.
|
| This whole area is one of the better parts of Rust's
| design.
| im3w1l wrote:
| I'm surprised there is no mention of simd. Like I'm sure this is
| "fast enough" but if you want to make a really fast wc for fun
| wouldn't that be a natural direction?
| thequux wrote:
| A hand-written SIMD wc is likely to be even faster than a state
| machine, but at the cost of orders of magnitude more work. The
| huge advantage of state machines is that they are a relatively
| generic approach that provides massive speedups nearly every
| time. A SIMD wc algorithm is a significant effort that can't be
| generalized, and is only likely to provide a 4x speedup or so:
| rarely worth it except in cases where the task is a performance
| hotspot for a _lot_ of applications.
| Joker_vD wrote:
| SIMD-based reimplementations of wc already exist, e.g. this
| [0] one, and it's rather straightforward.
|
| [0] https://github.com/expr-fi/fastlwc
| mananaysiempre wrote:
| An ASCII SIMD wc would likely be much more than 4x faster.
| The 320 MB/s speed TFA quotes is good for a byte-at-a-time
| solution, but pitiful as far as the capabilities of modern
| machines are concerned. A decent SSD is an order of magnitude
| faster than that. Around 1 GB/s (i.e. ~100 cycles/fetch) is
| table stakes for a task of this complexity.
|
| (TFA almost nerd-sniped me into writing a SIMD implementation
| already, and you finished the job. End result: on an aging
| 2.5 GHz Broadwell, I'm seeing 300 MB/s for TFA's C
| implementation and 3200 MB/s for my first attempt at an
| x86-64-v3 one. So more like a 10x speedup, and it's highly
| likely one can do better. Admittedly I did spend around an
| hour on these 50 lines, and would need to spend more in
| reality to check for CPU features at runtime and so on.)
|
| However, TFA calls going ASCII-only "cheating", and I can see
| its point. I don't know how difficult a Unicode SIMD wc would
| be or if it's even possible to do well. (Bonus points: do you
| want to define a word boundary as a space/nonspace boundary
| the way[1] POSIX tells you to, or the more complex way[2]
| Unicode tells you to?)
|
| [1] https://pubs.opengroup.org/onlinepubs/9699919799/utilitie
| s/w...
|
| [2] https://www.unicode.org/reports/tr29/#Word_Boundaries
| mananaysiempre wrote:
| The (cheating, ASCII-only) SIMD implementation for
| reference:
|
|       #if 0 /* shell polyglot */
|       exec ${CC:-cc} $CPPFLAGS -g -O2 -march=x86-64-v3 $CFLAGS -o "${0%.c}" "$0" || exit $?
|       #endif
|       /* SPDX-License-Identifier: CC0-1.0 */
|       #include <immintrin.h>
|       #include <stdint.h>
|       #include <stdio.h>
|       #include <unistd.h>
|
|       int main(int argc, char **argv)
|       {
|           const char *const progname = argc ? argv[0] : "<unknown>";
|           unsigned long long bytes = 0, words = 0, lines = 0;
|           uint32_t prev = -1;
|           for (;;) {
|               static __m256i buf[1024 * 1024];
|               ssize_t k, n = 0;
|               do {
|                   k = read(STDIN_FILENO, (char *)buf + n, sizeof buf - n);
|               } while (k > 0 && (n += k) < sizeof buf);
|               if (k < 0) { perror(progname); return 1; }
|               if (!n) break;
|               bytes += n;
|               while (n % sizeof *buf) ((char *)buf)[n++] = ' ';
|               n /= sizeof *buf;
|               for (size_t i = 0; i < n; i++) {
|                   __m256i x = _mm256_load_si256(&buf[i]);
|                   __m256i nl = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('\n'));
|                   lines += _mm_popcnt_u32(_mm256_movemask_epi8(nl));
|                   const __m256i table = _mm256_broadcastsi128_si256(_mm_setr_epi8(
|                       /* 0 */ ' ', 0, 0, 0, 0, 0, 0, 0,
|                       /* 8 */ 0, '\t', '\n', '\v', '\f', '\r', 0, 0));
|                   __m256i ws = _mm256_cmpeq_epi8(x, _mm256_shuffle_epi8(table,
|                       _mm256_and_si256(x, _mm256_set1_epi8(0x0F))));
|                   uint64_t mask = _mm256_movemask_epi8(ws);
|                   words += _mm_popcnt_u32(mask ^ (mask << 32 | prev) >> 31);
|                   prev = mask;
|               }
|           }
|           words += !(prev >> 31);
|           printf("%llu %llu %llu\n", lines, words / 2, bytes);
|           return 0;
|       }
| Galanwe wrote:
| I think this is but a detail in the context of the article.
|
| Yes, you can implement your state machine on SIMD sized states,
| but there are also millions of other optimizations you can do
| like that, aligning your reads, prefetching the file,
| parallelizing, etc.
|
| Doesn't change the core concept.
| mjamil wrote:
| "Many programmers think pointer-arithmetic is faster." Don't
| modern compilers make this statement false (i.e., both approaches
| are implemented via the same machine instructions)?
| IshKebab wrote:
| They definitely do in simple cases (e.g. a simple for loop). In
| more complex cases, especially when the index/pointer escapes a
| function then the code will be different.
| jimbobthrowawy wrote:
| I checked using godbolt, and when compiling with -O2 (like his
| makefile) only the parse_chunk_pp version is in the emitted
| assembly and is called regardless of the -P option. I think
| compilers have done this where applicable for a long time.
|
| Compiling without any arguments leads to two lines differing in
| the assembly for the functions.
| trealira wrote:
| Something funny is that GCC may compile your code that uses
| pointer arithmetic into one that uses an index register, and
| compile your code that uses an index into an array as pointer
| arithmetic.
|
| https://godbolt.org/z/9KPnnM7Po
| markstos wrote:
| Was this ever submitted as an update to the `wc` shipped on systems?
| canucker2016 wrote:
| I doubt it - from the project's readme page:
|
| "...Now the real wc works with a lot more character-sets, and
| we don't do that. But by implementing UTF-8, we've shown that
| it's possible, and that the speed for any character-set is the
| same."
|
| Looking at the command line options for wc2.c and comparing to
| the FSF wc man page, it looks like there are some
| missing/unsupported options in wc2.c as well.
| anon1040 wrote:
| > ...the real wc works with a lot more character-sets...
|
| This should read: the real wc works with a lot more character
| _encodings_
| YesThatTom2 wrote:
| I love state machines and every time I use one my coworkers think
| I invented it because they've never seen them before.
|
| The data for the state machine in this article might be best prepared
| by generating it from a program. That generator program doesn't
| need to care (too much) about performance since it is run during
| the build process. I like the idea of doing a lot of work now to
| save time in the future.
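|
| Something like this hypothetical generator (illustrative, not
| from the repo) captures the idea: emit the lookup table as a C
| header at build time, so only the generator pays the cost of
| deciding each byte's class.
|
|       /* gen_table.c: run during the build; its speed is irrelevant */
|       #include <stdio.h>
|
|       int main(void)
|       {
|           /* class 0 = word byte, 1 = space, 2 = newline */
|           printf("static const unsigned char char_class[256] = {\n");
|           for (int c = 0; c < 256; c++) {
|               int cls = (c == '\n') ? 2
|                       : (c == ' ' || c == '\t' || c == '\v' ||
|                          c == '\f' || c == '\r') ? 1 : 0;
|               printf("%d,%s", cls, (c % 16 == 15) ? "\n" : " ");
|           }
|           printf("};\n");
|           return 0;
|       }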
| pletnes wrote:
| Do you have any accessible references on state machines? I've
| just read about them a bit but never coded one in a project.
| Seems like a useful concept for testable, reliable code.
| GrantMoyer wrote:
| The video _Finite State Machines explained_ by Abelardo
| Pardo[1] seems like a good introduction (I'm not familiar
| with the author; I just searched Finite State machine on
| youtube and found the first result which wasn't a jumbled
| mess of abstract or misused jargon).
|
| It may seem simple, but that's truly all there is to finite
| state machines: a set of finite states and a set of events
| which cause transitions between states. Their specific
| applications to various domains are a much larger topic, but
| with a basic understanding, any application should be clear
| whenever it comes up. There's no one way to implement a state
| machine, and you've probably used them all the time, even if
| you haven't thought about them formally. For example, a
| simple but common 2-state machine takes the form:
|
|       running = true
|       while running:
|           ...
|           if <some condition>: running = false
|
| [1]: https://www.youtube.com/watch?v=hJIST1cEf6A
| d0mine wrote:
| It could be written as:
|
|       while True:
|           ...
|           if <exit condition>: break
| trealira wrote:
| A finite state machine has a set of input symbols, a set of
| states, and a transition function, which maps the current
| state and current input to the next state (and, when
| programming it, it doesn't have to be a literal function; for
| example, it could be a 2D array). It also has a start state,
| and a set of accept states.
|
| It's also common for state machines to have some sort of
| output; otherwise, all an FSM can do is recognize whether an
| input string is "good." So, there's also an output function,
| so that the FSM actually does something while processing the
| input string. There are two types of FSMs: a Mealy Machine,
| and a Moore machine. In a Mealy machine, the output function
| maps the current state and the current input to an output;
| so, one state could be associated with multiple outputs. In a
| Moore machine, the output is _purely_ a function of the
| current state, no matter the input. Usually, a Moore machine
| has more states than the equivalent Mealy machine. In
| practice, Moore machines are a lot rarer than Mealy machines.
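|
| A minimal sketch of a table-based Mealy machine in C (a toy
| word counter; the states and tables are mine, not from any
| particular implementation):
|
|       #include <stdio.h>
|
|       enum { OUT, IN };                         /* outside/inside a word */
|
|       static const unsigned char next[2][2] = { /* [state][is_space] */
|           { IN, OUT },                          /* from OUT */
|           { IN, OUT },                          /* from IN  */
|       };
|       static const unsigned char emit[2][2] = { /* output per transition */
|           { 1, 0 },                             /* OUT + word byte: new word */
|           { 0, 0 },
|       };
|
|       int main(void)
|       {
|           unsigned long words = 0;
|           int state = OUT, c;
|           while ((c = getchar()) != EOF) {
|               int sp = (c == ' ' || c == '\t' || c == '\n');
|               words += emit[state][sp];         /* Mealy: state AND input */
|               state = next[state][sp];
|           }
|           printf("%lu\n", words);
|           return 0;
|       }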
|
| You can model FSMs in a variety of ways. You could use the
| state pattern [0]. You could have a table-based state
| machine; examples in C here [1] [2]. It's also common to
| model them using a loop and a switch statement with integers
| to represent the current state - or just use "goto" to go to
| the next state, if it's available in your language.
|
| Also, this is tangential, but something nice about coroutines
| is that they can save you from writing an explicit state
| machine, as the state is implicit in your code. One example
| of this is how it's easier to write iterators as coroutines
| that yield results as they run, rather than as structures
| that update themselves based on their current state,
| transition to the next state, and yield a result.
|
| So far, what I've covered is basically a deterministic finite
| automaton (DFA). There are also non-deterministic finite
| automata (NDFAs), which are used to implement regular
| expressions (at least when regular expressions aren't
| extended to require more powerful recognizers) [3], as well
| as pushdown automata (PDAs), which are basically finite
| automata with a stack, and can parse e.g. programming
| languages. Finally, there are Turing machines.
|
| For theory on that, I'd recommend reading Sipser's
| _Introduction to the Theory of Computation._ If you look up
| the book on Google, it's not difficult to find a PDF of it.
|
| [0]: https://refactoring.guru/design-patterns/state
|
| [1]: https://nullprogram.com/blog/2020/12/31
|
| [2]: https://ideone.com/94U6aJ (this one removes C-style
| comments from source code - I didn't make it)
|
| [3]: https://swtch.com/~rsc/regexp/regexp1.html
| SoftTalker wrote:
| I agree, and I raised an eyebrow at "The algorithm is known as
| an "asynchronous state-machine parser". It's a technique for
| parsing that you don't learn in college."
|
| I certainly learned about state machines in college. Not sure
| we ever studied this particular algorithm but the concept was
| definitely covered.
|
| And I also agree that when you are dealing with a known, finite
| set of values, pre-computing lookup or index tables for
| techniques like this can be a big win.
| FroshKiller wrote:
| I'm a little surprised to hear this. Writing a state machine to
| simulate an alarm clock used to be an introductory project for
| first-year computer science undergraduates.
| jmholla wrote:
| > No, you aren't suppose to be able to see how the word-count
| works by looking at this code. The complexity happens elsewhere,
| setting up the state-machine.
|
| This is most of the reason I opened this article. I wish
| they'd spent more time talking about this. In fact, most of the
| README covers details that are important, but irrelevant to
| their algorithmic improvements.
|
| Maybe they expect you to open up the code, but even there the
| comments are largely describing what things do and not why, so I
| feel you need to have some understanding of the algorithm in
| question. For example, the parts where the state table gets
| constructed have rather unhelpful comments to the uninformed:
|       /**
|        * Build an ASCII row. This configures low-order 7-bits,
|        * which should be roughly the same for all states
|        */
|       static void
|       build_basic(unsigned char *row, unsigned char default_state,
|                   unsigned char ubase)
|
|       ...
|
|       /**
|        * This function compiles a DFA-style state-machine for
|        * parsing UTF-8 variable-length byte sequences.
|        */
|       static void compile_utf8_statemachine(int is_multibyte)
|
| Even `wc2o.c` doesn't delve into the magic numbers it has chosen.
| I was hoping this repo would be more educational and explain how
| the state table works and why it's constructed the way it is.
|
| Does anyone have a good resource for learning more about
| asynchronous state-machine parsers, that also could hopefully
| help explain why this is better than just iterating over
| characters? I'm guessing maybe it's the lack of branching?
| VMG wrote:
| I think the documentation might be stale, `build_basic` clearly
| has code for non-ASCII characters
| thechao wrote:
| I'm with you. It appears to be a fairly bog-standard DFA, but
| divining that from 4x4 `int`s and a 256 `char` array feels like
| I'd just be doing all of the heavy lifting, intellectually.
| However... it does look like it's trying to detect spaces
| (that's the array of 256 `char`, as `char` is self-indexing). I
| can't be bothered to remember how `wc` works, but my guess is
| that the state-machine has transition states for `sp->sp`,
| `sp->wd`, `wd->wd`, and `wd->sp`. It'd also need error modes. I
| suppose it doesn't contain states for escapes?
| zadokshi wrote:
| > why this is better than just iterating over characters?
|
| The explanation it provides regarding Apache vs Nginx seemed to
| me to imply that Apache requires memory allocation to read
| headers into buffers and Nginx uses a state machine to avoid
| memory allocation. Memory allocation is slower than most people
| realise.
| loeg wrote:
| Why is this better for wc, though? wc is not doing memory
| allocation in the hot loop.
| detourdog wrote:
| I think it keeps wc simple. If an algorithm is introduced
| that is not directly related to counting words, it becomes
| more complicated to debug. I see this as an example of the
| Unix philosophy of doing one thing well.
| jxndnxu wrote:
| Because classic wc is not iterating over every byte once,
| but multiple times.
|
| It's especially obvious in the Unicode case where it first
| takes 1-4 bytes to get a Unicode character and then checks
| this character with another function to see if it's
| whitespace
|
| But even with the naive ASCII approach, if you don't hand-
| roll a state machine you are checking multiple conditions
| on each byte (is it a space, am I leaving a word, etc.)
|
| Using a DFA has fixed compute per byte.
| kllrnohj wrote:
| Those 1-4 bytes are sitting in a register the entire time
| and thus basically free to read as often as you want,
| though.
|
| An actual sampled profile showing the two approaches
| would be interesting. Naively it seems like it's just
| because it has faster UTF8 handling and nothing to do
| with being a state machine exactly
| worewood wrote:
| Writing text and writing code are two distinct abilities. It is
| very rare to find a professional who possesses both.
| James_K wrote:
| You can figure it out by just looking at the table and output.
| It prints the number of lines, then words, then characters.
| Since the lines count is counts[1], you can conclude that when
| a newline is encountered, the machine will transition to state
| 1, hence why the newline is the only 2 entry in the character
| table and all 2 entries in the state machine point to state 1.
| From the character table, we can see that characters are split
| between classes 0, 1, 2 which are word, space, and newline
| characters.
|
| To get the definitions of all other states, by working back
| from the fact that state 2 is the word count, it must be
| entered when a word character (class 0) is hit after some
| space/newline characters. Hence state 0 represents the machine
| being in the middle of a sequence of whitespace, state 1 is
| when the machine is in a sequence of newlines, state 2 is the
| first letter of a word, and state 3 is any subsequent letter of
| a word. You can then verify that the machine transitions to the
| correct state from those by checking what it does in case of a
| word, space, or newline character. But I'm not sure why the
| table is 4x4. The fourth transition from every state is
| unreachable since the symbols are all in [0, 2].
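|
| Putting that derivation into code, a reconstruction of the
| machine as described above (not wc2o.c verbatim):
|
|       #include <stdio.h>
|
|       int main(void)
|       {
|           /* classes: 0 = word byte, 1 = space, 2 = newline;
|              class 3 is never produced -- the unreachable column */
|           static unsigned char cls[256];
|           /* states: 0 = in whitespace, 1 = just saw a newline,
|              2 = first byte of a word, 3 = inside a word */
|           static const unsigned char next[4][4] = {
|               {2, 0, 1, 0},   /* from 0 */
|               {2, 0, 1, 0},   /* from 1 */
|               {3, 0, 1, 0},   /* from 2 */
|               {3, 0, 1, 0},   /* from 3 */
|           };
|           unsigned long counts[4] = {0}, chars = 0;
|           unsigned state = 0;
|           int c;
|
|           cls[' '] = cls['\t'] = 1;
|           cls['\n'] = 2;
|           while ((c = getchar()) != EOF) {
|               state = next[state][cls[c]];
|               counts[state]++;   /* one unconditional add per byte */
|               chars++;
|           }
|           /* each visit to state 1 is a newline, each to state 2 a word */
|           printf("%lu %lu %lu\n", counts[1], counts[2], chars);
|           return 0;
|       }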
|
| As for why the approach is better, I think it's because it
| avoids branching. If you think about the length of a word, it's
| quite short compared to a CPU pipeline, so on a branching
| version of this you're going to be stuck spending most of your
| execution time on branch misprediction.
| Vvector wrote:
| I've read recently (here on HN) that branch predictors are
| right 99% of the time. Is that inaccurate?
| kardos wrote:
| It may be true in some cases, but not in general. I think
| of it as an observation that for large swathes of code the
| branch is predictable. Eg, we usually don't go down the
| error checking codepath. Faced with flow control that
| depends on random data, such as "is this random number
| even", the branch predictor will have a hard time doing
| better than 50%.
| James_K wrote:
| This is the 1% of cases. As I understand it, most branches
| take the same path most of the time. Think of a loop for
| instance. With purely statistical prediction (go whichever
| direction the branch takes more often), loops will be
| predicted with extreme accuracy and the result is a huge
| number of correct branches. Then think about error handling
| code which isn't triggered most of the time due to input
| being valid. Between these two easy cases, I think you
| cover the majority of branches in programs, which is why so
| many of them are predictable.
| drchickensalad wrote:
| If you're like me, you need to stop thinking of it as "99%
| of call sites" and start thinking of it as "99% of
| iteration loops". I know it's super obvious on the face of
| it, but the implication is that most call sites can
| repeatedly fail but are totally overshadowed by insanely
| high frequency hot loops that it almost always predicts
| right (it's skewed, just like avg vs median)
| tylerhou wrote:
| There are fewer branches but there is now a data dependency
| between loop iterations which makes each iteration slightly
| slower (maybe 1-2 cycles additional latency per iteration).
|
| Because newlines are relatively common and unpredictable, a
| state machine is likely better. But on a long file with no
| newlines and no spaces the branching one should be slightly
| faster. (All reasoning from first principles; I have not done
| any benchmarking!)
| anamexis wrote:
| The README covers the exact case you describe - the
| `word.txt` benchmark is just a file with 93MB of the char
| `x`. The state machine is still faster in this case.
| samatman wrote:
| > _But I'm not sure why the table is 4x4. The fourth
| transition from every state is unreachable since the symbols
| are all in [0, 2]._
|
| I think the fourth transition represents illegal Unicode.
| From there it stays in the illegal state until it hits legal
| UTF-8, then goes back to counting.
| James_K wrote:
| That isn't relevant in the ASCII example, and the UTF-8 one
| uses a 256x256 table.
| sebastianmestre wrote:
| Yeah, Sean Barrett has a good writeup about them and optimizing
| them for branch-prediction, etc
|
| It's framed in the context of programming-language
| tokenization, but the principles are the same.
|
| https://nothings.org/computer/lexing.html
| joe_the_user wrote:
| Hmm,
|
| A project with all its code working by special magic reminds me
| of another project that got some attention a while back (XZ lib -
| supposed improvements with packages containing clever system
| back doors).
|
| Edit: The code is likely harmless, but the more opaqueness there is in
| a system, the easier it is for malevolent opaqueness to hide.
| rakoo wrote:
| > This state machine approach always results in the same speed,
| regardless of input.
|
| This is strange to me, shouldn't it be O(N) with the size of the
| input? Or maybe in this benchmark the inputs are small enough
| that the difference isn't measurable?
| VMG wrote:
| He is talking in the context of same-sized files (counted in bytes).
| sebastianmestre wrote:
| They meant that different characters don't change the
| performance, unlike the Linux and MacOS versions, where
| different characters go through different code paths.
|
| But yes, a bigger file does take longer to process.
| roywiggins wrote:
| Speed as in bytes per second, regardless of which bytes you
| give it.
| fsckboy wrote:
| while you are in there:
|
|       wc --unicode
|       wc --bom
| aargh_aargh wrote:
| I wanted to take a peek at the state diagram (ASCII version). But I
| was feeling lazy, so apologies for having ChatGPT do the work for
| me. I wasn't going for correctness and didn't check it. If you're
| like me, here you go:
|
| https://dreampuf.github.io/GraphvizOnline/#digraph%20StateMa...
|
|       digraph StateMachine {
|           rankdir=LR;
|           size="8,5";
|           node [shape = circle];
|           LookingForWord -> InsideWord     [label="Non-space"];
|           LookingForWord -> LookingForWord [label="Space"];
|           LookingForWord -> Newline        [label="Newline"];
|           LookingForWord -> LookingForWord [label="Other"];
|           Newline -> InsideWord     [label="Non-space"];
|           Newline -> LookingForWord [label="Space"];
|           Newline -> Newline        [label="Newline"];
|           Newline -> LookingForWord [label="Other"];
|           InsideWord -> ContinueWord   [label="Non-space"];
|           InsideWord -> LookingForWord [label="Space"];
|           InsideWord -> Newline        [label="Newline"];
|           InsideWord -> LookingForWord [label="Other"];
|           ContinueWord -> ContinueWord   [label="Non-space"];
|           ContinueWord -> LookingForWord [label="Space"];
|           ContinueWord -> Newline        [label="Newline"];
|           ContinueWord -> LookingForWord [label="Other"];
|       }
|
| EDIT: Under peer pressure, I checked it and it correctly reflects
| the code apart from being designed for one specific line ending
| sequence (as it should, being an example optimized for brevity).
|
| As for the replies, as opposed to my approach, I'm sure when
| you're browsing research papers, you're doing a full
| reproducibility study for each one. I'm sorry I commented, I
| should have waited for you to post your results.
| bruce343434 wrote:
| This is a useless contribution to the discussion, maybe even
| negative... Why didn't you check it at least before sharing?
| aftbit wrote:
| Why? Because it's wrong? If it were correct, wouldn't it be
| useful?
| roywiggins wrote:
| This is a bit like presenting someone with a raw egg on a
| plate and saying "ah, but if I _had_ cooked it, might it
| not be a tasty omelet? " Correctness is the hard part!
| arp242 wrote:
| More like a covered plate you're expected to eat without
| looking. "This may be an amazing omelet and you will have
| a great meal, or you may eat a raw egg, get salmonella,
| and be sick for a week. It's probably an omelet, but I'm
| not sure, so you'll probably be fine. Enjoy!"
| roywiggins wrote:
| Surely correctness is the most important thing? Anyone can
| produce an _incorrect_ state diagram.
| mananaysiempre wrote:
| > This is analogous to NFA and DFA regular-expressions. If you
| use the NFA approach, you need to buffer the entire chunk of
| data, so that the regex can backtrack. Using the DFA approach,
| input can be provided as a stream
|
| Sorry, no, this is completely wrong. There are other reasons NFAs
| or DFAs might be preferable, and there are other performance
| considerations as far as backtracking is concerned, but NFA
| implementations do not backtrack. I mean, they could, but they
| don't, and in common usage a "NFA regex implementation" means a
| non-backtracking one (that goes through all input once with a bag
| of states in hand and advances each one of them on each step),
| whereas a "backtracking regex implementation" means one that
| doesn't use automata at all (and is usually worst-case
| exponential in the input).
|
| What this is actually talking about is a streaming parser. Just
| call it like it is, it's an OK term that sees plenty of usage.
|
| [What you _could_ mention here is PEG parsers, because even
| linear PEG implementations may need to keep O(input) state and
| thus are essentially incapable of streaming, unlike LL or LR
| ones; and GLR et al. can be even worse, for what it's worth. But
| this is a much more obscure topic.]
| veltas wrote:
| How do you implement a non-backtracking NFA other than
| converting it to a DFA?
| burntsushi wrote:
| At search time, the principal difference between an NFA and a
| DFA is that, in an NFA, you can be in more than one state at
| the same time. There's no problem with implementing your
| transition function to advance all states you are in
| simultaneously for each character of input. You'll get `O(m *
| n)` time, where `m` is proportional to the number of states
| in the NFA and `n` is proportional to the length of the
| input.
|
| The GP is remarking on a confusable term, "NFA algorithm,"
| that has come to mean something different from its underlying
| theory. The PCRE2 documentation itself, for example[1],
| remarks about how it uses the "NFA algorithm" as its standard
| approach to matching. But of course, PCRE2 supports _way_
| more than the theoretical model of non-deterministic finite
| automata (NFA) actually supports. PCRE2 is much more
| expressive. Yet, the implementation approach is still called
| "NFA algorithm." Presumably the reason for this is that an
| NFA _can_ be simulated by doing backtracking, but if you
| don't keep track of where you've visited, the worst case runtime
| of this is superlinear, possibly catastrophically so[2]. The
| problem is that the backtracking regex engines have evolved
| well beyond what an NFA actually supports. But the name
| stuck.
|
| To make things even more confusing, as I mentioned above, an
| NFA can be used to execute a search in O(m * n) time. (We
| typically call that "linear in the size of the input" since
| the NFA itself can usually, but not always, be regarded as a
| constant factor. i.e., A literal regex string in your program
| source code.) So when you say something like, "a finite
| automata regex engine uses NFAs to guarantee linear time,"
| you might be _really_ confused if your background is in
| backtracking engines which... also use an "NFA," but of
| course cannot guarantee linear search time.
|
| So what we have is ambiguous terminology, not unlike "regex"
| itself, which could perhaps mean a real theoretical "regular
| expression" (in the case of Hyperscan, RE2, Go's regexp
| package and Rust's regex crate) or it could mean "a very
| expressive DSL for pattern matching that supports more than
| what regular languages can describe" (in the case of PCRE2,
| ECMAScript's regex engine, Ruby's Oniguruma, Python's `re`
| module, and more). I've made my peace with "regex" being
| ambiguous, but the fact that "NFA algorithm" is itself
| ambiguous despite "non-deterministic finite automata" being a
| somewhat baroque and specific technical term, is very
| unfortunate.
|
| To make things even worse---and I shit you not, I am not
| joking---the backtracking people seem to have taken to
| calling a linear time NFA search the "DFA algorithm."[3] (At
| least the PCRE2 docs mention it's not a "traditional" finite
| state machine...) But... it's not a DFA. The backtracking
| folks have effectively elevated the terms "NFA" and "DFA" to
| mean something very different from their theoretical
| underpinnings.
|
| So when you have people that are only aware of one meaning of
| "NFA algorithm" talking to other people only aware of the
| other meaning of "NFA algorithm," chaos ensues.
|
| [1]: https://www.pcre.org/current/doc/html/pcre2matching.html
|
| [2]: https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#cloud...
|
| [3]: https://www.pcre.org/current/doc/html/pcre2matching.html
| #TOC...
| fanf2 wrote:
| Yeah, it's vexing. This backtracking / NFA / DFA confusion
| originally came from Jeffrey Friedl's book "mastering
| regular expressions" (O'Reilly, 1997) -- at least, that's
| the first place I saw the mistake in print. Philip Hazel
| relied a lot on Friedl's book when writing PCRE, which is
| why the PCRE docs get it wrong. (I spoke to Philip about
| this many years ago when we worked together.)
| burntsushi wrote:
| Yeah I didn't mention Friedl because I've dumped on him
| in the past, and didn't want to belabor it. But Friedl
| definitely didn't start this. According to him, he was
| just using what was common vernacular even back then:
| http://regex.info/blog/2006-09-15/248 (Note that blog is
| from 2006, but he's actually quoting himself from ~10
| years prior to that in 1997.)
|
| But... he is the one who ultimately installed this
| ambiguity into a classic book that is still read to this
| day. So I think he can at least be blamed for
| popularizing the confusion. And especially since the
| entire book is framed around "NFA" and "DFA" regex
| engines. It's not like it was just some one-off mention.
| The mix-up is baked into the conceptual fabric of the
| book. The landscape also looked different back then. It
| predated RE2 for example, and RE2 is, I think,
| principally responsible for bringing "backtracking
| semantics" to finite automata oriented engines. So
| Friedl's book is forever stuck in a false dichotomy that
| only _happened_ to exist back when he wrote it.
| loeg wrote:
| As GP said, you can keep a bag of current states.
| mananaysiempre wrote:
| Assuming each state transition consumes a character
| ("epsilon-free NFA"), just follow the definition. Before each
| step, you have a list of NFA states you might be in,
| initially containing only the start state. During each step,
| go through each state on that list and enter any acceptable
| successors into a new one. Deduplicate and repeat.
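|
| In C, with a bitmask as the bag (a sketch for epsilon-free NFAs
| of at most 32 states; the transition table is assumed to be
| given):
|
|       #include <stddef.h>
|       #include <stdint.h>
|
|       /* trans[q][c] = bitmask of states reachable from q on byte c */
|       uint32_t nfa_run(const uint32_t trans[][256], uint32_t start_set,
|                        const unsigned char *s, size_t n)
|       {
|           uint32_t cur = start_set;          /* bag of current states */
|           for (size_t i = 0; i < n; i++) {
|               uint32_t next = 0;
|               for (uint32_t bag = cur; bag; bag &= bag - 1)
|                   next |= trans[__builtin_ctz(bag)][s[i]];
|               cur = next;                    /* dedup is free: it's a bitset */
|           }
|           return cur;  /* AND with an accept-state mask to test a match */
|       }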
|
| This is (potentially super)linear in the number of states,
| but that doesn't preclude it from being linear in the input
| length. In particular, even though it's essentially a bad
| lazy implementation of standard NFA-to-DFA conversion, you're
| avoiding the exponential state blowup you get in the eager
| version. (I say it's a bad lazy implementation because a
| proper one would memoize; that, however, would bring the
| worst-case blowup back.) I think that, in addition to the
| fact that people do actually do this kind of thing in
| production implementations, qualifies it as a separate
| approach.
|
| If your NFA does have epsilon-transitions--like, say, the
| NFAs obtained from the textbook Thompson regexp-to-NFA
| construction--you can either use a different (Glushkov)
| regexp-to-NFA construction[1], eliminate those transitions as
| a separate pass[2], or adjust the algorithm above to
| eliminate duplicate NFA states on the fly[3].
|
| [1]: https://en.wikipedia.org/wiki/Glushkov%27s_construction_
| algo...
|
| [2]: https://news.ycombinator.com/item?id=19272990
|
| [3]: https://swtch.com/~rsc/regexp/regexp2.html
| bruce343434 wrote:
| What about this is "asynchronous"?
| roelschroeven wrote:
| I don't get that either.
|
| It feels like they compare the asynchronous-ness of their
| algorithm to Nginx being asynchronous ("asynchronous web-
| servers like Nginx use a state-machine parser. They parse the
| bytes as they arrive, and discard them."), but I don't see how
| that relates. The way web-servers handle requests (multiple
| threads vs multiple processes vs asynchronous event-driven in
| one thread) is completely orthogonal to how they parse headers.
|
| The impression I have is that their algorithm works in a
| streaming way, without having to allocate memory buffers, and
| that they call that asynchronous (wrongly, as far as I can
| see).
|
| I also don't really see what they mean by their algorithm being
| more scalable.
| laurent_du wrote:
| My understanding was that scalability came from the removal
| of memory buffers.
| smaudet wrote:
| I was a bit surprised by the macos/linux speed differential here:
|
| > The difference in macOS and Linux speed is actually the
| difference in clang and gcc speed. The LLVM clang compiler is
| doing better optimizations for x86 processors here.
|
| I know GCC has been getting much, much more aggressive as of
| late, but there have been some complaints that it is now much,
| much less safe (checks intended for security getting optimized
| away, e.g.).
|
| I wonder if you were to go back to 5 years ago, if you'd see the
| linux code was generally safer than the speedy osx llvm code...
| kstrauser wrote:
| I'd assume the difference between Linux's GNU and macOS's BSD
| implementations would be much more significant.
| pjdesno wrote:
| Exactly. MacOS (really BSD) grep sucks, 5-10x slower than GNU
| Grep for simple searches on the same hardware, but MacOS/BSD
| wc is fast.
| aftbit wrote:
| It would have been nice for the author to label the states and
| explain how and why they chose the states that they did.
| Otherwise, it's hard to understand how to apply this sort of
| thing to larger problems.
| jxndnxu wrote:
| How would it have helped? You either design a DFA by hand or
| use one compiled from a regular language.
| cryptos wrote:
| Now we need another round of the language battle, because now
| this approach needs to be implemented in Rust, Go, Zig, Nim ...
| you name it!
| rep_movsd wrote:
| Remember "The Zen of Code Optimization" by Michael Abrash in the
| 90s?
|
| This word count challenge was the first one he described. The
| table driven approach was discussed, but was not necessarily the
| fastest due to the large table needed, which overflowed the
| cache on 90s processors.
|
| I used the same example in a chapter I wrote for a C++ book
| moody__ wrote:
| The "asynchronous state machine" name here is a bit strange, when
| searching for this term used elsewhere I couldn't find any formal
| definition of what it is. Reading further in the README it looks
| like the author implies that it really just means a DFA? Not
| entirely sure.
|
| I'd also like to add the Plan 9 implementation[0], which also
| uses the properties of utf8 as part of its state machine and
| anecdotally has been quite performant when I've used it.
|
| [0]
| http://git.9front.org/plan9front/plan9front/107a7ba9717429ae...
| rhelz wrote:
| "Asynchronous" isn't part of the name of some really cool state
| machine :-) Its just an adjective and means the same as when
| you put it in front of any other noun.
|
| A _synchronous_ state machine is one where the incoming stream
| of events is always "in sync" with the state transitions, in
| the following sense:
|
| 1. When an event happens, the state machine can transition to
| the next state and perform any necessary actions _before_ the
| next event happens
|
| 2. After the state machine has transitioned and performed
| any necessary actions, the program must wait for the next event
| to happen. It can't do anything else until it does.
|
| An _asynchronous_ state machine doesn't make the main program
| wait until the next event happens. It can go on and do other
| things in the meantime. It doesn't have to wait for the next
| event to arrive.
|
| When the next event _does_ arrive, the program pauses whatever
| else it is doing, and hands control back to the state machine,
| which processes the event, and then hands control back over to
| the main program.
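|
| In code, the difference shows up in the interface: a push API
| that keeps its state between calls, instead of a loop that
| blocks on input. A sketch (names illustrative, not from wc2):
|
|       #include <stddef.h>
|
|       struct wc_machine { int in_word; unsigned long words; };
|
|       /* Feed whatever bytes happen to have arrived, then return
|          immediately so the caller can do other work in between. */
|       void wc_feed(struct wc_machine *m, const unsigned char *buf,
|                    size_t n)
|       {
|           for (size_t i = 0; i < n; i++) {
|               int sp = (buf[i] == ' ' || buf[i] == '\t' ||
|                         buf[i] == '\n');
|               m->words += (!m->in_word && !sp);  /* a word starts */
|               m->in_word = !sp;
|           }
|       }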
| PaulHoule wrote:
| As I see it, state machines are particularly good for
| expressing logic in asynchronous systems. For instance in the
| late 1980s I wrote assembly language XMODEM implementations
| for the 6809 and the 80286 and since that kind of code is
| interrupt driven, it is efficient to make a state machine that
| processes one character at a time. Today when you use
| async/await the compiler converts your code, loops and all,
| into a state machine.
| rhelz wrote:
| // 6809
|
| The good old days :-)
| moody__ wrote:
| I was not treating "asynchronous state machine" as a noun;
| even if taken as a generic adjective it doesn't make sense in
| this context. What "other things" is this wc2.c doing while
| the state machine is churning? There is no multi threading or
| multi processing going on here. So I find it hard to believe
| that this use of "asynchronous" is inside of what I would
| generally see it used as. As such I thought perhaps it
| referred to a specific methodology for designing the code,
| something akin to how the "lock free" adjective implies a
| certain design sensibility.
| rhelz wrote:
| // What "other things" is this wc2.c doing...
|
| AFAICT, wc2.c isn't written to be an asynchronous state
| machine. It doesn't ever seem to transfer the control to
| any other place.
|
| // So I find it hard to believe that this use of
| "asynchronous" is inside of what I would generally see it
| used as
|
| Yeah, you are legitimately confused. The post talks about
| asynchronous state machines, but wc2.c isn't an example of
| that. I'm sure this gave you a severe case of WTF?!??
|
| // thought perhaps it referred to a specific methodology
| for designing the code
|
| It does---that's exactly what it is, a programming
| methodology, or perhaps better put, a design pattern. But
| wc2.c isn't an example of code written using that
| methodology. Again, you are legitimately confused here,
| because the post talks about something and wc2.c isn't
| that.
|
| Do you know python? If you google for "asynchronous
| programming in python" you'll get all kinds of blog posts
| and youtube videos which explain the technique.
| JonChesterfield wrote:
| Recognising that wc is a lexer problem and implementing it using
| re2c would be my first choice here. That'll be building a similar
| state machine to what is done by hand here, especially since the
| hand rolled one is still byte at a time (the main drawback to
| re2c's generated output).
| GrantMoyer wrote:
| > The algorithm is known as an "asynchronous state-machine
| parser". It's a technique for parsing that you don't learn in
| college.
|
| The mapping from regular and context free languages to state
| machines and their table-based implementations was covered in
| depth in the Compilers course I took. A table based state machine
| is literally the textbook algorithm for parsing these classes of
| languages, and implementing them has been almost entirely
| automated by tools like Flex and Bison since the 80s.
| mcguire wrote:
| For example, see Chapter 3 of the Dragon Book, _Compilers:
| Principles, Techniques, and Tools_ by Aho, Ullman, et al. Or
| probably any other compilers text book.
| mywittyname wrote:
| Compilers was a graduate course at my school.
|
| CS as a field has grown quite a lot over the years, and it
| seems that colleges are responding by pushing the more
| theoretical classes into grad school to make room for the more
| applied classes.
| throwaway55533 wrote:
| From reading the article, how is this more efficient? Doesn't any
| word counting algorithm have to iterate through all the
| characters and count spaces?
|
| What makes this better than the standard algorithm of:
|
|       wc = 0
|       prev = null
|       for ch in charStream:
|           if !isSpace(ch) and isSpace(prev):
|               wc += 1
|           prev = ch
| GrantMoyer wrote:
| Basically, it's just a lookup table for isSpace, rather than
| whatever logic is in the original function (which probably has
| conditional branches).
|
| There's a little bit of a complication in the fact that a state
| machine can implicitly share parts of the computation for utf-8
| encoded codepoints with shared prefixes, so instead of a
| 2^32-element lookup table, you only need 4 2^8-element lookup
| tables (and instead of 1 state, parsing codepoint, you have 4,
| parsing first byte, parsing second byte, etc.).
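|
| To make the prefix-sharing concrete, here is a toy version of
| such a byte-wise table (it only tracks how many continuation
| bytes remain; a real wc needs extra states to classify which
| codepoints are spaces):
|
|       /* utf8_next[s][b]: state s = continuation bytes still expected */
|       static unsigned char utf8_next[4][256];
|
|       static void build_utf8_next(void)
|       {
|           for (int b = 0; b < 256; b++) {
|               int lead = (b < 0x80)           ? 0  /* ASCII, complete */
|                        : ((b & 0xE0) == 0xC0) ? 1  /* 2-byte sequence */
|                        : ((b & 0xF0) == 0xE0) ? 2  /* 3-byte sequence */
|                        : ((b & 0xF8) == 0xF0) ? 3  /* 4-byte sequence */
|                        : 0;                        /* invalid: resync */
|               utf8_next[0][b] = lead;       /* expecting a lead byte */
|               for (int s = 1; s < 4; s++)
|                   utf8_next[s][b] = s - 1;  /* consume a continuation */
|           }
|       }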
| throwaway55533 wrote:
| Ah. Is 8 bits really optimal? I don't know how many UTF space
| characters there are, I thought there were only a few. Why
| not a 16-bit lookup?
| GrantMoyer wrote:
| UTF-8 encoded codepoints can have an odd number of bytes,
| so processing a file 2 bytes at a time would be a little
| more complicated. Processing the file one codepoint at a
| time works because you can decode the UTF-8 stream into a
| stream of 32-bit codepoints before passing the codepoints
| to the lookup table. I suppose you could also transform
| from UTF-8 to UTF-16 before passing values to the lookup
| table.
|
| Processing byte by byte isn't necessarily faster than
| processing codepoint by codepoint, or any other size. You'd
| need to measure performance empirically, and it probably
| depends on cache sizes and other factors. In theory, you
| could also process bit by bit -- then you'd only need 32
| 2-element lookup tables -- but that's unlikely to be
| efficient, since you'd need to do a lot of bit
| manipulation.
|
| Edit: Upon inspection, the method I described doesn't
| appear to be the method used by the featured program. It
| still basically uses a lookup table for detecting space
| characters, byte by byte, but the states are not how I
| described. Instead of the states representing which byte of
| a UTF-8 encoded codepoint is being processed, and the word
| count being incremented upon certain transitions -- a Mealy
| machine -- the states represent the class of the codepoint
| last fully processed, and the count is always increased
| based on the current state (often by zero) -- a Moore
| machine.
| jxndnxu wrote:
| You have answered your question yourself: your algorithm looks
| at each byte twice, not once
|
| It's even more obvious in the UTF case where the classic
| implementation first looks at 1-4 bytes to parse a character and
| only then checks if it's a space
| JackSlateur wrote:
| Reading the code to help me understand how things are done
|
| I got the wc_lines from coreutils:
| https://github.com/coreutils/coreutils/blob/master/src/wc.c#...
|
| And I thought "damn, I understand nothing about the state-machine
| stuff, how did they make this faster?"
|
| Truth is: they did not. Of course, this is just the "count line"
| part. Other parts are indeed faster.
|
| Coreutils 9.4:
|
|       0.67 [jack:/tmp] /usr/bin/time -v wc -l debian-12.2.0-amd64-netinst.iso
|       2524855 debian-12.2.0-amd64-netinst.iso
|       Command being timed: "wc -l debian-12.2.0-amd64-netinst.iso"
|       User time (seconds): 0.00
|       System time (seconds): 0.07
|       Percent of CPU this job got: 98%
|       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
|       Average shared text size (kbytes): 0
|       Average unshared data size (kbytes): 0
|       Average stack size (kbytes): 0
|       Average total size (kbytes): 0
|       Maximum resident set size (kbytes): 2176
|       Average resident set size (kbytes): 0
|       Major (requiring I/O) page faults: 0
|       Minor (reclaiming a frame) page faults: 99
|       Voluntary context switches: 1
|       Involuntary context switches: 0
|       Swaps: 0
|       File system inputs: 0
|       File system outputs: 0
|       Socket messages sent: 0
|       Socket messages received: 0
|       Signals delivered: 0
|       Page size (bytes): 4096
|       Exit status: 0
|
| And wc2 (from gcc -O3 -o wc2 wc2.c):
|
|       0.47 [jack:/tmp] /usr/bin/time -v ./wc2 -l debian-12.2.0-amd64-netinst.iso
|       2524855 debian-12.2.0-amd64-netinst.iso
|       Command being timed: "./wc2 -l debian-12.2.0-amd64-netinst.iso"
|       User time (seconds): 0.97
|       System time (seconds): 0.05
|       Percent of CPU this job got: 99%
|       Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
|       Average shared text size (kbytes): 0
|       Average unshared data size (kbytes): 0
|       Average stack size (kbytes): 0
|       Average total size (kbytes): 0
|       Maximum resident set size (kbytes): 2176
|       Average resident set size (kbytes): 0
|       Major (requiring I/O) page faults: 0
|       Minor (reclaiming a frame) page faults: 146
|       Voluntary context switches: 1
|       Involuntary context switches: 1
|       Swaps: 0
|       File system inputs: 0
|       File system outputs: 0
|       Socket messages sent: 0
|       Socket messages received: 0
|       Signals delivered: 0
|       Page size (bytes): 4096
|       Exit status: 0
|
| Gotta keep reading
| orf wrote:
| Any notable difference if you pipe the file in, rather than
| read from disk?
|
| IO caches are a thing as well, don't use the first measurement.
| JackSlateur wrote:
| I've accounted for IO cache, the test is running from an in-
| memory dataset
|
| After checking a bit more, wc2 is indeed faster than wc in a
| specific use case: counting characters. That is, transforming
| a sequence of one or multiple bytes into a meaningful character
| (wc --chars somefile).
|
| In wc, the related code is here: https://github.com/coreutils
| /coreutils/blob/master/src/wc.c#...
|
| So in the end, wc uses the following function while wc2 does not:
| https://en.cppreference.com/w/c/string/multibyte/mbrtoc32
|
| Now, what I cannot tell: is wc2's code really comparable
| to mbrtoc32? From a couple of tests, it works against
| ASCII and UTF-8 datasets, but is that all? Or are there
| edge cases cleverly handled by mbrtoc32 (with a performance
| cost)?
| roetlich wrote:
| From looking at the code, the clever bit is how this is doing
| word count (and word & line count at the same time). So yeah,
| it makes sense that -l isn't better.
| NunoSempere wrote:
| From a previous project:
| https://nunosempere.com/blog/2023/09/15/wc/ that was looking not
| at speed, but at simplicity, here is a comparison of different
| historical versions of wc:
|
| ---
|
| The [version of wc.c](https://git.nunosempere.com/personal/wc/src
| /branch/master/sr...) in this repository sits at 44 lines. It
| decides to read from stdin if the number of arguments fed to it
| is otherwise zero, and uses the standard C function getc to read
| character by character. It doesn't have flags, instead, there are
| further utilities in the src/extra/ folder for counting
| characters and lines, sitting at 32 and 35 lines of code,
| respectively. This version also has little error checking.
|
| [Here](https://github.com/dspinellis/unix-history-
| repo/blob/Researc...) is a version of wc from UNIX V7, at 86
| lines. It allows for counting characters, words and lines. I
| couldn't find a version in UNIX V6, so I'm guessing this is one
| of the earliest versions of this program. It decides to read from
| stdin if the number of arguments fed to it is zero, and reads
| character by character using the standard C getc function.
|
| The busybox version ([git.busybox.net](https://git.busybox.net/bu
| sybox/tree/coreutils/wc.c)) of wc sits at 257 lines (162 with
| comments stripped), while striving to be [POSIX-
| compliant](https://pubs.opengroup.org/onlinepubs/9699919799/),
| meaning it has a fair number of flags and a bit of complexity. It
| reads character by character by using the standard getc function,
| and decides to read from stdin or not using its own
| fopen_or_warn_stdin function. It uses two GOTOs to get around,
| and has some incomplete Unicode support.
|
| The [plan9](https://9p.io/sources/plan9/sys/src/cmd/wc.c) version
| implements some sort of table method in 331 lines. It uses plan9
| rather than Unix libraries and methods, and seems to read from
| stdin if the number of args is 0.
|
| The plan9port version of wc ([github](https://github.com/9fans/pl
| an9port/blob/master/src/cmd/wc.c)) also implements some sort of
| table method, in 352 lines. It reads from stdin if the number of
| args is 0, and uses the Linux read function to read character by
| character.
|
| The [OpenBSD](https://github.com/openbsd/src/blob/master/usr.bin/
| wc/wc.c) version is just *nice*. It reads from stdin by default,
| and uses a bit of buffering using read to speed things up. It
| defaults to using fstat when counting characters. It is generally
| pleasantly understandable, nice to read. I'm actually surprised
| at how pleasant it is to read.
|
| The [FreeBSD
| version](https://cgit.freebsd.org/src/tree/usr.bin/wc/wc.c) sits
| at 367 lines. It has enough new things that I can't parse all
| that it's doing: in lines 137-143, what is capabilities mode?
| what is casper?, but otherwise it decides whether to read from
| stdin by the number of arguments, in line 157. It uses a
| combination of fstat and read, depending on the type of file.
|
| Finally, the GNU utils version ([github](https://github.com/coreu
| tils/coreutils/tree/master/src/wc.c), [savannah](http://git.savan
| nah.gnu.org/gitweb/?p=coreutils.git;a=blob;f...)) is a bit over
| 1K lines of C. It does many things and checks many possible
| failure modes. I think it detects whether it should be reading
| from stdin using some very wrapped fstat, and it reads character
| by character using its own custom function.
|
| So this utility started out reasonably small, then started
| getting more and more complex. [The POSIX
| committee](https://pubs.opengroup.org/onlinepubs/9699919799/)
| ended up codifying that complexity, and now we are stuck with it
| because even implementations like busybox which strive to be
| quite small try to keep to POSIX.
| trealira wrote:
| There's another version from Unix V10: http://tuhs.org/cgi-
| bin/utree.pl?file=V10/cmd/wc.c
|
| Also, from V8: https://www.tuhs.org/cgi-
| bin/utree.pl?file=V8/usr/src/cmd/wc...
| btown wrote:
| State machines are great for complex situations, but when it
| comes to performance, it's not at all clear to me that they're
| the most scalable approach with modern systems.
|
| The data dependency between a loop iteration for each character
| _might_ be pipelined really well when executed, and we can assume
| a large enough L1/L2 cache for our lookup tables. But we're still
| using at least one lookup per character.
|
| Projects like https://github.com/simdjson/simdjson?tab=readme-ov-
| file#abou... are truly fascinating, because they're based on SIMD
| instructions that can process 64 or more bytes with a single
| instruction. Very much worth checking out the papers at that
| link.
| jxndnxu wrote:
| My understanding of the article's use of scalable was "fixed
| overhead more or less regardless of the complexity of the state
| machine and input" not "fastest implementation available"
| apitman wrote:
| I didn't know you were allowed to use a language other than Rust
| for something like this.
| nitwit005 wrote:
| The primary reason their code is faster is that they've made
| incorrect code that assumes you always have a UTF-8 locale. The
| normal isspace does not assume this:
|
| > The real programs spend most of their time in functions like
| mbrtowc() to parse multi-byte characters and iswspace() to test
| if they are spaces
|
| This will produce bad results in the real world. People have
| previously posted about bugs related to this on Hacker News:
| https://news.ycombinator.com/item?id=36216389
| scottlamb wrote:
| Are we looking at the same thread? Folks there seem to be
| complaining the old interfaces are anything from outdated to
| confusing to simply wrong, and I agree.
|
| I think it's totally reasonable for a program designed in 2024
| to say it only supports ASCII and UTF-8 encodings. Whether/how
| it should support the full spectrum of Unicode definitions of
| characters/graphemes/... is more interesting. For a lot of
| backend code (processing text-based network protocols, for
| example), focusing on ASCII is arguably best (e.g. functions
| like isspace and isdigit only returning true for ASCII
| characters). For more user-focused stuff, most of the world
| would say they'd like their native language supported well.
| Programs written with both use cases in mind should probably
| have a switch. (Or parallel APIs: e.g. Rust has
| {u8,char}::is_ascii_digit vs. char::is_numeric, which makes
| much more sense there than having one that switches behaviors
| based on an environment variable, as the correct behavior
| really depends on the call site.)
|
| Of course "say it only supports ASCII and UTF-8 encodings" is
| mutually exclusive with claiming to be a drop-in replacement
| for a utility that is specified as having POSIX locale support.
| This project does not make that claim, or even that it's
| intended to be actually used rather than illustrative of a
| performance technique.
| leni536 wrote:
| It would be just as fast if it detected the locale correctly,
| and used the UTF8 state machine for the UTF8 locale only. Other
| locales could have their own state machines or just fall back
| to a generic implementation.
| 1vuio0pswjnm7 wrote:
| "The wc program included with macOS and Linux are completely
| different. Therefore, the following table shows them benchmarked
| against each other on the same hardware."
|
| Appears BSD wc ("the macOS program") is much faster than the GNU
| coreutils wc ("the Linux program").
|
| It certainly illustrates a point about people rewriting wc in
| newer languages to demonstrate alleged speed against C.
|
| Which wc are they using for comparison?
|
| Perhaps they should use wc2 or wc2o.
___________________________________________________________________
(page generated 2024-06-20 23:00 UTC)