[HN Gopher] tolower() in bulk at speed
___________________________________________________________________
tolower() in bulk at speed
Author : fanf2
Score : 84 points
Date : 2022-06-27 21:12 UTC (1 hours ago)
(HTM) web link (dotat.at)
(TXT) w3m dump (dotat.at)
| [deleted]
| mananaysiempre wrote:
| This technique of SWAR ("SIMD within a register") with very
| narrow elements and no hardware support is explained very well in
| an SO answer[1] (and the linked slides) about the mythical (and
| useless) "fusion trees".
|
| [1] https://stackoverflow.com/a/56604876
| dragontamer wrote:
| Seems a bit odd to stop at SWAR on today's systems.
|
| Every system I'm aware of has 128-bit SIMD implemented: either
| SSE (x86), NEON (ARM), or AltiVec (POWERPC). As such, 128-bit
| SIMD is the "reliably portable" SIMD operation.
|
| Of course, for fastest speeds, you need to go to the largest
| SIMD-register size for your platform: 512-bit for some Intel
| processors, rumored AMD Zen4 and Centaur chips. 256-bit for most
| Intel / AMD Zen3 chips. 128-bit for Apple ARMs, 512-bit for
| Fujitsu A64 ARMs, etc. etc.
|
| > And these are the fastest kinds of instructions :-)
|
| And it should be noted that modern SIMD instructions execute at
| one-per-clock-tick. So the 64-bit instructions certainly are
| fast, but the SIMD instructions tie if you're using simple XOR or
| comparison operations.
|
| --------
|
| This style of code might even be "reliably auto-vectorizable" on
| GCC and CLANG actually. I wonder if I could get portable auto-
| vectorizing C from these examples.
| fanf2 wrote:
| Is there a good way to load an arbitrary number of bytes (up to
| the vector size) into a vector register? As I said in the
| article, I could not find one when looking at AVX2 or NEON
| reference manuals. Getting the data from RAM is the main
| bottleneck for short strings, which DNS names usually are.
| dragontamer wrote:
| You're not thinking in terms of SIMD.
|
| If you load 5 bytes into a 128-bit register (16-bytes) you'll
| have 5-bytes of data + 11-bytes of garbage.
|
| Perform the calculation over all 16-bytes. Yes, this makes
| 11-bytes of garbage at the end. Once you're done, just write
| back the first 5 bytes and you're set.
|
| The 11-bytes of garbage are "free". They didn't cost you
| anything.
|
| > Getting the data from RAM is the main bottleneck for short
| strings
|
| Your L1 and L2 cache-lines are 64-byte minimum read / write
| anyway (actually, some systems were 128-byte minimum IIRC). A
| 16-byte read is literally smaller than what your cache does
| on the path into your registers.
|
| EDIT: More importantly, modern CPUs only have 2x or 3x
| load/store units. Meaning a modern CPU can only perform ~2ish
| load/stores per clock tick. The SSE (16-byte / 128-bit) read
| / write to L1 cache will perform the same speed as a 8-byte
| /64-bit read/write to L1 cache in practice.
| fanf2 wrote:
| In fact the code that I left out of the blog post does
| almost exactly what you suggest :-) It's the "load 5 bytes"
| and "store 5 bytes" that I don't have a good solution for.
| At the moment I am using memmove() and relying on the
| compiler developers to have better ideas about optimizing
| it than I do... The bottleneck comes from the number of
| instructions and the branchiness, not the data bandwidth.
|
| I briefly considered playing games with overlong reads, but
| then asan gave me a slap and I reconsidered the path to
| wisdom.
| dzaima wrote:
| Doesn't seem to be for SSE & AVX at least (only 32-bit and
| 64-bit groups in AVX, and nothing for SSE). AVX-512 has
| _mm512_maskz_loadu_epi8, but not much has AVX-512.
|
| It's a shame that there aren't guarantees on being able to
| overread some number of garbage bytes. (for a SIMD-heavy
| project, I've just went with a custom allocator that
| guarantees the ability to read past the end of allocations,
| and storing by either maskstore for 32/64-bit, and blending
| with what's already there for 8/16-bit)
| dietrichepp wrote:
| Naive version to explore autovectorization:
|
| https://gcc.godbolt.org/z/bcefvznG3
|
| Yes, there's some amount of autovectorization here. Seems like
| a mess of a function. Here's something more like tolower8():
|
| https://gcc.godbolt.org/z/h8GTanodd
|
| The generated code definitely looks funny to me. Here is a
| manual vectorization, which is shorter:
|
| https://gcc.godbolt.org/z/1e4odEsKq
|
| The issue of loading into your vector register... well, there
| are some dirty tricks for that. The 16-byte slice containing
| the first byte, and the 16-byte slice containing the last byte,
| can both be loaded into registers and then shifted around in
| order to construct the desired value. Note the careful wording
| here... these slices might be the same slice. Or you can
| iterate over the 16-byte slices containing the array, and shift
| as you go, if you're storing into a different location. Or you
| can use various masked load/store operations on various
| architectures.
| dragontamer wrote:
| > The issue of loading into your vector register... well,
| there are some dirty tricks for that.
|
| Good discussion. I think the only method you haven't talked
| about is the simple "unaligned load" instructions (which
| might be the simplest, and most portable way, to do this). I
| know that ARM and x86 both can do unaligned loads no problem,
| but at a possible performance penalty.
| moonchild wrote:
| > modern SIMD instructions execute at one-per-clock-tick
|
| Or even two, or four on apple arm, plus load and store which
| can be done in parallel.
| dragontamer wrote:
| Yeah, good point. SIMD is also superscalar on most systems.
|
| For x86, Intel can do 3x AVX512 instructions per clock tick
| as long as each instruction is simple enough (add, AND, OR,
| NOT, XOR, maybe even multiply if you're not counting the
| latency issue)
| sedatk wrote:
| "labels are frequently less than 8 bytes long, and therefore
| fit inside a 64-bit register. So it probably isn't worth
| dealing with the portability issues of working with wide vector
| registers (Especially since I could not find a quick way to
| load an arbitrary number of bytes into a vector register with
| AVX2 nor with NEON.)"
| dragontamer wrote:
| DNS labels like "ycombina" (tor) you mean? :-)
| fanf2 wrote:
| Two out of three ain't bad :-)
|
| _news_.ycombinator. _com_
| londons_explore wrote:
| All those arithmetic calculations depend on the previous...
|
| You should get much more throughput if you can interleave them
| with other instructions...
|
| I wonder if that's what the benchmarks did?
| fanf2 wrote:
| Hmm, yes, there are only 2 or 3 instructions that can execute
| concurrently - but your observation made me realise I can eke
| out another by masking is_ascii more thoroughly. Thanks!
|
| Bulk throughput isn't really my aim, it's just a convenient way
| to get numbers big enough to be easily measurable :-)
| dzaima wrote:
| An out-of-order CPU will interleave things for you. Any modern
| CPU should have no issues running many iterations of the loop
| in parallel, as nothing should depend on the previous
| iteration.
|
| But, given that the task at hand usually deals with very short
| inputs, there's not much to interleave with anyway.
| londons_explore wrote:
| Worth patching libc?
| vinkelhake wrote:
| libc tolower() works on a single char at a time. This post is
| about the gains you can get by converting multiple chars at
| once.
| mananaysiempre wrote:
| The standard tolower() / toupper() take a single character
| only; bulk strlwr() / strupr() are nonstandard (if commonly
| present) and--more importantly--virtually unused. I suppose
| implementing this technique in the C/POSIX-locale version of
| nonstandard stricmp() / POSIX strcasecmp() might be helpful in
| some cases, because people do use that one, but I still expect
| that any situations that truly call for an ASCII-only case-
| insensitive comparison (parsers?) will be doing much more work
| per byte for other reasons (state machine, perfect hash, _etc._
| ).
| stabbles wrote:
| It's not the same, libc's tolower takes an int as an individual
| character
| Findecanor wrote:
| Although ... the existence of a libc with SIMD versions of
| its functions is not implausible. There are compilers that
| would produce such functions beside the normal ones if the
| source is decorated with the right #pragma.
|
| Such functions would be called only from within vectorised
| loops (or other SIMD versions of functions).
| Findecanor wrote:
| This type of bit manipulation will only work with pure ASCII,
| which had been designed to make this transformation simple with
| only bit manipulation -- just not in parallel.
|
| Most systems default to using Unicode these days, for which the
| problem is _much_ more complex even when the language is set to
| English.
| babelfish wrote:
| Great article. Have been brushing up on bit manipulation for
| interview prep lately and I love finding easy to digest, real-
| world use cases like this.
| MrYellowP wrote:
| Had I known anyone cares about this ...
|
| Anything else anyone cares about, that could use a speedup... ?
| charcircuit wrote:
| Since domains are UTF-8 doesn't the assumption of ASCII
| breakdown?
| sveiss wrote:
| Domain names are limited to a small subset of ASCII; the DNS
| protocol doesn't support UTF-8.
|
| To work around this, the international domain name standard
| defines an encoding called Punycode which maps Unicode to the
| limited character set DNS supports. The server is unaware of
| this, and so this optimised tolower() implementation works
| without any Unicode considerations.
| asveikau wrote:
| Lower case is locale-sensitive, however. For example,
| tolower('I') in Turkish should be 'i'.
|
| And within unicode, doing it in a "dumb ascii" way probably
| needs some normalization of diacritics. Eg. 'e' should be
| U+0065 U+0301 ("e\xcc\x81"), not U+00E9 ("\xc3\x89").
|
| Not sure how punycode handles this, I did once look deeply
| into it but that was years ago.
| jimmygrapes wrote:
| Absolutely. This article is reminiscent of the type of brief
| tutorials from the mid/late 90s in QBASIC newsletters
| introducing how to use ASM to get speed gains, with all the
| associated assumptions. The concept of "lowercase" breaks down
| terribly once you go beyond simple A-Z. You can apply similar
| rules to certain limited code sets, but a universal tolower()
| won't work so well with this method.
| mgcunha wrote:
| DNS labels are limited to some ASCII characters. UTF support in
| DNS is available via punycode which is an encoding from UTF
| onto that restricted ASCII set acceptable for DNS. Libraries as
| the ones discussed here typically perform UTF to punycode
| conversions before doing any label comparisons to ensure
| accuracy. In particular this tolower implementation would
| likely be used against the punycode encoded version of the UTF
| domain.
|
| https://en.wikipedia.org/wiki/Punycode
___________________________________________________________________
(page generated 2022-06-27 23:00 UTC)