[HN Gopher] Fast character case conversion, or how to really com...
___________________________________________________________________
Fast character case conversion, or how to really compress sparse
arrays
Author : apankrat
Score : 32 points
Date : 2021-02-22 17:58 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lifthrasiir wrote:
| Ha, that's exactly what I've done for rust-encoding [1]! (Now
| obsolete, go use encoding_rs instead.) For TSP I resorted to a
| simple greedy algorithm, which was good enough for me.
|
| To be honest though, this problem of shrinking a large table
| while keeping it randomly accessible is hardly new (e.g. perfect
| hashing), and pretty much everything can be automated with only
| some expert guidance. I'd like to see a neat solution that can
| interactively search remappings and emit neat code in various
| languages.
|
| (As a fellow Unicode commentator, I should also mention that the
| OP only concerns what Unicode calls _simple_ case mappings. But
| those are pretty much the only mappings we'd want to optimize,
| given their prevalence.)
|
| [1] https://github.com/lifthrasiir/rust-
| encoding/blob/master/src...
| bear8642 wrote:
| Nice! Bit confused by the link to TSP, but cool!
| apankrat wrote:
| Say, we have our N blocks. We make these into the nodes in a
| graph. Every node is connected to all other nodes.
|
| An edge from A to B represents the case where block A precedes
| block B. The weight of that edge is the AB overlap. Note that
| the same pair of nodes may have a different weight in the
| opposite direction, because that edge carries the BA overlap.
|
| So what we are after is the heaviest path through this graph
| that passes through all nodes exactly once. This can be reduced
| to finding the heaviest cyclic path for a specific starting
| point (excluding the cheapest edge from the solution), and then
| iterating over all starting points. This is pretty much TSP
| with an inverted goal.
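The edge weight described above can be sketched as a small helper
(hypothetical code, not from the OP's repository): the longest suffix of
block A that equals a prefix of block B, i.e. the number of table entries
saved by laying B out right after A.

```c
#include <stddef.h>

/* Hypothetical sketch: weight of the directed edge A -> B, i.e. the
   length of the longest suffix of block A that matches a prefix of
   block B -- the number of entries saved by placing B right after A. */
static size_t overlap(const int *a, size_t alen, const int *b, size_t blen)
{
    size_t max = alen < blen ? alen : blen;
    for (size_t k = max; k > 0; k--) {
        size_t i;
        for (i = 0; i < k; i++)
            if (a[alen - k + i] != b[i])
                break;
        if (i == k)
            return k; /* suffix of A of length k == prefix of B */
    }
    return 0;
}
```

Note that overlap(A, B) and overlap(B, A) generally differ, which is why
the graph is directed.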
| jcranmer wrote:
| I can see why you might be confused. Here's my attempt at an
| explanation:
|
| The goal is to find the permutation of blocks that minimizes
| the total length of the table. Each block contributes an amount
| to that length--the part that did not overlap with the previous
| block. This amount depends only on the interaction between the
| block and its predecessor. So we can
| make a graph where each node is a block to be encoded, each
| node has an edge to every other node, and the weight of that
| edge is the amount of overlap [1]. Compute an ordering of the
| nodes to visit that maximizes overlap, and you've a) done a TSP
| traversal and b) computed the optimal permutation.
|
| [1] Equivalently, using non-overlap gets you a minimum-weight
| problem, as in classical TSP. But the notes suggest it's framed
| as a maximization problem, finding the path that maximizes
| overlap--which is certainly easier to code.
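A minimal sketch of the greedy approximation lifthrasiir mentions
upthread, assuming the edge weights have been precomputed into a matrix
(all names hypothetical): try each block as a starting point, repeatedly
append the unvisited block reachable by the heaviest edge, and keep the
best tour found.

```c
#include <stddef.h>

#define NBLOCKS 3

/* ov[i][j] is assumed precomputed: how many entries of block j can be
   overlapped onto the tail of block i. Returns the total number of
   entries saved; the chosen permutation is written to order[]. */
static size_t greedy_order(const size_t ov[NBLOCKS][NBLOCKS],
                           size_t order[NBLOCKS])
{
    size_t best_total = 0;
    for (size_t start = 0; start < NBLOCKS; start++) {
        int used[NBLOCKS] = {0};
        size_t tour[NBLOCKS], cur = start, total = 0;
        used[start] = 1;
        tour[0] = start;
        for (size_t n = 1; n < NBLOCKS; n++) {
            size_t best = NBLOCKS, bestw = 0;
            for (size_t j = 0; j < NBLOCKS; j++)
                if (!used[j] && (best == NBLOCKS || ov[cur][j] > bestw)) {
                    best = j;
                    bestw = ov[cur][j];
                }
            used[best] = 1;
            tour[n] = best;
            total += bestw;
            cur = best;
        }
        if (start == 0 || total > best_total) {
            best_total = total;
            for (size_t n = 0; n < NBLOCKS; n++)
                order[n] = tour[n];
        }
    }
    return best_total;
}
```

Greedy gives no optimality guarantee (the underlying problem is
TSP-hard), but as noted upthread it tends to be good enough in practice.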
| rurban wrote:
| I did a much better optimization for musl/freebsd: about 10x
| smaller, with binary search into a single table of ranges and
| single exceptions, for both cases, lower and upper. A few ifs
| are better than wasting lots of cache.
|
| https://github.com/rurban/musl/commit/7bcdc7e6e1ed2c4424d706...
|
| But I'm also trying the automatic data optimization route,
| because Unicode is changing every year. Case mapping is easy,
| but the changing normalization tables cause minor data structure
| hiccups every year. Working on that here:
| https://github.com/rurban/optdata
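The range-table approach can be sketched like this (a toy three-entry
table with made-up coverage; the real generated tables in the linked
commit are much larger): each entry covers a span of code points sharing
one delta, and a single exception is just a one-point range.

```c
#include <stddef.h>

struct caserange { unsigned first, last; int delta; };

/* Illustrative excerpt only -- a real table is generated from
   UnicodeData.txt and has hundreds of entries, sorted by code point. */
static const struct caserange lower_ranges[] = {
    { 0x0041, 0x005A, 32 },   /* A-Z -> a-z                     */
    { 0x00C0, 0x00D6, 32 },   /* A-grave..O-diaeresis -> lower  */
    { 0x0130, 0x0130, -199 }, /* single exception: I-with-dot   */
};

static unsigned to_lower(unsigned c)
{
    size_t lo = 0, hi = sizeof lower_ranges / sizeof *lower_ranges;
    while (lo < hi) {                  /* binary search over ranges */
        size_t mid = lo + (hi - lo) / 2;
        if (c < lower_ranges[mid].first)
            hi = mid;
        else if (c > lower_ranges[mid].last)
            lo = mid + 1;
        else
            return c + lower_ranges[mid].delta;
    }
    return c;                          /* not cased: maps to itself */
}
```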
| huhtenberg wrote:
| At first glance, yours is going to be orders of magnitude
| slower.
| legulere wrote:
| I have the feeling that immutable data (by which I don't mean
| copy-on-write) has extreme potential for optimization, but
| there doesn't seem to be a good central collection of
| techniques.
|
| Some tips from me: mmap works really well and easily with
| immutable data, as the operating system will manage the memory
| pretty well. If you store strings ordered in a string table,
| you can compare the pointers for blazingly fast string
| comparison.
| rurban wrote:
| mmap'ing constant tables is nice, but still wastes your caches,
| which makes it much slower than a few tricks to save data and
| CPU, such as using proper algorithms and data structures. I'm
| working on an optimizing data compiler for such one-shot
| lookups in static const data. Range or interval queries would
| be different and easier (b-trees only), but one-shot lookups
| need to try much more.
| kccqzy wrote:
| I think if you have a large immutable lookup table, the
| compiler will put it in the .rodata section, which means the
| kernel can evict that mapping if it's unused and repopulate it
| via a page fault when it's needed.
| kccqzy wrote:
| > The casemap_lower comprises two distinct parts. It starts with
| 256 index entries, followed by 3866 deltas, to the grand total of
| 4122.
|
| This sounds like a code smell of using one variable for two
| distinct purposes. Why not separate them into two tables? After
| all the first lookup results in a base index to be added, and the
| second lookup results in a delta. They are different things.
| lifthrasiir wrote:
| Normally that's correct, but in this case you'd like to put the
| two tables as close together as possible, since you already
| know the second table will be accessed right after the first.
| As a side effect, the two tables share the same base address,
| which can be beneficial on some platforms. Technically it also
| allows for the possibility of overlapping the two tables (I
| think it's unlikely, but still).
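A miniature of the layout under discussion, as a sketch (hypothetical
values: only the ASCII block is populated, every other block aliases a
shared all-zero identity block, and only 16-bit code points are
covered -- the real casemap_lower is generated and far denser):

```c
enum { IDX = 256, BLK = 256 };

/* One array: 256 block indices, then the delta blocks themselves. */
static unsigned short casemap[IDX + 2 * BLK];

static void build(void)
{
    for (int i = 0; i < IDX; i++)
        casemap[i] = IDX + BLK;      /* default: shared identity block */
    casemap[0] = IDX;                /* block 0 holds the ASCII deltas */
    for (int c = 'A'; c <= 'Z'; c++)
        casemap[IDX + c] = 32;       /* 'A'..'Z' -> +32 */
}

/* The whole lookup: one >>, one &, two [] and a +. */
static unsigned to_lower(unsigned c)
{
    return c + casemap[casemap[c >> 8] + (c & 0xFF)];
}
```

Keeping the index and the deltas in one array means a single base
register serves both loads, which is part of the point being made here.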
| apankrat wrote:
| Spotted the packing technique in Wine's code and killed a few
| days seeing how much further it could be taken. Pretty much one
| of those rabbit holes that you occasionally run into in the
| course of regular programming and end up exploring instead of,
| you know, working. A good diversion; I suspect quite a few
| people can relate :)
| canucker2016 wrote:
| Based on the title, I thought you were Daniel Lemire with another
| writeup for benchmarking character conversion of some sort - for
| examples see https://hn.algolia.com/?q=lemire.me.
|
| Great work on reducing the size of the character conversion
| tables, but I don't see any evidence for the title of the web
| page.
|
| "Compact character case conversion" is more appropriate currently
| based on the data given in the readme.
|
| Data can be greatly compressed at the cost of access speed, or
| the compressed data could speed up access--with x86 CPU
| complexity these days, I wouldn't bet on determining the speed
| from a quick glance at the code.
|
| You state the case conversion is fast, yet I don't see any
| benchmarks backing up the assertion.
|
| I'd expect a few different types of input to be benchmarked -
| ASCII upper/lower case alphabet chars (typical A-Z,a-z &
| space/punctuation thrown in to make it look like typical written
| language text inputs), Unicode upper/lower case et al., and
| random characters.
|
| Not gonna leave people in the lurch though - Daniel Lemire has a
| recent post which can be used as a scaffold for benchmarking.
|
| Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
| blog/...
|
| Post: https://lemire.me/blog/2021/02/09/on-the-cost-of-
| converting-...
|
| If the code is to be timed on Windows, the code will need a
| Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see
| https://stackoverflow.com/questions/5404277/porting-clock-ge...
| lifthrasiir wrote:
| Anything that necessarily requires lookup tables will be
| (comparatively) slow. Unicode lookup tables are large, so
| lookups will be inherently slow for some inputs. For
| performance you should exploit the input distribution (e.g.
| ASCII is frequent even in Unicode texts, so detect a run of
| ASCII and go SIMD). You can also combine adjacent operations:
| instead of working with Unicode scalar values, you can directly
| accept UTF-8 strings and optimize the tables for them (the
| Oniguruma regexp engine of Ruby fame does this).
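The ASCII-run idea can be sketched in scalar code (hypothetical helper;
a real implementation would widen this loop with SIMD and then fall
through to the table-driven path at the first non-ASCII byte):

```c
#include <stddef.h>

/* Lowercase a byte string in place while it stays ASCII; stop at the
   first non-ASCII byte, where a full converter would switch to the
   table-driven Unicode path (stubbed out here). Returns bytes handled. */
static size_t ascii_tolower_run(char *s, size_t n)
{
    size_t i = 0;
    while (i < n && !((unsigned char)s[i] & 0x80)) {
        unsigned char c = (unsigned char)s[i];
        /* in-range test compiles without a data-dependent branch */
        s[i] = (char)(c + ((unsigned)(c - 'A') < 26u ? 32 : 0));
        i++;
    }
    return i; /* s + i starts a multi-byte UTF-8 sequence, if any */
}
```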
| apankrat wrote:
| You want a benchmark that shows that one >>, one &, two [] and
| a + are going to be _fast_?
|
| The test loop that wraps the lookup code will likely take as
| much CPU as what's being benchmarked, even if you use RDPMC,
| and even if you use it very sparingly.
___________________________________________________________________
(page generated 2021-02-22 23:00 UTC)