[HN Gopher] Fast character case conversion, or how to really com...
       ___________________________________________________________________
        
       Fast character case conversion, or how to really compress sparse
       arrays
        
       Author : apankrat
       Score  : 32 points
       Date   : 2021-02-22 17:58 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lifthrasiir wrote:
       | Ha, that's exactly what I've done for rust-encoding [1]! (Now
       | obsolete, go use encoding_rs instead.) For TSP I resorted to a
       | simple greedy algorithm, which was good enough for me.
       | 
        | To be honest though, the problem of shrinking a large table while
        | keeping it randomly accessible is hardly new (e.g. perfect
        | hashing), and pretty much all of it can be automated with only a
        | little expert guidance. I'd like to see a neat solution that can
        | interactively search remappings and emit clean code in various
        | languages.
        | 
        | (As a fellow Unicode commentator, I should also mention that the
        | OP only deals with what Unicode calls _simple_ case mappings.
        | But those are pretty much the only mappings worth optimizing
        | anyway, given their prevalence.)
       | 
       | [1] https://github.com/lifthrasiir/rust-
       | encoding/blob/master/src...
        
       | bear8642 wrote:
        | Nice! Bit confused by the link to TSP, but cool!
        
         | apankrat wrote:
          | Say we have our N blocks. We make these the nodes of a graph,
          | and every node is connected to all other nodes.
          | 
          | An edge from A to B represents the case where block A
          | immediately precedes block B in the layout. The weight of that
          | edge is the AB overlap. Note that the edge between the same two
          | nodes may carry a different weight in the opposite direction,
          | because that one is the BA overlap.
          | 
          | So what we are after is the heaviest path through this graph
          | that passes through every node exactly once. This can be
          | reduced to finding the heaviest cyclic path for a specific
          | starting point (excluding the cheapest edge from the solution,
          | and then iterating over all starting points). This is pretty
          | much TSP with an inverted goal.
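          | 
          | For illustration, the edge weight could be computed roughly
          | like this (a minimal sketch; the block size and names here are
          | made up):
          | 
          |   #include <stdint.h>
          |   #include <string.h>
          | 
          |   /* Weight of the edge A -> B: the longest suffix of block A
          |    * that equals a prefix of block B, i.e. how many entries B
          |    * can reuse from A's tail when the two are laid out back to
          |    * back in the flat array. */
          |   static size_t overlap(const uint16_t *a,
          |                         const uint16_t *b, size_t n)
          |   {
          |       for (size_t k = n; k > 0; k--)     /* longest first */
          |           if (memcmp(a + n - k, b, k * sizeof *a) == 0)
          |               return k;
          |       return 0;
          |   }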
        
         | jcranmer wrote:
         | I can see why you might be confused. Here's my attempt at an
         | explanation:
         | 
         | The goal is to find the permutation of blocks that minimizes
         | the total length of the table. Each block contributes an amount
          | of that length--the amount that did not overlap with the
          | previous block. This amount depends only on the interaction
          | between the block and the block before it. So we can
         | make a graph where each node is a block to be encoded, each
         | node has an edge to every other node, and the weight of that
         | edge is the amount of overlap [1]. Compute an ordering of the
         | nodes to visit that maximizes overlap, and you've a) done a TSP
         | traversal and b) computed the optimal permutation.
         | 
          | [1] Equivalently, using non-overlap as the weight gets you a
          | minimum-weight problem, as in classical TSP. But the notes
          | suggest it's framed
         | as a maximal problem, finding the path that maximizes overlap--
         | which is certainly easier to code.
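          | 
          | One greedy way to approximate that traversal (a sketch of the
          | nearest-neighbour heuristic rather than an exact TSP solver;
          | the block layout and names are assumed, and overlap() is the
          | suffix/prefix weight sketched in the sibling comment):
          | 
          |   #include <stdint.h>
          |   #include <stddef.h>
          | 
          |   #define NBLOCKS 256
          | 
          |   /* suffix/prefix overlap of two blocks, as sketched above */
          |   size_t overlap(const uint16_t *a,
          |                  const uint16_t *b, size_t n);
          | 
          |   /* hypothetical: the table split into 256-entry blocks */
          |   extern const uint16_t blocks[NBLOCKS][256];
          | 
          |   /* Start anywhere, then repeatedly append the unused block
          |    * with the heaviest edge (largest overlap) from the
          |    * current one. */
          |   static void order_blocks(int order[NBLOCKS])
          |   {
          |       int used[NBLOCKS] = { 0 };
          |       int cur = 0;
          | 
          |       order[0] = 0;
          |       used[0] = 1;
          | 
          |       for (int i = 1; i < NBLOCKS; i++) {
          |           int best = -1;
          |           size_t best_w = 0;
          | 
          |           for (int j = 0; j < NBLOCKS; j++) {
          |               if (used[j])
          |                   continue;
          |               size_t w = overlap(blocks[cur], blocks[j], 256);
          |               if (best < 0 || w > best_w) {
          |                   best = j;
          |                   best_w = w;
          |               }
          |           }
          |           order[i] = cur = best;
          |           used[best] = 1;
          |       }
          |   }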
        
       | rurban wrote:
       | I did a much better optimization for musl/freebsd, about 10x
       | smaller, with binary search into a single table of ranges and
       | single exceptions, for both cases, lower and upper. A few ifs are
       | better than wasting lots of cache.
       | 
       | https://github.com/rurban/musl/commit/7bcdc7e6e1ed2c4424d706...
       | 
        | But I'm also trying the automatic data optimization route,
        | because Unicode is changing every year. Case mapping is easy,
        | but the changing normalization tables cause minor data structure
        | hiccups every year. Working on that here:
        | https://github.com/rurban/optdata
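        | 
        | Not the actual musl table, but a minimal sketch of the
        | ranges-plus-deltas idea, with a couple of hypothetical entries:
        | 
        |   #include <stdint.h>
        |   #include <stddef.h>
        | 
        |   /* Sorted, non-overlapping ranges; every code point in
        |    * [first, last] maps to itself + delta.  Single exceptions
        |    * become one-entry ranges. */
        |   struct caserange { uint32_t first, last; int32_t delta; };
        | 
        |   static const struct caserange lower_ranges[] = {
        |       { 0x0041, 0x005A, 32 },   /* A-Z -> a-z */
        |       { 0x00C0, 0x00D6, 32 },   /* À-Ö -> à-ö */
        |       /* ... generated from UnicodeData.txt ... */
        |   };
        | 
        |   static uint32_t to_lower(uint32_t c)
        |   {
        |       size_t n = sizeof lower_ranges / sizeof *lower_ranges;
        |       size_t lo = 0, hi = n;
        | 
        |       while (lo < hi) {               /* plain binary search */
        |           size_t mid = lo + (hi - lo) / 2;
        |           if (c < lower_ranges[mid].first)
        |               hi = mid;
        |           else if (c > lower_ranges[mid].last)
        |               lo = mid + 1;
        |           else
        |               return c + (uint32_t)lower_ranges[mid].delta;
        |       }
        |       return c;                       /* not cased */
        |   }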
        
         | huhtenberg wrote:
          | At first glance, yours is going to be orders of magnitude
          | slower.
        
       | legulere wrote:
        | I have the feeling that immutable data (by which I don't mean
        | copy-on-write) has enormous potential for optimization, but
        | there doesn't seem to be a good central collection of such
        | techniques.
        | 
        | Some tips from me: mmap works really well and easily with
        | immutable data, as the operating system will manage the memory
        | pretty well. If you store strings deduplicated in a string table
        | (each distinct string exactly once) you can compare the pointers
        | for blazingly fast string comparison.
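        | 
        | A toy version of the second tip, assuming the strings get
        | interned up front (names are made up):
        | 
        |   #include <string.h>
        | 
        |   /* every distinct string lives exactly once in this table */
        |   static const char *const strtab[] = { "alpha", "beta", "gamma" };
        | 
        |   /* map an arbitrary string to its canonical table entry */
        |   static const char *intern(const char *s)
        |   {
        |       size_t n = sizeof strtab / sizeof *strtab;
        |       for (size_t i = 0; i < n; i++)
        |           if (strcmp(strtab[i], s) == 0)
        |               return strtab[i];
        |       return NULL;   /* not in the table */
        |   }
        | 
        |   /* after interning, equality is a single pointer compare:
        |    *   intern("beta") == intern("beta")   */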
        
         | rurban wrote:
          | mmap'ing constant tables is nice, but it still wastes your
          | caches, which makes it much slower than a few tricks to save
          | data and CPU, such as using proper algorithms and data
          | structures. I'm working on an optimizing data compiler for
          | such one-shot lookups in static const data. Range or interval
          | queries would be different and easier (btrees only), but
          | one-shot lookups need to try much more.
        
         | kccqzy wrote:
          | I think if you have a large immutable lookup table, the
          | compiler will put it in the rodata section, which means the
          | kernel can evict those pages if they're unused and repopulate
          | them via a page fault when they're needed.
        
       | kccqzy wrote:
       | > The casemap_lower comprises two distinct parts. It starts with
       | 256 index entries, followed by 3866 deltas, to the grand total of
       | 4122.
       | 
       | This sounds like a code smell of using one variable for two
       | distinct purposes. Why not separate them into two tables? After
       | all the first lookup results in a base index to be added, and the
       | second lookup results in a delta. They are different things.
        
         | lifthrasiir wrote:
          | Normally that's correct, but in this case you would like to put
          | the two tables as close together as possible, since you already
          | know the second table will be accessed right after the first.
          | As a side effect the two tables share the same base address,
          | which can be beneficial on some platforms. Technically it also
          | opens up the possibility of overlapping the two tables (I think
          | it's unlikely, but still).
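          | 
          | Something along these lines, going by the sizes quoted in the
          | article (the exact layout details are assumed):
          | 
          |   #include <stdint.h>
          | 
          |   /* One flat array, one base address.  Entries [0 .. 255]
          |    * are per-block offsets into this same array (with the
          |    * 256-entry header already folded into them), entries
          |    * [256 .. 4121] are the packed deltas, adjacent blocks
          |    * overlapped where the data allows it. */
          |   extern const uint16_t casemap_lower[4122];
          | 
          | With two separate arrays the compiler has to keep two base
          | addresses live; with one array a single register covers both
          | lookups, which is presumably the platform benefit meant above.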
        
       | apankrat wrote:
        | Spotted the packing technique in Wine's code and killed a few
        | days seeing how much further it could be taken. Pretty much one
        | of those rabbit holes that you occasionally run into in the
        | course of regular programming and end up exploring instead of,
        | you know, working. A good diversion; I suspect quite a few
        | people can relate :)
        
       | canucker2016 wrote:
       | Based on the title, I thought you were Daniel Lemire with another
       | writeup for benchmarking character conversion of some sort - for
       | examples see https://hn.algolia.com/?q=lemire.me.
       | 
        | Great work on reducing the size of the character conversion
        | tables, but I don't see any evidence for the "fast" claim in the
        | title of the page.
        | 
        | "Compact character case conversion" would currently be more
        | appropriate, based on the data given in the readme.
        | 
        | Data can be greatly compressed at the cost of access speed, or
        | the compressed data could end up speeding access up - with x86
        | CPU complexity these days, I wouldn't bet on judging the speed
        | from a quick glance at the code.
       | 
       | You state the case conversion is fast, yet I don't see any
       | benchmarks backing up the assertion.
       | 
       | I'd expect a few different types of input to be benchmarked -
       | ASCII upper/lower case alphabet chars (typical A-Z,a-z &
       | space/punctuation thrown in to make it look like typical written
       | language text inputs), Unicode upper/lower case et al., and
       | random characters.
       | 
       | Not gonna leave people in the lurch though - Daniel Lemire has a
       | recent post which can be used as a scaffold for benchmarking.
       | 
       | Code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
       | blog/...
       | 
       | Post: https://lemire.me/blog/2021/02/09/on-the-cost-of-
       | converting-...
       | 
       | If the code is to be timed on Windows, the code will need a
       | Windows version of "clock_gettime(CLOCK_REALTIME, ...)" - see
       | https://stackoverflow.com/questions/5404277/porting-clock-ge...
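        | 
        | Something along these lines would do, assuming
        | QueryPerformanceCounter is acceptable for benchmark timing (a
        | sketch, not a drop-in clock_gettime: it yields a monotonic
        | timestamp rather than wall-clock time, which is fine for
        | measuring elapsed time; struct and function names are made up):
        | 
        |   #ifdef _WIN32
        |   #include <windows.h>
        | 
        |   struct bench_ts { long long sec; long nsec; };
        | 
        |   static void bench_now(struct bench_ts *ts)
        |   {
        |       LARGE_INTEGER f, t;
        |       QueryPerformanceFrequency(&f);   /* ticks per second */
        |       QueryPerformanceCounter(&t);     /* ticks since boot  */
        |       ts->sec  = t.QuadPart / f.QuadPart;
        |       ts->nsec = (long)((t.QuadPart % f.QuadPart)
        |                         * 1000000000LL / f.QuadPart);
        |   }
        |   #endif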
        
         | lifthrasiir wrote:
          | Anything that necessarily requires lookup tables will be
          | (comparatively) slow. Unicode lookup tables are large, so the
          | conversion is inherently slow for some inputs. For performance
          | you should make assumptions about the input distribution (e.g.
          | ASCII is frequent even in Unicode text, so detect a run of
          | ASCII and go SIMD). You can also combine adjacent operations:
          | instead of working with Unicode scalar values, you can directly
          | accept UTF-8 strings and optimize the tables for them (the
          | Oniguruma regexp engine, of Ruby fame, does this).
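          | 
          | Even without SIMD the ASCII-run detection is cheap; a portable
          | sketch checking eight bytes at a time (real code would use
          | SSE2/NEON, but the idea is the same):
          | 
          |   #include <stdint.h>
          |   #include <string.h>
          | 
          |   /* length of the leading pure-ASCII run: a byte is ASCII
          |    * iff its top bit is clear */
          |   static size_t ascii_run(const unsigned char *s, size_t n)
          |   {
          |       size_t i = 0;
          |       uint64_t w;
          | 
          |       while (i + 8 <= n) {
          |           memcpy(&w, s + i, 8);      /* unaligned-safe load */
          |           if (w & UINT64_C(0x8080808080808080))
          |               break;                 /* non-ASCII byte ahead */
          |           i += 8;
          |       }
          |       while (i < n && s[i] < 0x80)   /* finish byte by byte */
          |           i++;
          |       return i;
          |   }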
        
         | apankrat wrote:
         | You want a benchmark that shows that one >>, one &, two [] and
         | a + are going to be _fast_?
         | 
          | The test loop that wraps the lookup code will likely take as
          | much CPU as what's being benchmarked, even if you use RDPMC
          | and even if you use it very sparingly.
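          | 
          | For the record, the lookup in question is roughly of this
          | shape (the exact table layout and entry width in the repo may
          | differ; this assumes 16-bit entries with the deltas stored
          | modulo 2^16):
          | 
          |   #include <stdint.h>
          | 
          |   /* 256 block offsets followed by the packed deltas */
          |   extern const uint16_t casemap_lower[4122];
          | 
          |   static uint16_t to_lower(uint16_t c)
          |   {
          |       /* one >>, one &, two [] and a +; the inner index add
          |        * typically folds into the addressing mode */
          |       return (uint16_t)(c +
          |           casemap_lower[casemap_lower[c >> 8] + (c & 0xFF)]);
          |   }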
        
       ___________________________________________________________________
       (page generated 2021-02-22 23:00 UTC)