[HN Gopher] Fast character case conversion (or how to compress s...
       ___________________________________________________________________
        
       Fast character case conversion (or how to compress sparse arrays)
        
       Author : donmcc
       Score  : 49 points
       Date   : 2021-10-13 17:15 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dhsysusbsjsi wrote:
       | How do these benchmark? I'd like to know if the extra cost of the
       | memory lookups is balanced by the smaller table and better cache
       | locality.
        
         | wheybags wrote:
         | Me too, considering the uncompressed table is tiny anyway, so
         | saving ram for its own sake is not a useful goal here.
        
       | jhallenworld wrote:
       | I did a four-level radix tree for the Unicode character class and
       | case conversion support in JOE (Joe's Own Editor, so that you can
       | edit Unicode on non-Unicode systems): index sizes of 7 bits, 5, 5
       | and 4 plus ASCII has a fast-path table.
       | 
       | The main difference is that it computes the tables at startup
       | from the sorted Unicode intervals. So the construction code has
       | to be fast. The same code is also used for user character classes
       | in regular expressions.
       | 
       | Anyway, it builds them in two passes. First pass it de-duplicates
       | nodes, but only the previously constructed node is a candidate
       | for de-duplication. This keeps the memory usage low during
       | construction. De-duplicated nodes can still be modified during
       | this construction, so they may be re-duplicated (there is a
       | reference counter to determine when this happens).
       | 
       | Second pass (after all data is loaded, no more changes allowed),
       | it globally de-duplicates the leaf nodes using a hash table. Many
       | of the leaf nodes are duplicates (and not just the all zero
       | ones).
        
         | GrayShade wrote:
         | Oh wow, JOE was the first editor I've used on Linux -- ages ago
         | -- and it's still so much better than other very popular
         | choices today. It's friendly, fast, feature-packed and the
         | keyboard shortcuts (of WordStar heritage) are lovely.
         | 
         | JOE is one of my favourite software projects, one I can only
         | look up to. I don't know anything about you, but thank you!
        
       | edflsafoiewq wrote:
       | Previously: https://news.ycombinator.com/item?id=26227754
        
       | willvarfar wrote:
       | My experience with this is to fast-path the common cases (ie
       | ascii and vast swathes of emptiness) with hardcoded bounds
       | comparisons, so the memory access is only done as a slow path
       | when needed.
        
         | joebob42 wrote:
         | Although as with any "common case performance improvement" this
         | can be risky if what the common case is for a given server
         | changes. We learned that this was what our server did when we
         | launched in a new country and our auth server started needing
         | 1.5x the computer per query. There was an ASCII special case
         | and then an especially bad / expensive Unicode lookup when that
         | failed, which we started triggering way more often.
        
       | edflsafoiewq wrote:
       | quickjs has pretty cool case conversion too. It special cases
       | ASCII, then does the rest with a binary search through some kinda
       | run-length encoded table. There's more code but the table itself
       | is only about 2K for both case directions.
       | 
       | https://github.com/bellard/quickjs/blob/b5e62895c619d4ffc75c...
        
       ___________________________________________________________________
       (page generated 2021-10-13 23:01 UTC)