[HN Gopher] Fast character case conversion (or how to compress s...
___________________________________________________________________
Fast character case conversion (or how to compress sparse arrays)
Author : donmcc
Score : 49 points
Date : 2021-10-13 17:15 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dhsysusbsjsi wrote:
| How do these benchmark? I'd like to know if the extra cost of the
| memory lookups is balanced by the smaller table and better cache
| locality.
| wheybags wrote:
| Me too, considering the uncompressed table is tiny anyway, so
| saving ram for its own sake is not a useful goal here.
| jhallenworld wrote:
| I did a four-level radix tree for the Unicode character class and
| case conversion support in JOE (Joe's Own Editor, so that you can
| edit Unicode on non-Unicode systems): index sizes of 7 bits, 5, 5
| and 4 plus ASCII has a fast-path table.
|
| The main difference is that it computes the tables at startup
| from the sorted Unicode intervals. So the construction code has
| to be fast. The same code is also used for user character classes
| in regular expressions.
|
| Anyway, it builds them in two passes. First pass it de-duplicates
| nodes, but only the previously constructed node is a candidate
| for de-duplication. This keeps the memory usage low during
| construction. De-duplicated nodes can still be modified during
| this construction, so they may be re-duplicated (there is a
| reference counter to determine when this happens).
|
| Second pass (after all data is loaded, no more changes allowed),
| it globally de-duplicates the leaf nodes using a hash table. Many
| of the leaf nodes are duplicates (and not just the all zero
| ones).
| GrayShade wrote:
| Oh wow, JOE was the first editor I've used on Linux -- ages ago
| -- and it's still so much better than other very popular
| choices today. It's friendly, fast, feature-packed and the
| keyboard shortcuts (of WordStar heritage) are lovely.
|
| JOE is one of my favourite software projects, one I can only
| look up to. I don't know anything about you, but thank you!
| edflsafoiewq wrote:
| Previously: https://news.ycombinator.com/item?id=26227754
| willvarfar wrote:
| My experience with this is to fast-path the common cases (ie
| ascii and vast swathes of emptiness) with hardcoded bounds
| comparisons, so the memory access is only done as a slow path
| when needed.
| joebob42 wrote:
| Although as with any "common case performance improvement" this
| can be risky if what the common case is for a given server
| changes. We learned that this was what our server did when we
| launched in a new country and our auth server started needing
| 1.5x the computer per query. There was an ASCII special case
| and then an especially bad / expensive Unicode lookup when that
| failed, which we started triggering way more often.
| edflsafoiewq wrote:
| quickjs has pretty cool case conversion too. It special cases
| ASCII, then does the rest with a binary search through some kinda
| run-length encoded table. There's more code but the table itself
| is only about 2K for both case directions.
|
| https://github.com/bellard/quickjs/blob/b5e62895c619d4ffc75c...
___________________________________________________________________
(page generated 2021-10-13 23:01 UTC)