# Fast character case conversion

... or how to really compress sparse arrays

Converting strings and characters between lower and upper cases is a very common need. In particular, case conversion is often used to implement case-insensitive comparison, an operation that is often present on the program's fast paths as a part of data container lookups and content manipulation.

So it is usually desirable to make case conversions as fast as possible.

In this post we are going to look at one of the options - very fast case conversion using compressed lookup tables - and also at some options for compressing these tables even further.

If in a rush, you can jump straight to the Conclusion.

## Lookup tables

The simplest way to do the case conversion is with a lookup table:

```c
const char to_lower[256] = { ... };   // lower case versions of every character
const char to_upper[256] = { ... };   // upper case versions of every character

char a = to_lower[ 'A' ];
char Z = to_upper[ 'z' ];
...
```

This is also the fastest way.
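For example, the case-insensitive comparison mentioned earlier can be built directly on top of such a table. A minimal sketch, reusing the `to_lower` table from the snippet above:

```c
extern const char to_lower[256];   // the table from the snippet above

// strcasecmp()-style comparison driven by the lookup table
int compare_nocase(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = to_lower[ (unsigned char)*a ];
        int cb = to_lower[ (unsigned char)*b ];

        if (ca != cb) return (ca < cb) ? -1 : +1;
        if (ca == 0)  return 0;   // both strings ended at the same time
    }
}
```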
The cost of speed is the space needed to hold the lookup tables, the good old space-time tradeoff.

If we are working with the 7-bit ASCII set, the tables will take 512 bytes. These can be allocated and initialized statically (and therefore stored in the executable binary) or they can be created at run-time. This is doable because the conversion rules are trivial:

```c
if ('A' <= ch && ch <= 'Z') ch += 'a' - 'A';   // to lower
if ('a' <= ch && ch <= 'z') ch += 'A' - 'a';   // to upper
```

In fact, for this case the benefits of using a lookup table are marginal at best. The above `if` construct works just as well.

## 8-bit ASCII

Case conversion for an 8-bit ASCII set depends on the locale used. In terms of speeding things up the same applies as before - either hardcode a static conversion table for each locale of interest, or code the conversion rules and then initialize the lookup tables at run-time.

## Unicode

Unicode is where it gets interesting.

At the time of writing the Unicode Standard is at its 13th revision. The principal list of characters is UnicodeData.txt, with its format described here. The executive summary of this fascinating read is this:

1. There are around 1400 characters with lower/upper cases
2. Some have more than one lower/upper case
3. Some use multiple characters for one case and a single character for another

For the purpose of this post we are going to ignore #2 and #3 and focus on characters that convert one-to-one between their cases. We will also focus on conversion of characters from the 0x0000 - 0xD7FF and 0xE000 - 0xFFFF ranges, i.e. those that encode as a single 16-bit word under UTF-16 encoding.

## Unicode, continued

A lookup table for the UTF-16 case requires 2 x 65536 bytes per case, 256KB in total. Both tables will also be very sparse, with just 2% of a non-trivial fill.

Which brings us to the interesting part - is it possible to somehow compress these tables without sacrificing the speed of lookups?

Needless to say, the answer is Yes, and one of the answers can be found in the Wine project or, more specifically, in its unicode.h. Pared down a bit, it looks like this -

```c
wchar_t tolower(wchar_t ch)
{
    extern const wchar_t casemap_lower[];
    return ch + casemap_lower[casemap_lower[ch >> 8] + (ch & 0xff)];
}
```

Two bit operations, two additions and two memory references.

The casemap_lower table lives in casemap.c and it is, quite incredibly, just 8KB in size:

```c
const wchar_t wine_casemap_lower[4122] = { ... }
```

Very fast, beautifully terse and, at first glance, mysterious. Here's how it works.

## Table compression

Both tolower and toupper tables are compressed the same way, so we'll just look at the compression of the former. There are 1158 conversion rules and they are listed here.

So here's what's going on:

1. The first insight is to store deltas between the lower and upper case codes rather than their absolute values. That is, we do `ch += lookup[...]` instead of `ch = lookup[...]`. This fills the table with 0s for all non-caseable characters, creating lots of redundancy and making it way easier to compress.

2. Next, the full set of wchar_t values - all 65536 of them - is split into 256 blocks, 256 characters each. Entries for characters starting with 00xx (in hex) go into the first block, 01xx - into the second, etc.

```
block 00 - [256 entries for values 0000 to 00FF]
block 01 - [256 entries for values 0100 to 01FF]
...
block FF - [256 entries for values FF00 to FFFF]
```

3. Once this is done, we may notice that only 17 blocks are not completely zero-filled.
These blocks are: 00, 01, 02, 03, 04, 05, 10, 13, 1C, 1E, 1F, 21, 24, 2C, A6, A7, FF

The remaining 239 blocks will contain just zeroes and nothing else. So we can store just these 17 blocks plus a zero-filled block and then use an index to map a block's ID to its actual data in the table:

```
   Index                    |   Table
 -------------------------- + -------------------------------------------
  block    offset           |   offset   data
  00xx  ->  0000            |    0000    [... 256 deltas ...]
  01xx  ->  0100            |    0100    [... 256 deltas ...]
  ...                       |    0200    [... 256 deltas ...]
  05xx  ->  0500            |    ...
  06xx  ->  1200  (zeroes)  |    1100    [... 256 deltas ...]
  07xx  ->  1200  (zeroes)  |    1200    [ 0 0 0 0 0 0 0 ...]  (zeroes)
  ...                       |
  FExx  ->  1200  (zeroes)  |
  FFxx  ->  1100            |
```

This compresses the table down to 4608 items (18 x 256) + the size of the index. This is already excellent - around 10K instead of 128K - but it can be compressed further.

4. The next insight is that the remaining blocks still contain a lot of zeroes. More specifically, one block may end with lots of zeroes and the following one may start with a lot of them. Here's, for example, the end of the 0200 block -

```
...
0248 | 0001 0000 0001 0000 0001 0000 0001 0000   <- last non-zeros
0250 | 0000 0000 0000 0000 0000 0000 0000 0000
0258 | 0000 0000 0000 0000 0000 0000 0000 0000
...
02e8 | 0000 0000 0000 0000 0000 0000 0000 0000
02f0 | 0000 0000 0000 0000 0000 0000 0000 0000
02f8 | 0000 0000 0000 0000 0000 0000 0000 0000   <- end of the block
```

And here's the start of the 0300 block -

```
0300 | 0000 0000 0000 0000 0000 0000 0000 0000   <- start of the block
0308 | 0000 0000 0000 0000 0000 0000 0000 0000
0310 | 0000 0000 0000 0000 0000 0000 0000 0000
...
0360 | 0000 0000 0000 0000 0000 0000 0000 0000
0368 | 0000 0000 0000 0000 0000 0000 0000 0000
0370 | 0001 0000 0001 0000 0000 0000 0001 0000   <- first non-zeros
...
```

So the 0300 block can be "pulled up" to overlap with the 0200 block and its index entry adjusted to match. The blocks can be squished together, if you will. Like so -

```
...
0248 | 0001 0000 0001 0000 0001 0000 0001 0000
0250 | 0000 0000 0000 0000 0000 0000 0000 0000
0258 | 0000 0000 0000 0000 0000 0000 0000 0000
0260 | 0000 0000 0000 0000 0000 0000 0000 0000
0268 | 0000 0000 0000 0000 0000 0000 0000 0000
0270 | 0000 0000 0000 0000 0000 0000 0000 0000
0278 | 0000 0000 0000 0000 0000 0000 0000 0000
0280 | 0000 0000 0000 0000 0000 0000 0000 0000
0288 | 0000 0000 0000 0000 0000 0000 0000 0000
0290 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0300
0298 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0308
02a0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0310
02a8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0318
02b0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0320
02b8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0328
02c0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0330
02c8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0338
02d0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0340
02d8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0348
02e0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0350
02e8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0358
02f0 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0360
02f8 | 0000 0000 0000 0000 0000 0000 0000 0000   | 0368
     | 0001 0000 0001 0000 0000 0000 0001 0000   | 0370
...
```

This little maneuver reduces our table by 112 items (14 rows x 8), and that's just for these two blocks.

5. Taking the idea a bit further, we have no constraints on where in the table our zero-filled block should be. So we find a pair of blocks that have the largest zero-filled overlap and we stick our zero-filled block between them. Wine places this block in the 7th spot and, after a squish, this brings the table size down to 3866 items.

6. Finally, the index and the table are merged into a single array. The resulting casemap_lower comprises two distinct parts: it starts with 256 index entries, followed by 3866 deltas, for a grand total of 4122.

Going back to the tolower() code -

```c
...
return ch + casemap_lower[casemap_lower[ch >> 8] + (ch & 0xff)];
```

`casemap_lower[ch >> 8]` is the index lookup which points at the block's start in the data section, followed by the retrieval of ch's delta value within the block at the offset of `ch & 0xff`.

Nice, isn't it?

If you know the origins of this code (or its author!), let me know and I'll add a proper reference. It feels like something that should've been really well researched back in the 1970s, if not earlier.
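To see the whole trick in one self-contained place, here is a toy version of the same layout - a made-up 16-character alphabet with a block size of 4, not the actual Wine data - squished and merged exactly as described above:

```c
#include <assert.h>

/*
 * Toy illustration: 16 "characters", block size of 4 (shift by 2, mask with 0x3).
 * Only characters 4 and 6 are caseable (delta of +1), so only one block is
 * non-zero. The array holds 4 index entries, then that block {1,0,1,0} at
 * offset 4, then a zero-filled block squished into its tail at offset 7
 * (it reuses the trailing zero of the previous block).
 */
static const unsigned short toy_casemap[11] = {
    7, 4, 7, 7,     /* index: block 1 starts at offset 4, all others share the zero block at offset 7 */
    1, 0, 1, 0,     /* block 1: deltas for characters 4..7 */
          0, 0, 0   /* the rest of the shared zero-filled block */
};

static unsigned toy_tolower(unsigned ch)
{
    return ch + toy_casemap[toy_casemap[ch >> 2] + (ch & 0x3)];
}

int main(void)
{
    assert(toy_tolower(4)  == 5);    /* caseable, delta +1 */
    assert(toy_tolower(6)  == 7);    /* caseable, delta +1 */
    assert(toy_tolower(5)  == 5);    /* not caseable       */
    assert(toy_tolower(13) == 13);   /* lands in the shared zero block */
    return 0;
}
```

The index entries are plain offsets into the very same array, which is what makes the one-line lookup possible.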
## But wait...

We can do better.

## Using smaller blocks

Instead of splitting the original 65536 set into 256 x 256 blocks, we can try other block sizes.

Using larger blocks will reduce the size of the index, but it will also cause a greater percentage of blocks to contain at least one non-zero entry. The "squish potential" will also increase with the block size, but the number of block pairs will be smaller.

```
  Block size | Used blocks |  Items  | Squish | Data size
     2048    x       8     =  16384  -  4629  =  11755
     1024    x      10     =  10240  -  1757  =   8483
      512    x      13     =   6656  -  1101  =   5555
      256    x      18     =   4608  -   742  =   3866    <- the original
      128    x      28     =   3584  -   539  =   3045
       64    x      44     =   2816  -   578  =   2238
       32    x      55     =   1760  -   345  =   1415
       16    x      66     =   1056  -   187  =    869
        8    x      75     =    600  -   109  =    491
        4    x      78     =    312  -    40  =    272
        2    x      81     =    162  -     0  =    162    <- unsquishable
```

The last column is the number of items in the table. But let's not forget the index:

```
  Block size | Index size | Data size | Total size
     2048          32     +   11755   =   11787
     1024          64     +    8483   =    8547
      512         128     +    5555   =    5683
      256         256     +    3866   =    4122    <- the original
      128         512     +    3045   =    3557
       64        1024     +    2238   =    3262    <- the smallest
       32        2048     +    1415   =    3463
       16        4096     +     869   =    4965
        8        8192     +     491   =    8683
        4       16384     +     272   =   16656
        2       32768     +     162   =   32930
```

That is, by reducing the block size to 64 items we can compress the table down to 3262 items (6524 bytes). This is a ~ 21% reduction compared to the original of 4122 items (or 8244 bytes). The conversion function itself will look like so:

```c
...
return ch + casemap_lower[casemap_lower[ch >> 6] + (ch & 0x3f)];
```

## Rearranging blocks

Instead of finding the best spot for the zero-filled block, we can treat it as any other block and then look for the most squishable permutation of all blocks.

If we squint and abstract a bit, this search looks suspiciously like the Travelling Salesman Problem, meaning no shortcuts and a rather unpleasant running time. More specifically, the search reduces to finding (1) the heaviest path (2) of a fixed length (3) on a bidirectional, fully-meshed, weighted graph.

## Full search

For larger block sizes the number of unique blocks will be quite low - 10, 13, etc. - so we can just brute-force our way through all possible block permutations. A moderately optimized code can do around 50M checks per second.

With the block size of 2048 we have 8 unique blocks. That's just 40,320 permutations and the best exact squish is 5344.

With the block size of 1024 we have 10 unique blocks. That's 3,628,800 permutations, which take 0.1 sec to find the answer - 2861.

With the block size of 512 we have 13 unique blocks. That's 6,227,020,800 permutations or ~ 100 seconds of processing time. Still doable, and the answer is 1900.

But with the block size of 256 - Wine's original - the number grows to 18!, or about 49 months of processing time. No bueno.
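For reference, scoring one candidate permutation is cheap: two adjacent blocks can overlap by min(trailing zeroes of the first, leading zeroes of the second), and the total squish of a permutation is the sum of these pairwise overlaps. A minimal sketch (not the exact code behind the numbers above):

```c
#include <stddef.h>

/* How many zero entries a block starts with */
static size_t leading_zeroes(const unsigned short *blk, size_t n)
{
    size_t i = 0;
    while (i < n && blk[i] == 0) i++;
    return i;
}

/* How many zero entries a block ends with */
static size_t trailing_zeroes(const unsigned short *blk, size_t n)
{
    size_t i = 0;
    while (i < n && blk[n - 1 - i] == 0) i++;
    return i;
}

/* Total overlap for the layout blocks[perm[0]], blocks[perm[1]], ... */
static size_t permutation_squish(const unsigned short **blocks, size_t block_size,
                                 const size_t *perm, size_t count)
{
    size_t total = 0;

    for (size_t i = 0; i + 1 < count; i++)
    {
        size_t tail = trailing_zeroes(blocks[perm[i]],     block_size);
        size_t head = leading_zeroes (blocks[perm[i + 1]], block_size);

        total += (tail < head) ? tail : head;
    }
    return total;
}
```

In a real search the leading and trailing zero counts would be precomputed once per block, which is what makes tens of millions of permutation checks per second realistic.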
## Heuristic search

Another approach is to try and construct "good" block sequences in a pseudo-intelligent way.

It just so happens that the underlying N x N matrices (of graph edge weights) show some well-pronounced patterns - there are zero-filled rows and columns, non-zero cells are either rather large or fairly small with nothing in between, etc. This appears to help in getting decent results even from some rudimentary "logical" guesses.

For example, we can pick the most squishable pair of blocks and then keep prepending and appending it with blocks that squish the best with it. This alone yields a squish of 972 for Wine's original 256 block size - a decent improvement over the 742 of the original.

We can also check how just appending or just prepending fares in comparison. We can also prefer either appending or prepending when both yield the same improvement.

Next, we can construct our block sequence out of multiple parts. That is, our sequence assembly loop will be making a choice between (1) adding a block to an existing part, (2) merging two parts with a block or (3) starting yet another part from a pair of blocks.

Once we have our sequence, we can check if splitting it into two parts and swapping the parts around would yield a better squish. We can also see if moving an item to another spot, or swapping it with another item, would increase the squish.

These are essentially random guesses, but it turns out that for whatever magical reason they do work when tried together. In fact, they work so well that they manage to find the exact solution for the 2048, 1024 and 512 block sizes.

When used against other block sizes, this is the result:

```
  Block size | Max exact squish | Heuristic squish | Zero-block squish
     2048            5344              5344               4629
     1024            2861              2861               1757
      512            1900              1900               1101
      256              ?               1244                742
      128              ?               1113                539
       64              ?                965                578
       32              ?                628                345
       16              ?                387                187
        8              ?                224                109
        4              ?                 86                 40
        2              ?                  0                  0
```

## Randomized shuffling

In addition to educated guessing we can also try something even more enlightened - randomized shuffling. Alas, in an overnight test it didn't manage to improve upon the heuristic solutions for any of the block sizes.

## Summarized

Long story short, here's what block reshuffling gets us:

```
  Block size | Best squish |  Items  |  Index  |  Total
     2048         5344    ->  11040  +     32  =  11072
     1024         2861    ->   7379  +     64  =   7443
      512         1900    ->   4756  +    128  =   4884
      256         1244    ->   3364  +    256  =   3620    <- the original
      128         1113    ->   2471  +    512  =   2983
       64          965    ->   1851  +   1024  =   2875    <- the smallest
       32          628    ->   1132  +   2048  =   3180
       16          387    ->    669  +   4096  =   4765
        8          224    ->    376  +   8192  =   8568
        4           86    ->    226  +  16384  =  16610
        2            0    ->    162  +  32768  =  32930
```

We are now down to 2875 items for a Wine-style lookup table, which is 5750 bytes or ~ 30% smaller than the original.

## Separating the index

The table size can be reduced a bit more by noticing that we never have more than 256 unique index entries. This allows building the index as a uint8_t[] instead of a uint16_t[]. However, this requires using a secondary array, like so:

```c
// for the block size of 64
uint8_t   index[1024]  = { ... };
uint16_t  offsets[44]  = { ... };
uint16_t  items[1851]  = { ... };

uint16_t block_offset = offsets[ index[ ch >> 6 ] ];
ch = items[ block_offset + (ch & 0x3f) ];
```

Or, in the style of Wine's merged array:

```c
uint8_t   index[1024]      = { ... };
uint16_t  casemap[44+1851] = { ... };

ch = casemap[ casemap[ index[ ch >> 6 ] ] + (ch & 0x3f) ];
```
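Producing the 8-bit index and the offsets array from the original 16-bit index is a mechanical deduplication step. A build-time sketch, with made-up names rather than the actual generator code:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Splits a 16-bit index into an 8-bit index plus a table of unique offsets.
 * Relies on there being at most 256 distinct entries in the original index.
 */
static size_t split_index(const uint16_t *raw, size_t count,
                          uint8_t *index, uint16_t *offsets)
{
    size_t uniques = 0;

    for (size_t i = 0; i < count; i++)
    {
        size_t j = 0;

        while (j < uniques && offsets[j] != raw[i]) j++;   /* linear scan is fine for an offline step */
        if (j == uniques) offsets[uniques++] = raw[i];     /* a new, not-yet-seen offset */

        index[i] = (uint8_t)j;
    }
    return uniques;   /* how many offsets[] entries ended up being used */
}
```

For the block size of 64 this would report 44 unique offsets - which is where the `offsets[44]` above comes from.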
With this reduction in place, here are the new byte counts:

```
  Block size |  Index  |   Offsets + Items     | Total bytes
     2048         32   + (  8 + 11040 ) x 2    =   22128
     1024         64   + ( 10 +  7379 ) x 2    =   14842
      512        128   + ( 13 +  4756 ) x 2    =    9666
      256        256   + ( 18 +  3364 ) x 2    =    7020
      128        512   + ( 28 +  2471 ) x 2    =    5510
       64       1024   + ( 44 +  1851 ) x 2    =    4814
       32       2048   + ( 55 +  1132 ) x 2    =    4422    <- 4.3 KB
       16       4096   + ( 66 +   669 ) x 2    =    5566
        8       8192   + ( 75 +   376 ) x 2    =    9094
        4      16384   + ( 78 +   226 ) x 2    =   16992
        2      32768   + ( 81 +   162 ) x 2    =   33254
```

## Recap

We are now quite good at compressing large arrays filled with a relatively small number of unique values. We can also do it in two ways.

Option A - two 16-bit arrays and the retrieval function of `arr[ idx[ch >> ...] + (ch & 0x...) ]`

*[figure: Single index]*

2 bit ops, 1 addition, 2 memory references

Option B - two 16-bit arrays, one 8-bit array and the retrieval function of `arr[ off[ idx[ch >> ...] ] + (ch & 0x...) ]`

*[figure: Single index ex]*

2 bit ops, 1 addition, 3 memory references

The best compression of our original tolower lookup table is 5750 bytes with A and 4422 bytes with B.

But we can do better still.

## Taking care of index

As we've seen, smaller block sizes create larger indexes. We might've also noticed that these indexes have very few distinct entries. The index for the block size of 32 is 2048 entries with only 55 uniques, and the index for the block size of 4 is 16384 entries with only 78 uniques. Large arrays with massive redundancy. This looks ... familiar.

So let's compress them too! To make things a bit less confusing, we'll refer to the index of this new compression as the jndex and to its blocks as sequences.

For each block size we take its index and iterate through different sequence sizes, looking for the most compressible option. Here are the results:

```
  Index size | Uniques | ------------- Best compression -------------
             |         | Jndex size | Seq size | Unique seqs | Items
          32 |     8   |       8    |     4    |       5     |    16
          64 |    10   |       8    |     8    |       5     |    27
         128 |    13   |      16    |     8    |       6     |    35
         256 |    18   |      32    |     8    |       8     |    46
         512 |    28   |      32    |    16    |       8     |    88
        1024 |    44   |      64    |    16    |      10     |   119
        2048 |    55   |     128    |    16    |      13     |   151
        4096 |    66   |     128    |    32    |      13     |   299
        8192 |    75   |     256    |    32    |      18     |   424
       16384 |    78   |     256    |    64    |      18     |   843
       32768 |    81   |     512    |    64    |      28     |  1240
```

The Jndex size and Items columns give us the compressed index size. As expected, the index compresses really well. The next question is how to encode it.

## Three arrays

One way to encode the compressed index and our data table is an extension of Option A. This is simpler and faster to access, but it's a bit wasteful because all the arrays are uint16_t.

*[figure: Double index]*

4 bit ops, 2 additions, 3 memory references

The byte counts are as follows:

```
  Index size |  Index compressed    | Data compressed | Items | Bytes
             |  Jndex + Items       |      Items      | total |
          32 |     8  +   16 =   24 |      11040      | 11064 | 22128
          64 |     8  +   27 =   35 |       7379      |  7414 | 14828
         128 |    16  +   35 =   51 |       4756      |  4807 |  9614
         256 |    32  +   46 =   78 |       3364      |  3442 |  6884
         512 |    32  +   88 =  120 |       2471      |  2591 |  5182
        1024 |    64  +  119 =  183 |       1851      |  2034 |  4068
        2048 |   128  +  151 =  279 |       1132      |  1411 |  2822
        4096 |   128  +  299 =  427 |        669      |  1096 |  2192
        8192 |   256  +  424 =  680 |        376      |  1056 |  2112   <-- ha
       16384 |   256  +  843 = 1099 |        226      |  1325 |  2650
       32768 |   512  + 1240 = 1752 |        162      |  1914 |  3828
```

That is, this takes us down to 2112 bytes.
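For the 2112-byte row (block size of 8, sequences of 32 index entries), the retrieval could look roughly like this. The array names and the exact shifts below are assumptions derived from the sizes above, not a quote of the actual generated code:

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical layout matching the 8192-entry index row above */
extern const uint16_t jndex[256];    /* one entry per 256 characters, points into cindex[] */
extern const uint16_t cindex[424];   /* compressed index, entries point into cdata[]       */
extern const uint16_t cdata[376];    /* compressed table of case deltas                    */

wchar_t tolower_2112(wchar_t ch)
{
    /* jndex picks the 32-entry index sequence, cindex picks the 8-entry data block */
    uint16_t block = cindex[ jndex[ch >> 8] + ((ch >> 3) & 0x1f) ];

    return ch + cdata[ block + (ch & 0x07) ];
}
```

Four bit operations, two additions and three memory reads, in exchange for roughly a quarter of Wine's original footprint.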
## Five arrays

The second option is to penny-pinch bytes by using a separate offsets array, just like in Option B.

*[figure: Double index]*

4 bit ops, 2 additions, 5 memory references

This reduces the overall byte count as follows:

```
  Index size |    Index compressed     |  Data compressed   | Bytes
             | Jndex + Uniques + Items |                    |
          32 |    8  +    5   +    16  + (  8 + 11040) x 2  = 22125
          64 |    8  +    5   +    27  + ( 10 +  7379) x 2  = 14818
         128 |   16  +    6   +    35  + ( 13 +  4756) x 2  =  9595
         256 |   32  +    8   +    46  + ( 18 +  3364) x 2  =  6850
         512 |   32  +    8   +    88  + ( 28 +  2471) x 2  =  5126
        1024 |   64  +   10   +   119  + ( 44 +  1851) x 2  =  3983
        2048 |  128  +   13   +   151  + ( 55 +  1132) x 2  =  2666
        4096 |  128  + 13 x 2 +   299  + ( 66 +   669) x 2  =  1923
        8192 |  256  + 18 x 2 +   424  + ( 75 +   376) x 2  =  1618   <-- HA
       16384 |  256  + 18 x 2 +   843  + ( 78 +   226) x 2  =  1743
       32768 |  512  + 28 x 2 +  1240  + ( 81 +   162) x 2  =  2294
```

That is, with this trickery we are now down to just 1618 bytes.

## In other words

If we prefer to use Wine-style lookup code, we can reduce the table size from 8244 to 5750 bytes, a reduction of ~ 30%.

If we are OK with using an extra memory reference, we can shrink the table to 4422 bytes, a reduction of ~ 45%.

If we add two more bit operations and one addition, we can get the table down to 2112 bytes, a reduction of ~ 74%.

Finally, if we splurge on 2 more memory references, the table size can be reduced to 1618 bytes, a reduction of ~ 80%.

Pick your favourite.

## Conclusion

Very fast case conversion of (the vast majority of) Unicode characters can be done with a handful of CPU cycles and precomputed lookup tables between 1.6KB and 8.6KB in size.

The 8.6KB version, courtesy of Wine:

* casemap.c
* unicode.h

Recompressed versions of the tolower tables:

* 5.6KB - tolower_5750.c
* 4.3KB - tolower_4422.c
* 2.1KB - tolower_2112.c
* 1.6KB - tolower_1618.c

Recompressed versions of the toupper tables:

* 5.7KB - toupper_5866.c
* 4.5KB - toupper_4638.c
* 2.2KB - toupper_2258.c
* 1.7KB - toupper_1738.c