Posts by dalias@hachyderm.io
 (DIR) Post #AucW3SMjgd0obnkgBk by dalias@hachyderm.io
       0 likes, 0 repeats
       
        Doing Unicode normalization means knowing all the mappings of precomposed characters and how they break down into combining characters, as well as knowing the data about all the combining characters that affects how they do or don't reorder to a canonical order (so that, for example, ` above and then ~ below isn't treated differently from ~ below and then ` above).
        There aren't *that many* characters that need any special consideration, but they're spread sporadically across Unicode's 17 planes.
        In a naive representation, that easily amounts to many tens or more likely hundreds of kB of tables, to be embedded into any static-linked program that needs the functionality... 11/N
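        [Editor's sketch] A minimal illustration of the reordering rule just described, with a toy ccc() stub standing in for the real combining-class data (illustrative C, not musl's code): a mark bubbles left past marks of strictly higher class, which is what makes the two orderings above come out identical.

            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Toy stand-in for the real data: canonical combining class of
             * a code point, 0 for base/starter characters. A real
             * implementation reads this from compact tables derived from
             * UnicodeData.txt. */
            static unsigned ccc(uint32_t c)
            {
                switch (c) {
                case 0x0300: return 230; /* combining grave accent (above) */
                case 0x0330: return 220; /* combining tilde below          */
                default:     return 0;
                }
            }

            /* Stable in-place canonical reordering of an already-decomposed
             * sequence: a mark may move left past marks of strictly higher
             * class, so "` above, ~ below" and "~ below, ` above" end up in
             * the same order. Starters (class 0) block reordering. */
            static void canonical_order(uint32_t *s, size_t n)
            {
                for (size_t i = 1; i < n; i++) {
                    unsigned c = ccc(s[i]);
                    if (!c) continue;
                    size_t j = i;
                    while (j > 0 && ccc(s[j-1]) > c) {
                        uint32_t t = s[j-1]; s[j-1] = s[j]; s[j] = t;
                        j--;
                    }
                }
            }

            int main(void)
            {
                uint32_t a[] = { 'e', 0x0300, 0x0330 }; /* e, ` above, ~ below */
                canonical_order(a, 3);
                for (int i = 0; i < 3; i++) printf("U+%04X ", (unsigned)a[i]);
                putchar('\n');                 /* prints U+0065 U+0330 U+0300 */
                return 0;
            }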
       
 (DIR) Post #AucW3T8Eq10ez7ycEK by dalias@hachyderm.io
       0 likes, 0 repeats
       
        In an age where folks don't bat an eye at embedding a friggin outdated insecure fork of Chrome into every desktop application, hundreds of kB might not seem like a big deal. In some contexts it is, and in others it isn't.
        But what is a big deal is the perception, by folks who don't want that cost, that "support for all those other weird languages is a burden I don't want bloating my system". It feeds a fascist-adjacent, xenophobic narrative that blames cultural "others" for "my software is bloated".
        And from the beginning, my goal with #musl has been that this never be a tradeoff. That multilingual stuff just work, basically just as efficiently as your old ASCII-only or 8-bit-codepage stuff did. 12/N
       
 (DIR) Post #AucW3Tt222RLKFrzAO by dalias@hachyderm.io
       0 likes, 0 repeats
       
        So far, I've gotten the tables needed for Unicode normalization to NFD (decomposed form) from an initial estimate of about 30k (already *really small* compared to how this is usually done) down to an estimate of about 10k. This was through analyzing the data to be represented and adapting the data structures, building on the concepts used for other Unicode data in #musl. 13/N
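        [Editor's sketch] For a sense of how that kind of shrinking usually works, a generic two-stage table (not musl's actual layout; the block size and the hand-filled entries for U+0300, U+0301, and U+0330 are made up for the demo): blocks of the code space containing nothing interesting all share one zero block, so the sparse data costs almost nothing per empty block.

            #include <stdint.h>
            #include <stdio.h>

            #define BLKSZ 128
            #define NBLK  (0x110000 / BLKSZ)

            /* Data blocks: [0] is the shared all-zero block, [1] covers the
             * U+0300.. range with a few example combining classes. Real
             * tables would be generated from UnicodeData.txt. */
            static const uint8_t blocks[][BLKSZ] = {
                [0] = {0},
                [1] = { [0x00] = 230,      /* U+0300 combining grave accent */
                        [0x01] = 230,      /* U+0301 combining acute accent */
                        [0x30] = 220 },    /* U+0330 combining tilde below  */
            };

            /* First stage: which data block covers each 128-character block.
             * Everything defaults to the shared zero block. */
            static const uint16_t block_index[NBLK] = { [0x300 / BLKSZ] = 1 };

            static uint8_t ccc(uint32_t c)
            {
                if (c >= 0x110000) return 0;
                return blocks[block_index[c / BLKSZ]][c % BLKSZ];
            }

            int main(void)
            {
                printf("%d %d %d\n", ccc(0x41), ccc(0x300), ccc(0x330)); /* 0 230 220 */
                return 0;
            }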
       
 (DIR) Post #AucW3Ujqregty4aAUq by dalias@hachyderm.io
       0 likes, 0 repeats
       
        Now, 10k is bonkers small. I'll be very happy with that if it works out as planned. Which it should, since the "estimates" were more like calculations, using a bunch of grep, cut, sed, tr, sort, uniq, and wc on UnicodeData.txt. 😁
        Sadly, normalization in libc is needed only for collation and (out-of-scope for this project, but pending) IDN DNS lookup support. There's no direct interface to the functionality; it only has indirect effects.
        But there's no reason the code can't be repurposed for other uses outside libc. 14/N
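        [Editor's sketch] Roughly the kind of counting involved, written as standalone C rather than the shell pipeline (per the UnicodeData.txt format: field 3 is the canonical combining class, field 5 the decomposition mapping, canonical when it has no <tag> prefix; Hangul syllables decompose algorithmically and don't appear in these counts):

            #include <stdio.h>
            #include <string.h>

            /* Count the characters that matter for NFD, straight from
             * UnicodeData.txt (pass its path as argv[1]). A C analogue of
             * the cut/grep/wc counting described above. */
            int main(int argc, char **argv)
            {
                if (argc < 2) return 1;
                FILE *f = fopen(argv[1], "r");
                if (!f) return 1;

                char line[1024];
                long canon_decomp = 0, nonzero_ccc = 0;
                while (fgets(line, sizeof line, f)) {
                    char *field[8] = {0}, *p = line;
                    int n = 0;
                    while (n < 8) {        /* split on ';', keeping empty fields */
                        field[n++] = p;
                        char *semi = strchr(p, ';');
                        if (!semi) break;
                        *semi = 0;
                        p = semi + 1;
                    }
                    if (n > 3 && strcmp(field[3], "0")) nonzero_ccc++;
                    if (n > 5 && field[5][0] && field[5][0] != '<') canon_decomp++;
                }
                fclose(f);
                printf("canonical decompositions: %ld\n", canon_decomp);
                printf("nonzero combining class:  %ld\n", nonzero_ccc);
                return 0;
            }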
       
 (DIR) Post #AucW3VTE8wzGEnoPDs by dalias@hachyderm.io
       0 likes, 0 repeats
       
        My rather arbitrary size demands aside, doing Unicode normalization where it's needed in libc (for collation) is already particularly challenging because you don't have any working space to normalize into, while normalization can perform reordering across arbitrarily long sequences of characters - think the cursed gigantic combining character stacks you'd see folks post on aviform social media, but scaled arbitrarily more hideous.
        The solution is of course having normalization be a sort of iterator, but even that is somewhat nontrivial if you want to avoid quadratic time in the length of one of those evil sequences. 15/N
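        [Editor's sketch] To make the tradeoff concrete, here's a toy of the obvious constant-space approach for the reordering part alone (decomposition omitted, toy combining-class data): each output step rescans the run for the next mark in stable order, so space is O(1) but time is quadratic in the run length, exactly the cost the post says needs avoiding.

            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Toy stand-in for the real combining-class data. */
            static unsigned ccc(uint32_t c)
            {
                switch (c) {
                case 0x0301: return 230; /* acute above */
                case 0x0307: return 230; /* dot above   */
                case 0x0323: return 220; /* dot below   */
                default:     return 0;
                }
            }

            /* Emit one combining run (marks only, all nonzero class) in
             * canonical order using O(1) extra state: each step rescans the
             * run for the next mark in stable (class, position) order. */
            static void emit_sorted(const uint32_t *s, size_t n)
            {
                unsigned last_c = 0;         /* class of last emitted mark */
                size_t last_i = (size_t)-1;  /* its index, for stability   */
                for (size_t out = 0; out < n; out++) {
                    unsigned best_c = 256;
                    size_t best_i = n;
                    for (size_t i = 0; i < n; i++) {
                        unsigned c = ccc(s[i]);
                        /* skip what's already out: lower class, or same
                         * class at an index not after the last emitted */
                        if (c < last_c || (c == last_c && i <= last_i)) continue;
                        if (c < best_c) { best_c = c; best_i = i; }
                    }
                    if (best_i == n) break;
                    printf("U+%04X ", (unsigned)s[best_i]);
                    last_c = best_c;
                    last_i = best_i;
                }
                putchar('\n');
            }

            int main(void)
            {
                uint32_t run[] = { 0x0301, 0x0323, 0x0307 };
                emit_sorted(run, 3);   /* prints: U+0323 U+0301 U+0307 */
                return 0;
            }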
       
 (DIR) Post #AucW3W8hekAEJRDWs4 by dalias@hachyderm.io
       0 likes, 0 repeats
       
        Oddly, I'm not aware of glibc doing any Unicode normalization for collation, and I don't think it does.
        I initially explored whether it could be skipped by instead expanding the collation element tables to cover both precomposed and decomposed forms (UTS#10 calls this a "canonical closure" of the collation), but if you want to meet the requirement to support all non-canonical forms, not just a couple normalizations, the closure would be infinite.
        So I'm assuming glibc, implementing the POSIX localedef version of collation (vs the Unicode one) and relying on Unicode tables being translated to localedef form, just punts on (or hasn't even considered) UTS#10 conformance. 16/N
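        [Editor's sketch] A concrete pair showing why a finite closure can't cover everything (illustrative C, assuming a UTF-8 execution character set; whether the call actually returns 0 depends on the libc, which is exactly the conformance gap being discussed): the two strings differ only in mark order, are canonically equivalent, and involve no precomposed character at all, so no table of precomposed/decomposed pairs can account for them.

            #include <locale.h>
            #include <stdio.h>
            #include <string.h>

            int main(void)
            {
                /* "q" + dot below + dot above, in both possible mark orders.
                 * Dot below has class 220, dot above 230, so the sequences
                 * are canonically equivalent, and no precomposed character
                 * covers this combination. */
                const char *a = "q\u0323\u0307";
                const char *b = "q\u0307\u0323";

                setlocale(LC_COLLATE, "");
                /* 0 if the collation honors canonical equivalence */
                printf("strcoll: %d\n", strcoll(a, b));
                return 0;
            }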
       
 (DIR) Post #AucW3Wqf1JKGVlmdO4 by dalias@hachyderm.io
       0 likes, 0 repeats
       
        When I made the grant proposal, I spent some time justifying working directly with the Unicode form of the data rather than going through localedef, but at the time it didn't even occur to me that the localedef form (at least without supplementing it with explicit normalization) might not even be able to represent the full behavior needed. 🙃 17/N
       
 (DIR) Post #AucW3Xitlei9Dz9wvY by dalias@hachyderm.io
       0 likes, 0 repeats
       
        So at this point it looks like we may end up with several pretty novel things:
        1. Full UTS#10 collation behavior (honoring canonical equivalence) in a libc LC_COLLATE implementation.
        2. A constant-space, linear-time algorithm for normalization to NFD. (This can't really be novel, but I don't think it's widespread or widely known that it's nontrivial.)
        3. An efficient way of representing the part of UnicodeData.txt needed for normalization.
        More on this to follow... 18/N
       
 (DIR) Post #AucW3YWAoS7tgoDIjQ by dalias@hachyderm.io
       0 likes, 0 repeats
       
        So, what got the data for normalization down to around 30k, then around 10k, and made for efficient processing, was combining the two types of data needed - characters that decompose, and "canonical combining classes" for characters that compose - into one big table.
        (For the sake of not being too technical, it suffices to say the latter group combining marks by "where they stack" and control how they participate in reordering.)
        This gets us all the data we need for processing an input character from one lookup, and avoids having redundant multi-level table overhead for both things. 19/N
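        [Editor's sketch] One hypothetical shape for such a combined record (purely illustrative; the flag bit, field widths, and tiny hand-filled tables are made up, not the real encoding): a single lookup per input character yields either a combining class or an index into a decomposition list.

            #include <stdint.h>
            #include <stdio.h>

            /* Hypothetical combined per-character record: a 16-bit value.
             * The high bit distinguishes "has a canonical decomposition,
             * low bits index the decomposition list" from "combining mark,
             * low bits are its class" (0 = nothing special). */
            #define R_DECOMP  0x8000u
            #define R_IDX(r)  ((r) & 0x7fffu)
            #define R_CCC(r)  ((r) & 0x00ffu)

            /* Tiny hand-filled stand-ins for the generated tables. */
            static const uint32_t decomp[][3] = {
                { 0x0065, 0x0301, 0 },     /* U+00E9 e-acute -> e + acute */
            };

            static uint16_t charrec(uint32_t c)
            {
                switch (c) {
                case 0x00E9: return R_DECOMP | 0;  /* decomposes, entry 0 */
                case 0x0301: return 230;           /* combining mark      */
                default:     return 0;             /* nothing special     */
                }
            }

            int main(void)
            {
                uint32_t in[] = { 0x0041, 0x00E9, 0x0301 };
                for (int i = 0; i < 3; i++) {
                    uint16_t r = charrec(in[i]);
                    if (r & R_DECOMP)
                        printf("U+%04X -> U+%04X U+%04X\n", (unsigned)in[i],
                               (unsigned)decomp[R_IDX(r)][0],
                               (unsigned)decomp[R_IDX(r)][1]);
                    else
                        printf("U+%04X ccc=%u\n", (unsigned)in[i], R_CCC(r));
                }
                return 0;
            }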
       
 (DIR) Post #AucW3ZFu4Qhpydbp0i by dalias@hachyderm.io
       0 likes, 0 repeats
       
        OK, so back to the larger project.
        Aside from LC_COLLATE, the other new functionality to be added is LC_NUMERIC and LC_MONETARY. There are no big complex interfaces or data formats involved here, just finding a good way to put the struct lconv data (what localeconv() returns) into the locale files and make it available at runtime.
        And that's what the main big deal of the rest of the project is: locale data representation. Because we (I) botched it before, and need to get it right this time. 20/N
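        [Editor's sketch] For reference, and as a reminder that this part is plain standard C, the data in question is what localeconv() hands back, so the locale files just need to carry per-locale values for fields like these:

            #include <locale.h>
            #include <stdio.h>

            int main(void)
            {
                setlocale(LC_ALL, "");        /* pick up the environment's locale */
                struct lconv *lc = localeconv();

                /* LC_NUMERIC part */
                printf("decimal_point:   \"%s\"\n", lc->decimal_point);
                printf("thousands_sep:   \"%s\"\n", lc->thousands_sep);

                /* LC_MONETARY part (a few of the fields) */
                printf("currency_symbol: \"%s\"\n", lc->currency_symbol);
                printf("int_curr_symbol: \"%s\"\n", lc->int_curr_symbol);
                printf("frac_digits:     %d\n", (int)lc->frac_digits);
                return 0;
            }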
       
 (DIR) Post #AucW3ZufcrJe14gNYO by dalias@hachyderm.io
       0 likes, 0 repeats
       
        In particular, the non-working LC_TIME functionality currently in musl uses the gettext model, where English strings (or C-locale format strings, in the case of strftime formats) are used as keys to look up the translated string.
        I'd always considered this a bad idea, but against my better judgement went with it, because it avoided having to make decisions about a format and seemed to be what translators like.
        Unfortunately, nobody realized that "abbreviation for the 5th month" and "full name for the 5th month" share the same key in English, making it impossible to accurately translate them both. 🤦🤦🤦 21/N
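        [Editor's sketch] The collision is easy to see with the standard POSIX nl_langinfo() items (nothing musl-specific here): in English both of these come back as "May", so a lookup keyed on the English string can't tell which one a translation belongs to.

            #include <langinfo.h>
            #include <locale.h>
            #include <stdio.h>

            int main(void)
            {
                setlocale(LC_TIME, "");
                /* Full and abbreviated names of month 5: both "May" in English. */
                printf("MON_5   = \"%s\"\n", nl_langinfo(MON_5));
                printf("ABMON_5 = \"%s\"\n", nl_langinfo(ABMON_5));
                return 0;
            }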
       
 (DIR) Post #AucW3acczQTgDPFU4O by dalias@hachyderm.io
       0 likes, 0 repeats
       
        So, the big non-technical part of this project is arriving at some consensus on what the right way to key these lookups is, and on how to make the other data (collation tables, numeric/monetary properties, etc.) fit into the on-disk mappable file format used.
        This is where it's going to be great having funded help from a project, @postmarketOS, that intends to actually use the outcome in the immediate future and is helping out on coordination with other distros & stakeholders, so that we end up with something that both works and that everyone is happy with. 22/N
       
 (DIR) Post #Aw2aD9ygPAOsCbrDLE by dalias@hachyderm.io
       0 likes, 0 repeats
       
        @2something @bitwarden @keepassxc @cmccullough We need a hall of shame site for all the "FOSS" projects welcoming unlicensed, unlicenseable, unknown-provenance slop.
       
 (DIR) Post #AwQuHjKdYviNt1cZFI by dalias@hachyderm.io
       0 likes, 0 repeats
       
        PSA: On stock mobile Firefox, you can bypass Mozilla's lockout of about:config using the alternate URL:
        chrome://geckoview/content/config.xhtml
        Note that "chrome" here has nothing to do with the Google browser. Google stole the name from what was previously the Mozilla-internal name for "browser UI components".
       
 (DIR) Post #AxEgnd4YxOWquq2JdI by dalias@hachyderm.io
       0 likes, 0 repeats
       
       @Nazani @futurebird It can't save bandwidth. It might save storage. But the cost in compute would be way larger than the cost of the storage.
       
 (DIR) Post #AxEgndjgUVQEyNH9jE by dalias@hachyderm.io
       0 likes, 0 repeats
       
        @Nazani @futurebird The only way "AI compression" could be profitable in storage & bandwidth to the video provider platform is if they could offload all the compute onto the user's client device. This isn't happening, because it'd reduce battery life to a few minutes, and the web won't allow it anyway (having that much site-specific model data, access to "AI"-scale compute, etc.).
       
 (DIR) Post #Ay3UKgP8CVeYSl2o1Q by dalias@hachyderm.io
       0 likes, 0 repeats
       
        @Kierkegaanks @Tutanota I've seen multiple claims to this effect with no evidence.
        The proposals as stated have been requiring providers of e2ee messaging to do the backdoors, which would basically mean the corporate shit platforms get backdoored while the real stuff first fights the law, then ignores it if they don't win, possibly getting removed from app stores and relegated to sideloading.
        Nobody wants to be the party doing the backdooring. Apple is not going to step up to do it at the OS level, and ruin their reputation, when they can say "the regulation puts the burden on chat providers, not us" (except iMessage, which they'd probably backdoor).
       
 (DIR) Post #Ay9sZ7dtI57EqJ7qRk by dalias@hachyderm.io
       0 likes, 0 repeats
       
       @wonkothesane @micahflee Any option is better. ProtonMail dude is fash, can't be trusted to protect users. Mail is not e2ee and there are no technical measures that can prevent them from betraying you, only trust.
       
 (DIR) Post #AzDq76hUsfKF55Xoqu by dalias@hachyderm.io
       0 likes, 0 repeats
       
       @dnkrupinski @pixelfed I'm seeing the same.
       
 (DIR) Post #B10knXBBn3qyBgpea0 by dalias@hachyderm.io
       0 likes, 0 repeats
       
        @m3tti @RussSharek The assumption is that their service is supposed to outlive whatever personal system your cronjob would be queued on.
        Meanwhile they'll be out of business 20 years later and my uptime command will be printing "up 7300 days, ..."