[HN Gopher] Unicode Utilities: Confusables
___________________________________________________________________
Unicode Utilities: Confusables
Author : simonpure
Score : 46 points
Date : 2022-08-17 15:09 UTC (3 days ago)
(HTM) web link (util.unicode.org)
(TXT) w3m dump (util.unicode.org)
| gcau wrote:
| If it's of interest, I'm the author of the JavaScript/npm package
| 'confusables', which does the same thing (and also in the
| reverse)
|
| Source: https://github.com/gc/confusables Demo:
| https://confusables.gc.codes/
| danbruc wrote:
| _Total raw values: 148,068,974,592,000
|
| Too many raw items to process._
|
| Any good guesses what this is about?
|
| EDIT: Got it, the number of strings confusable with the input.
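|
| A rough sketch of where a count that large can come from: if every
| position in the input can be swapped independently, the total is the
| product of the per-character look-alike counts. The counts in
| ALTERNATIVES are made up for illustration and are not taken from the
| Unicode data or from the tool.

```python
from math import prod

# Hypothetical number of look-alike choices per character, counting the
# character itself. Real counts would come from Unicode's confusables data.
ALTERNATIVES = {"p": 4, "a": 6, "y": 5, "l": 12}


def count_confusable_strings(text, alternatives=ALTERNATIVES):
    """Multiply the number of choices available at each position.

    Characters missing from the table are assumed to have exactly one
    form (themselves), so they do not change the product.
    """
    return prod(alternatives.get(ch, 1) for ch in text)


print(count_confusable_strings("paypal"))  # 4 * 6 * 5 * 4 * 6 * 12 = 34560
```

| The product grows multiplicatively with input length, which is
| presumably why the tool stops enumerating past some size.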
| jancsika wrote:
| Too bad there isn't a standard lib that will flag any non-ASCII
| characters to be rendered big like those oversized sprites in the
| "giant" level of Super Mario 3.
|
| So you scroll through source code and hit this weirdly spaced
| line with a big "h" character and go-- oh-- that's some buggy
| crap in there.
|
| Then maybe have a map for ranges that you want to include in your
| allowable set, and we're all good to go. :)
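|
| A minimal sketch of the "flag anything non-ASCII unless it is in an
| allowed range" idea, as a plain checker rather than a renderer. The
| function name and the ALLOWED_RANGES table are hypothetical; the single
| range shown (Latin-1 Supplement) is only an example.

```python
# Hypothetical allow-list: inclusive code point ranges accepted in
# addition to ASCII. Here only the Latin-1 Supplement block is allowed.
ALLOWED_RANGES = [(0x00A0, 0x00FF)]


def flag_unexpected(line, allowed=ALLOWED_RANGES):
    """Return (column, character, code point label) for every character
    that is neither ASCII nor inside one of the allowed ranges."""
    hits = []
    for col, ch in enumerate(line):
        cp = ord(ch)
        if cp < 0x80:
            continue  # plain ASCII is always accepted
        if any(lo <= cp <= hi for lo, hi in allowed):
            continue  # explicitly allowed by the range map
        hits.append((col, ch, f"U+{cp:04X}"))
    return hits


# U+04BB (CYRILLIC SMALL LETTER SHHA) looks much like a Latin "h".
print(flag_unexpected("if \u04bb == 1:"))
```

| An editor or linter could use the returned columns to draw the
| oversized glyph, or just to underline the character.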
| jabbany wrote:
| I think the rust compiler kind of does this:
| https://mobile.twitter.com/hmemcpy/status/115189096847766732...
| jancsika wrote:
| Ooh, good job, Rust!
| jeroenhd wrote:
| Only a partial set, though:
| https://github.com/rust-lang/rust/blob/master/compiler/rustc...
| II2II wrote:
| Better yet, highlight characters out of context. Not everyone
| writes in languages that are fully representable in ASCII, yet
| confusable characters are still an issue.
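|
| One way to approximate "out of context" is to flag letters whose script
| differs from the rest of the word. The sketch below guesses the script
| from the first word of the character's Unicode name, which is a crude
| heuristic and not the real Script property (Python's standard library
| does not expose that). The function names are made up for this example.

```python
import unicodedata


def rough_script(ch):
    """Guess a script from the first word of the Unicode character name,
    e.g. 'LATIN' or 'CYRILLIC'. Heuristic only."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def out_of_context(word):
    """Return the letters whose rough script differs from the most
    common rough script in the word."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return []
    scripts = [rough_script(ch) for ch in letters]
    majority = max(set(scripts), key=scripts.count)
    return [ch for ch, s in zip(letters, scripts) if s != majority]


# The second letter is U+0430 (Cyrillic), the rest are Latin.
print(out_of_context("p\u0430ypal"))
# A word written entirely in Cyrillic is not flagged.
print(out_of_context("\u043f\u0440\u0438\u0432\u0435\u0442"))
```

| Words with an even split between two scripts have no clear majority,
| so a real tool would need a better rule for those.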
| hilbert42 wrote:
| Potentially this is a big problem, especially with OCR,
| transliterations from and across different languages, characters
| missing their diacriticals, etc.
|
| Whilst I was aware of the problem I wasn't aware that it's as big
| as it is. Those of us who use Latin scripts are reasonably
| familiar with the common ones such as _o, O_ and _0_ [l/c, u/c
| alpha & zero], but there are some tricky ones even in these Latin
| scripts that many of us get wrong.
|
| As I discovered a while back - but I can't remember how - most of
| us get the World War Two abbreviation _WWII_ wrong (myself
| included). The _'II'_ is not two alpha characters as we almost
| inevitably use but rather it should be the Unicode characters
| for Roman numerals. Even then I cannot remember if the correct
| transliteration for Arabic numeral 2 is supposed to be Roman
| numeral '1' used twice/repeated or if Roman numeral '2' actually
| has its own Roman numerical glyph (I suspect the latter is
| correct). There are many more instances like this too, the dash,
| minus sign, en and em dash for instance.
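|
| For looking these up on the fly, Python's unicodedata module can name
| the characters involved. The sketch below shows that Unicode does have
| a single code point for Roman numeral two (U+2161), that NFKC
| compatibility normalization folds it back to two Latin capital letters,
| and how the hyphen-minus, minus sign, en dash and em dash differ. The
| describe() helper is just a name picked for this example.

```python
import unicodedata


def describe(text):
    """Return (code point label, Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in text]


# A dedicated code point for the Roman numeral two.
print(describe("\u2161"))

# NFKC normalization replaces it with two ASCII letters "I".
print(unicodedata.normalize("NFKC", "WW\u2161") == "WWII")

# Four easily confused horizontal strokes.
for label, name in describe("-\u2212\u2013\u2014"):
    print(label, name)
```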
|
| Yes, I could look them up but it's a nuisance to do so on-the-fly
| and that's the whole point/trouble.
|
| It seems to me we need much better proofing tools that would flag
| errors or potential errors. I reckon we've been very poorly
| served in this regard in that there are no simple software tools
| available of the quality we need.
|
| It's not only symbols or characters we need to correct but also
| typos that don't show up on spelling checkers such as _for_ and
| _fro_ and the big troublemakers _it's_ and _its_. Spelling and
| grammar checkers should automatically highlight or flag such
| words whether their usage is correct or not.
| jonstewart wrote:
| Confusables are used by attackers to make malicious domains and
| URLs appear innocuous. If you show such things to users, it's
| good to highlight confusables.
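|
| A minimal sketch of both ideas: fold look-alikes to a common form to
| detect an impersonating domain, and mark them up for display. The
| LOOKALIKES table is a tiny hand-picked Cyrillic-to-Latin subset for
| illustration; a real tool would load the full Unicode confusables data,
| and this simplified skeleton() is not the full algorithm from the
| Unicode security specification. All function names are hypothetical.

```python
# Tiny illustrative subset of look-alike mappings (Cyrillic -> Latin).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(text):
    """Replace each known look-alike with its Latin counterpart."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def impersonates(candidate, trusted):
    """True if candidate differs from trusted but folds to the same form."""
    return candidate != trusted and skeleton(candidate) == skeleton(trusted)


def highlight(text):
    """Wrap each known look-alike in brackets with its code point."""
    return "".join(
        f"[{ch}|U+{ord(ch):04X}]" if ch in LOOKALIKES else ch for ch in text
    )


fake = "p\u0430ypal.com"  # second letter is Cyrillic
print(impersonates(fake, "paypal.com"))
print(highlight(fake))
```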
___________________________________________________________________
(page generated 2022-08-20 23:00 UTC)