hngopher.com

       [HN Gopher] Unicode Utilities: Confusables
       ___________________________________________________________________
        
       Unicode Utilities: Confusables
        
       Author : simonpure
       Score  : 46 points
       Date   : 2022-08-17 15:09 UTC (3 days ago)
        
 (HTM) web link (util.unicode.org)
 (TXT) w3m dump (util.unicode.org)
        
       | gcau wrote:
       | If it's of interest, I'm the author of the JavaScript/npm package
       | 'confusables', which does the same thing (and also in the
       | reverse).
       | 
       | Source: https://github.com/gc/confusables
       | Demo: https://confusables.gc.codes/
        
       | danbruc wrote:
       | _Total raw values: 148,068,974,592,000
       | 
       | Too many raw items to process._
       | 
       | Any good guesses what this is about?
       | 
       | EDIT: Got it, the number of strings confusable with the input.
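A rough sketch of why that count grows so quickly: if each character can be swapped independently for any of its lookalikes, the number of confusable strings is the product of the per-character alternatives. The `LOOKALIKES` table and the function name below are hypothetical and only illustrative; the real Unicode confusables data is much larger, and the tool's exact counting rules are not shown in the thread.

```python
from math import prod

# Hypothetical, tiny lookalike table (each set includes the character itself).
# This is NOT the real Unicode confusables data, just enough for illustration.
LOOKALIKES = {
    "a": {"a", "\u0430"},                  # Latin a, Cyrillic a
    "o": {"o", "\u043e", "\u03bf", "0"},   # Latin o, Cyrillic o, Greek omicron, digit zero
    "p": {"p", "\u0440"},                  # Latin p, Cyrillic er
}

def count_confusable_strings(text: str) -> int:
    """Count strings reachable by swapping each character for one of its lookalikes."""
    # Characters missing from the table contribute a factor of 1 (only themselves).
    return prod(len(LOOKALIKES.get(ch, {ch})) for ch in text)

print(count_confusable_strings("oops"))  # 4 * 4 * 2 * 1 = 32
```

With only a handful of alternatives per character, a string of a few dozen characters already reaches numbers in the trillions, which is presumably why the tool refuses to enumerate them.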
        
       | jancsika wrote:
       | Too bad there isn't a standard lib that will flag any non-ASCII
       | characters to be rendered big like those oversized sprites in the
       | "giant" level of Super Mario 3.
       | 
       | So you scroll through source code and hit this weirdly spaced
       | line with a big "h" character and go-- oh-- that's some buggy
       | crap in there.
       | 
       | Then maybe have a map for ranges that you want to include in your
       | allowable set, and we're all good to go. :)
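The allowlist idea could look something like the following minimal sketch. It only reports the offending characters rather than rendering them oversized, and the names `ALLOWED_RANGES`, `is_allowed` and `flag_line` are made up for this example; the chosen range is an arbitrary assumption.

```python
# Hypothetical allowlist: inclusive code point ranges permitted in addition
# to ASCII. Here: the upper half of the Latin-1 Supplement block.
ALLOWED_RANGES = [(0x00C0, 0x00FF)]

def is_allowed(ch: str) -> bool:
    """ASCII is always allowed; anything else must fall in an allowed range."""
    cp = ord(ch)
    if cp < 0x80:
        return True
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)

def flag_line(line: str) -> list:
    """Return (column, 'U+XXXX') for every character outside the allowed set."""
    return [(col, f"U+{ord(ch):04X}")
            for col, ch in enumerate(line)
            if not is_allowed(ch)]

# The e-acute is allowlisted; the Cyrillic letter that resembles "h" is flagged.
print(flag_line("caf\u00e9 = \u04bbello"))  # [(7, 'U+04BB')]
```

An editor plugin or linter could feed each source line through a check like this and decide for itself how loudly to display the hits.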
        
         | jabbany wrote:
         | I think the rust compiler kind of does this:
         | https://mobile.twitter.com/hmemcpy/status/115189096847766732...
        
           | jancsika wrote:
           | Ooh, good job, Rust!
        
           | jeroenhd wrote:
           | Only a partial set, though:
           | https://github.com/rust-lang/rust/blob/master/compiler/rustc...
        
         | II2II wrote:
         | Better yet, highlight characters out of context. Not everyone
         | writes in languages that are fully representable in ASCII, yet
         | confusable characters are still an issue.
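One way to approximate "out of context" is to flag words that mix scripts. Python's standard library does not expose the Unicode Script property, so this sketch falls back on the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`). That is a crude heuristic, and the function names are hypothetical.

```python
import unicodedata

def scripts_in(word: str) -> set:
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def mixed_script_words(text: str) -> list:
    """Return the whitespace-separated words whose letters span several scripts."""
    return [w for w in text.split() if len(scripts_in(w)) > 1]

# An all-Cyrillic word, a word with one Cyrillic letter hidden among Latin
# ones, and an all-Latin word. Only the middle one is reported.
sample = "\u043f\u0440\u0438\u0432\u0435\u0442 p\u0430ypal paypal"
print(mixed_script_words(sample))
```

Text written entirely in a non-Latin script passes untouched, while a single foreign lookalike inside an otherwise Latin word stands out.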
        
       | hilbert42 wrote:
       | Potentially this is a big problem. Especially with OCR,
       | transliterations from and across different languages, characters
       | missing their diacriticals, etc.
       | 
       | Whilst I was aware of the problem, I wasn't aware that it's as big
       | as it is. Those of us who use Latin scripts are reasonably
       | familiar with the common ones such as _o, O_ and _0_ [lowercase
       | and uppercase letter o, and zero], but there are some tricky ones
       | even in these Latin scripts that many of us get wrong.
       | 
       | As I discovered a while back - but I can't remember how - most of
       | us get the World War Two abbreviation _WWII_ wrong (myself
       | included). The _'II'_ is not two alpha characters as we almost
       | inevitably use but rather it should be the Unicode characters
       | for Roman numerals. Even then I cannot remember if the correct
       | transliteration for Arabic numeral 2 is supposed to be Roman
       | numeral '1' used twice/repeated or if Roman numeral '2' actually
       | has its own Roman numeral glyph (I suspect the latter is
       | correct). There are many more instances like this too, the dash,
       | minus sign, en and em dash for instance.
       | 
       | Yes, I could look them up but it's a nuisance to do so on-the-fly
       | and that's the whole point/trouble.
       | 
       | It seems to me we need much better proofing tools that would flag
       | errors or potential errors. I reckon we've been very poorly
       | served in this regard in that there are no simple software tools
       | available of the quality we need.
       | 
       | It's not only symbols or characters we need to correct but also
       | typos that don't show up on spelling checkers such as _for_ and
       | _fro_ and the big troublemakers _it's_ and _its._ Spelling and
       | grammar checkers should automatically highlight or flag such
       | words whether their usage is correct or not.
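The Roman numeral and dash questions can be checked on the fly with Python's standard `unicodedata` module. Unicode does have a dedicated ROMAN NUMERAL TWO character (U+2161), and compatibility normalization (NFKC) folds it back into two ordinary capital letters. The snippet is only a lookup aid, not a proofing tool.

```python
import unicodedata

# Print the official names of the Roman numeral two and several dash-like characters.
for ch in ["\u2161", "-", "\u2212", "\u2013", "\u2014"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFKC replaces the single Roman numeral character with its compatibility
# decomposition, which is two Latin capital letter I characters.
roman_two = "\u2161"
folded = unicodedata.normalize("NFKC", "WW" + roman_two)
print(folded == "WWII")  # True
```

A proofing tool could compare a string with its NFKC form, or look up character names like this, to report which of several similar-looking characters was actually typed.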
        
       | jonstewart wrote:
       | Confusables are used by attackers to make malicious domains and
       | URLs appear innocuous. If you show such things to users, it's
       | good to highlight confusables.
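One simple form of highlighting, sketched here with Python's built-in `idna` codec, is to show the ASCII (Punycode) form next to any hostname that contains non-ASCII characters. The function name `display_host` is hypothetical, and the codec may raise `UnicodeError` on labels it considers invalid, so real code would need error handling.

```python
def display_host(host: str) -> str:
    """Return the hostname, followed by its ASCII form if it contains non-ASCII."""
    if host.isascii():
        return host
    # The built-in "idna" codec converts each non-ASCII label to "xn--..." form.
    ascii_form = host.encode("idna").decode("ascii")
    return f"{host} [{ascii_form}]"

print(display_host("example.com"))       # unchanged
print(display_host("\u0430pple.com"))    # Cyrillic first letter, shown with its xn-- form
```

The visible `xn--` prefix gives the user a cue that the name is not the plain ASCII domain it resembles.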
        
       ___________________________________________________________________
       (page generated 2022-08-20 23:00 UTC)