[HN Gopher] Bitten by Unicode
       ___________________________________________________________________
        
       Bitten by Unicode
        
       Author : pryelluw
       Score  : 117 points
       Date   : 2024-09-09 02:38 UTC (20 hours ago)
        
 (HTM) web link (pyatl.dev)
 (TXT) w3m dump (pyatl.dev)
        
       | ks2048 wrote:
       | That's only the tip of the iceberg of hyphen-looking characters.
       | 
        | Here's some more:
        | 
        |     2010 ; 002D ; MA  #* HYPHEN → HYPHEN-MINUS
        |     2011 ; 002D ; MA  #* NON-BREAKING HYPHEN → HYPHEN-MINUS
        |     2012 ; 002D ; MA  #* FIGURE DASH → HYPHEN-MINUS
        |     2013 ; 002D ; MA  #* EN DASH → HYPHEN-MINUS
        |     FE58 ; 002D ; MA  #* SMALL EM DASH → HYPHEN-MINUS
        |     06D4 ; 002D ; MA  #* ARABIC FULL STOP → HYPHEN-MINUS
        |     2043 ; 002D ; MA  #* HYPHEN BULLET → HYPHEN-MINUS
        |     02D7 ; 002D ; MA  #* MODIFIER LETTER MINUS SIGN → HYPHEN-MINUS
        |     2212 ; 002D ; MA  #* MINUS SIGN → HYPHEN-MINUS
        |     2796 ; 002D ; MA  #* HEAVY MINUS SIGN → HYPHEN-MINUS
        |     2CBA ; 002D ; MA  #  COPTIC CAPITAL LETTER DIALECT-P NI → HYPHEN-MINUS
       | 
       | copied from
       | https://www.unicode.org/Public/security/8.0.0/confusables.tx...
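        | 
        | For illustration, a minimal Python sketch that folds just the
        | lookalikes listed above back to ASCII (an illustrative subset,
        | not the full confusables table):
        | 
        |     # Code points listed above as confusable with U+002D
        |     # HYPHEN-MINUS, per confusables.txt.
        |     LOOKALIKES = {
        |         "\u2010", "\u2011", "\u2012", "\u2013", "\ufe58",
        |         "\u06d4", "\u2043", "\u02d7", "\u2212", "\u2796",
        |         "\u2cba",
        |     }
        | 
        |     def fold_hyphens(s: str) -> str:
        |         """Replace each hyphen-minus lookalike with '-'."""
        |         return "".join("-" if ch in LOOKALIKES else ch for ch in s)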
        
         | markus_zhang wrote:
         | I think it's a good idea to write a plugin for any IDE to
         | highlight those confusing characters.
        
           | samatman wrote:
           | VSCode does this out of the box actually. Ended up putting a
           | few on a whitelist while writing Julia, where it can get kind
           | of ugly (puts a yellow box around them).
        
           | userbinator wrote:
           | Using an ASCII-only font automatically shows all characters
           | that IMHO should not be present in source code.
        
             | metadat wrote:
              | Some platforms, such as python3, have full UTF-8 support
              | already, so what is the problem?
        
               | userbinator wrote:
               | The one shown very clearly by this article.
        
               | metadat wrote:
               | Thanks usrbinator.. _guilty grimace smile_
               | 
                | Maybe highlighting isn't such a bad idea :)
        
               | keybored wrote:
               | The wrong values are from PDF files. Maybe you mean using
               | a system-wide ASCII-only font but you finished your point
               | with "should not be present in source code". Source code
               | wasn't the problem here.
        
               | foobarchu wrote:
               | It very much is a problem in source code too though. It's
               | unfortunately common in college courses (particularly
               | non-CS courses with programming like bioinformatics) for
               | instructors to distribute sample code as word docs. Cue
               | students who can't run the code and don't know why
               | because Word helpfully converted all double quotes to a
               | "prettier" Unicode equivalent.
        
               | keybored wrote:
               | Bizarrely I have experienced the same thing from Latex
               | with its purpose-made code/literal blocks.
               | 
               | But the most shocking thing are printed learning
               | resources on things like Haskell where the code examples
               | _on purpose_ are some kind of typographic printout rather
               | than just the symbols themselves!
        
             | lifthrasiir wrote:
             | String literals frequently have non-ASCII characters to say
             | the least.
        
             | powersnail wrote:
             | That would make it impossible to edit non-ascii strings,
             | like texts in foreign languages. As far as I know, most
              | editors/IDEs don't support switching fonts for string
             | literals. It is more feasible for a syntax highlighter to
             | highlight non-ascii characters outside of literals.
        
               | Someone wrote:
                | > As far as I know, most editors/IDEs don't support
               | switching fonts for string literals
               | 
                | When asked to render a Unicode character that isn't
                | present in the font, modern OSes will automatically pick a
               | font that has it.
               | 
               | https://en.wikipedia.org/wiki/Fallback_font: _"A fallback
               | font is a reserve typeface containing symbols for as many
               | Unicode characters as possible. When a display system
               | encounters a character that is not part of the repertoire
               | of any of the other available fonts, a symbol from a
               | fallback font is used instead. Typically, a fallback font
               | will contain symbols representative of the various types
               | of Unicode characters."_
               | 
               | That can be avoided, for example by storing text as "one
               | character per byte", but I don't think many editors do
               | that nowadays.
        
               | powersnail wrote:
               | But that would not distinguish between chars inside a
               | string literal and chars outside of a string literal.
        
             | keybored wrote:
             | For every such Unicode problem (which is a data input^W
             | source problem, not a programming source code error) there
             | are fifty problems caused by the anemic ASCII character set
             | like Unix toothpicks and three layers of escaping due to
             | using too uniform delimiters.
             | 
             | (Granted this is heavily biased since so much source code
             | is ASCII-only so you don't get many Unicode problems in the
             | first place...)
        
             | makeitdouble wrote:
             | A note on non-ascii in code: I thought of it as an
             | abomination, until hitting test pattern descriptors.
             | 
              | On a project targeted at non-English-speaking devs with a
              | strong domain knowledge requirement, writing the test
             | patterns (endless arrays of input -> expected output
             | sequences, interspersed with adjustment code) in the native
             | language saves an incredible amount of time and effort, in
             | particular as we don't need to translate obscure notions
             | into even more obscure English.
             | 
              | And that had very few downsides as it's not production
              | running code, lining will still raise anything problematic,
              | and the whole thing is easier to get reviewed by non
              | domain experts.
             | 
             | We could have made a translation layer to have the content
             | in a spreadsheet and convert it to test code, but that's
             | not any more stable than having unicode names straight into
             | the code.
        
               | nine_k wrote:
                | String constants / symbols are one domain; keywords and
                | reserved characters, another. They should be checked for
                | different things. E.g. spell-checking string constants as
                | plain text if they look like plain text is helpful.
               | Checking for non-ASCII quotes / dashes / other
               | punctuation outside quoted strings, where they can only
               | occur by mistake, is _also_ helpful.
        
               | makeitdouble wrote:
               | My comment got mistakenly autocorrected (meant "linting"
               | instead of "lining"), which is so on point given the
               | subject.
               | 
               | I agree, and think a decent linter can deal with these
               | issues, and syntax highlighting as well.
               | 
                | In particular these kinds of rules tend to get complicated
               | with many exceptions (down to specific folders needing
               | dedicated rules), so doing it as lint and not at the
               | language level gives a lot of freedom on where and how to
               | apply the rules and raise warnings.
        
             | oneeyedpigeon wrote:
             | It depends on whether you count html as "source code", but
             | if so, then non-ASCII characters absolutely _should_ be
             | present!
        
             | PaulHoule wrote:
              | It's a very unpopular opinion but I use as much Unicode as
              | I can in source code. In comments for instance I can write
              | x²
              | 
              | as well as italic and bold characters (would have demoed
              | but HN filters out Unicode bold & italics) and I can write
              | a test named
              | 
              |     processes中文Characters()
             | 
             | and also write Java that looks like APL, add sigil
             | characters in code generated stubs that will never conflict
             | with other people's code because they're too afraid to use
             | these characters, etc.
             | 
             | https://github.com/paulhoule/ferocity/blob/main/ferocity-
             | std...
             | 
             | People will ask "how do you enter those characters?" and I
             | say "I don't know but I can cut and paste them, they get
             | offered by the autocomplete, etc."
        
               | Arnt wrote:
                | Hardly unpopular where I live. Lots of source code
                | contains € and much else. Grepping for it in the code I
                | worked on last week, I find non-ASCII characters in
                | dozens of tests, in some scripts that seem to be part of
                | CI, in a comment about a locale-specific bug, and I
                | stopped looking there.
               | 
               | How to enter them? Well, the keys are on the keyboard.
        
               | PaulHoule wrote:
               | If you're in Euro land.
               | 
               | I have a lot of personal interest in Chinese language
               | content these days, I have no idea how to set up and use
               | an "input method" but I either see the text I want in
               | front of me or ask an LLM "How do I write X in Chinese?"
               | and either way cut and paste.
        
               | sigseg1v wrote:
                | Chinese speakers enter words using the same type of
                | keyboard you would use in North America. The characters
                | are entered as "pinyin", a romanized phonetic method of
                | describing Chinese words. You should
               | be able to enter it into your keyboard on Windows for
               | example by enabling Simplified Chinese / pinyin in the
               | language input settings.
        
           | MrJohz wrote:
           | I know vscode had this feature built in, and it's come in
           | handy a couple of times for me.
        
         | mjevans wrote:
         | Also remember to squash 'wide' characters back to the ASCII
         | table where possible, if the data is being processed by normal
         | tools.
         | 
          | There are honestly so many data-cleaning steps a pipeline could
          | need / have to produce programmatically well-formatted data.
        
         | toastal wrote:
         | And yet all of these serve a different, useful purpose for
         | semantics.
        
           | account42 wrote:
            | As TFA shows, no they don't. They may have been _intended_
            | for different semantics, but once humans come into play, if
            | it looks vaguely correct then it's getting used.
        
         | renhanxue wrote:
          | Three Minus Signs for the Mathematicians under the pi,
          | 
          |     2212 MINUS SIGN
          |     2796 HEAVY MINUS SIGN
          |     02D7 MODIFIER LETTER MINUS SIGN
          | 
          | Seven Dashes for the Dash-lords in their quotes as shown,
          | 
          |     2012 FIGURE DASH
          |     2013 EN DASH
          |     2014 EM DASH
          |     2015 QUOTATION DASH
          |     2E3A TWO-EM DASH
          |     2E3B THREE-EM DASH
          |     FE58 SMALL EM DASH
          | 
          | Nine Hyphens for Word Breakers, one of them the invisible
          | SOFT HYPHEN,
          | 
          |     00AD SOFT HYPHEN
          |     058A ARMENIAN HYPHEN
          |     1400 CANADIAN SYLLABICS HYPHEN
          |     1806 MONGOLIAN TODO SOFT HYPHEN
          |     2010 HYPHEN
          |     2011 NON-BREAKING HYPHEN
          |     2E17 DOUBLE OBLIQUE HYPHEN
          |     2E40 DOUBLE HYPHEN
          |     30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
          | 
          | One for the Dark Word in the QWERTY zone
          | 
          | In the land of ASCII where Basic Latin lie.
          | 
          | One String to rule them all, One String to find them,
          | 
          | One String to bring them all and in the plain-text, bind them
          | 
          | In the land of ASCII where Basic Latin lie.
          | 
          |     002D HYPHEN-MINUS
         | 
         | - @FakeUnicode on Twitter, with apologies to J. R. R. Tolkien
        
         | tracker1 wrote:
         | Yeah, quotes and magic quotes are another set... Nothing like
         | discovering MySQL treats magic quotes as ANSI quotes for
          | purposes of SQL (injection)... AddSlashes wasn't enough.
         | 
         | For what it's worth TFA could still use a regexp, it would just
         | be slightly more complex. But the conditional statement may or
         | may not be faster or easier to reason with.
        
       | LegionMammal978 wrote:
       | Does anyone here know of any actual Unicode-encoded documents
       | that consistently use U+2010 HYPHEN for their hyphens? Among
       | those documents that do distinguish between dash-like characters,
       | the most common usage I've seen is to use U+002D HYPHEN-MINUS for
       | hyphens and U+2212 MINUS SIGN for minus signs, alongside the rest
       | of U+2013 EN-DASH, U+2014 EM-DASH, etc. U+2010 seems
       | conspicuously absent from everything, even when the 'proper'
       | character usage is otherwise adhered to.
        
         | mjevans wrote:
         | If you ever find any, it might be time to ask if a true General
         | AI has been developed. I really doubt most humans bother, and
         | LLMs will copy our mistakes.
        
           | LegionMammal978 wrote:
            | My point is, there _are_ plenty of documents which bother with
           | minus signs, en-dashes, and em-dashes, including Wikipedia,
           | the Unicode Standard itself, and well-edited online news
            | articles. Yet they still don't bother with U+2010 in
           | particular, which makes me question the character's
           | usefulness.
        
             | adrian_b wrote:
             | U+2010 has the advantage that it is not ambiguous and its
             | appearance is predictable. You can never know whether a
             | given typeface will display U+002D as a hyphen or as a
             | minus or en-dash.
             | 
             | The reason why it is seldom used is that all keyboards by
             | default provide only a way to easily type U+002D and the
             | other ASCII characters. The input methods may provide some
             | combination of keys that allows you to enter a minus or an
             | en-dash, but nobody bothers to add an additional key
             | combination for U+2010. The U+002D key could be
             | reconfigured to output U+2010, but this would annoy the
             | programmers who use programming languages where U+002D is
             | used for minus.
             | 
             | So there is no way out of this mess. In programming
             | languages or spreadsheets U+002D is used for minus, while
             | in documents intended for reading, U+002D is used for
             | hyphen, and the appropriate Unicode characters are used for
             | minus and en-dash.
             | 
             | An exception among programming languages was COBOL.
             | Originally it used only a hyphen character, which was used
             | to improve the readability of long identifiers. This was
             | possible because the arithmetic operations were written
             | with words, i.e. SUBTRACT, so there was no need for a minus
             | character.
             | 
              | A few years later (1964-12), when the PL/I language was
              | developed to replace both FORTRAN and COBOL, it introduced
              | the underscore character to replace the hyphen in long
              | identifiers, where the hyphen had been used to improve
              | readability as in COBOL, so that the hyphen/minus
              | character could keep the meaning of minus, as in FORTRAN.
              | This convention has been inherited by most later
              | programming languages, except most dialects of LISP,
              | which typically use a hyphen character in identifiers and
              | do not use a minus character except for the sign of
              | numbers.
        
               | LegionMammal978 wrote:
               | > U+2010 has the advantage that it is not ambiguous and
               | its appearance is predictable. You can never know whether
               | a given typeface will display U+002D as a hyphen or as a
               | minus or en-dash.
               | 
               | The thing is, I've never found a single non-monospace
               | typeface that displays U+002D as a minus sign or en-dash:
               | it seems to be universally rendered shorter than a U+2212
               | or U+2013, whenever the latter have their own glyphs in
               | the first place. I also did some testing on my system
               | some time back, and 99% or more of typefaces treated a
               | U+2010 identically to a U+002D. Only one or two displayed
               | it a smidgeon shorter than a U+002D.
               | 
               | Hence my original question about whether it really is
               | used for that purpose (or any other purpose) in practice.
               | 
               | Meanwhile, you do make a good point regarding programming
               | languages. Though it would seem mostly coincidental to me
               | that their use cases are almost always 'hyphen' or
               | 'minus', as opposed to any of the other meanings of a
               | 'typewriter dash'.
        
               | lispm wrote:
               | > except by most dialects of LISP, which typically use a
               | hyphen character in identifiers and they do not use a
               | minus character, except for the sign of numbers.
               | 
               | In Common Lisp, there is one character SP10 for Hyphen
               | and Minus: https://www.lispworks.com/documentation/HyperS
               | pec/Body/02_ac...
               | 
               | It is used
               | 
               | * in numbers as a sign -> -42
               | 
               | * in a conditional read macro as a minus operator ->
               | #-ARM64(error "This is no 64bit ARM platform")
               | 
               | * in functions as a numeric minus operator -> (- 100 2/3)
               | or (1- 100)
               | 
               | * as a global variable for the currently evaluated REPL
               | expression -> -
               | 
               | * as a hyphen for symbols (incl. identifiers) -> UPDATE-
               | INSTANCE-FOR-DIFFERENT-CLASS
        
             | keybored wrote:
             | For people/text authors who care, hyphen-minus is already
             | hyphen-biased: most hyphen-minuses you encounter from
             | average text authors (who don't care) are meant to be
             | hyphens. And for people who care it is even more slanted:
             | 
             | - They will either use `--` or `---` as poor man's en/em-
             | dash or use the proper symbols
             | 
             | - They might use the proper minus sign but even if they
             | don't: post-processing can guess what is meant as "minus"
             | in basic contexts (and even for math-heavy contexts:
             | hyphens aren't that common)
             | 
             | Furthermore hyphen-minus is rendered as a hyphen already.
             | Not as minus or a dash.
             | 
             | It's like a process of elimination: people who care already
             | treat non-hyphens sufficiently different such that the
             | usage of hyphen-minus is clear: it is just hyphen.
             | 
              | For me these things are mostly about looks and author-
              | intent. Dashes look better than poor man's dashes. Hyphen-
              | minus looks like a hyphen already. And if I use hyphen-
              | minus then I mean hyphen.
             | 
             | And for me it is less about using the correct character at
             | the expense of possible inter-operation: the hyphen-minus
             | is so widespread that I have no idea if 95% of software
             | will even cope with using the real HYPHEN Unicode scalar.
             | (I very much _doubt_ that!)
             | 
             | The last thing is keyboard usability economics. I use en-
             | dash/em-dash a few times per paragraph at most. Hyphens can
             | occur several times a sentence. And since I need hyphen-
             | minus as well (see previous point about interoperability)
             | most keyboard setups will probably need to relegate it to
             | some modifier keybind like AltGr-something... and no one
             | has the patience for typing such a common symbol with a
             | modifier combo.
        
         | lifthrasiir wrote:
         | I too haven't seen any natural use of U+2010. But some Unicode
         | characters are equally underused, often because they are
         | historical or designed for specialized or internal uses. Here
          | U+2010 can be thought of as a normalized form of U+002D after
         | some processing to filter non-hyphens, which justifies its
         | inclusion even when the character itself might not be used
         | much.
        
         | red_admiral wrote:
         | TeX definitely distinguishes between -, -- and --- in text mode
         | (hyphen, en dash, em dash); there are packages for language-
         | specific quotes and hyphenation rules so there may be something
         | out there that does this - ctan/smartmn specifically seems to
         | be dealing with this kind of thing. Mind you, TeX also allows
         | pretty arbitrary remapping of symbols.
        
           | LegionMammal978 wrote:
           | Of course TeX also distinguishes between its dash-like
           | characters. But I'm not talking about TeX but about Unicode,
           | which is the one with the apparently-unused U+2010 HYPHEN.
        
         | tracker1 wrote:
         | It will depend on the source for the input. Odds are every
         | variation of minus and hyphen has appeared in every context at
         | some point.
         | 
         | From a stylistic perspective, it may have been desired for a
         | given appearance even if technically wrong. Just because of a
         | given typeface. I say this as someone who was an artist before
         | learning software programming.
        
       | chithanh wrote:
       | Seems not a good idea to roll your own. What if your software
       | encounters U+2212 "MINUS SIGN" next?
       | 
       | Probably best to just transliterate to ASCII using gettext or
       | unidecode or similar.
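        | 
        | For example, with the unidecode package (a lossy
        | transliteration, so only for fields where ASCII is expected;
        | the sample inputs here are my own):
        | 
        |     from unidecode import unidecode
        | 
        |     # U+2010 HYPHEN and U+2212 MINUS SIGN both come out as '-'
        |     print(unidecode("\u2010$1,234.56"))  # -$1,234.56
        |     print(unidecode("\u2212100"))        # -100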
        
       | samatman wrote:
        | Still broken, alas. '−', named MINUS SIGN, U+2212, is an Sm:
        | Symbol, math. Arguably the one which _should_ be used, meaning
        | the risk of actually encountering it, while ε, is never 0.
       | 
       | As ks2048 points out, the only thing for it is to collect 'em
       | all.
       | 
       | Which is why (shameless plug) I wrote this:
       | https://github.com/mnemnion/runeset
        
       | lifthrasiir wrote:
       | To be clear, you weren't bitten by Unicode but bitten by bad
       | Unicode usages. Which are prevalent enough that any text
        | processing pipeline has to be fuzzy enough to recognize them. I
        | have seen, for example, many uses of the archaic Hangul jamo ㆍ
        | (U+318D) in place of middle dots (U+00B7) or bullets (U+2022),
        | while middle dots and bullets themselves are often confused
        | with each other.
        
         | riffraff wrote:
         | Why bad? This is the intended use for this character
        
           | lifthrasiir wrote:
            | Hyphen is a distinct (but of course very commonly confused)
            | character from minus, which Unicode separately encodes as
            | U+2212. Though it is also possible that an OCR somehow
            | produced a hyphen out of nowhere.
        
           | keybored wrote:
           | Hyphens in front of numbers is not the intended use of
           | hyphen. The PDFs have mangled the symbols.
        
       | rdtsc wrote:
       | A bit off-topic but a thing that jumps out is using floats for
       | currency. Good for examples and small demos but beware using it
       | for anything serious.
        
         | lordmauve wrote:
         | The finance industry mostly uses floats for currency, up until
         | settlement etc.
         | 
         | "What would I get for this share?" can be answered with a
         | float.
         | 
         | "What did I get for selling this share?" should probably be a
         | fixed point value.
        
           | dotancohen wrote:
           | Floats are fine for speculation. But they should not be used
           | to record actual transactions.
           | 
            | I typically use the smallest unit of a currency to store
           | transaction amounts. E.g., for a US transaction of $10, I
           | would store the integer 1000 because that is 1000 cents.
        
             | zie wrote:
             | Or just use decimal numbers instead. Decimal libraries
             | abound. Then you can do rounding however your
             | jurisdiction/bank/etc does it too.
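              | 
              | For example, with Python's stdlib decimal module (the
              | rounding policy here is chosen just for illustration):
              | 
              |     from decimal import Decimal, ROUND_HALF_EVEN
              | 
              |     price = Decimal("19.99")   # exact, unlike float
              |     total = (price * 3).quantize(
              |         Decimal("0.01"), rounding=ROUND_HALF_EVEN)
              |     print(total)  # 59.97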
        
         | zie wrote:
         | I would argue it's not even good for demos or examples :)
        
       | userbinator wrote:
       | In my current font, that hyphen looks very slightly different
       | from the normal ASCII one - it's just a pixel shorter and located
        | a pixel lower. If I force the charset to CP1252 then I get â€
        | which is very obviously not a hyphen.
        
       | mwkaufma wrote:
       | "For dollar figures I find a prefixed dollar symbol and convert
       | the number following it into a float."
       | 
       | Bloombug red flag!!
        
       | makach wrote:
       | *Bitten by regex
        
       | Toxygene wrote:
       | Another option would be to detect and/or normalize Unicode input
       | using the recommendations from the Unicode consortium.
       | 
       | https://www.unicode.org/reports/tr39/
       | 
       | Here's the relevant bit from the doc:
       | 
        | > For an input string X, define skeleton(X) to be the following
        | transformation on the string:
        | 
        |     1. Convert X to NFD format, as described in [UAX15].
        |     2. Remove any characters in X that have the property
        |        Default_Ignorable_Code_Point.
        |     3. Concatenate the prototypes for each character in X
        |        according to the specified data, producing a string of
        |        exemplar characters.
        |     4. Reapply NFD.
        | 
        | The strings X and Y are defined to be confusable if and only if
        | skeleton(X) = skeleton(Y). This is abbreviated as X ≅ Y.
       | 
        | This is obviously talking about comparing two strings to see if
        | they are "confusable", but if you just run the skeleton function
        | on a string, you get a "normalized" version of it.
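        | 
        | A minimal sketch of skeleton() in Python, assuming you've
        | already parsed confusables.txt into a dict mapping each source
        | character to its prototype string (the dict and helper names
        | here are hypothetical):
        | 
        |     import unicodedata
        | 
        |     def skeleton(s, prototypes):
        |         s = unicodedata.normalize("NFD", s)
        |         # Drop default-ignorable code points; a full
        |         # implementation would test the UCD property
        |         # Default_Ignorable_Code_Point, this is just an
        |         # illustrative subset.
        |         ignorable = {"\u00ad", "\u200b", "\u200c", "\u200d", "\ufeff"}
        |         s = "".join(ch for ch in s if ch not in ignorable)
        |         s = "".join(prototypes.get(ch, ch) for ch in s)
        |         return unicodedata.normalize("NFD", s)
        | 
        |     def confusable(x, y, prototypes):
        |         return skeleton(x, prototypes) == skeleton(y, prototypes)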
        
         | lexicality wrote:
         | Python even has a handy function for this:
         | https://docs.python.org/3/library/unicodedata.html#unicodeda...
        
         | jrochkind1 wrote:
         | This was my first thought -- I was specifically thinking the
         | less typically used [K] "compatibility" normalization forms
         | would do it.
         | 
         | But in fact, none of the unicode normalization forms seem to
         | convert a `HYPHEN` to a `HYPHEN-MINUS`. Try it, you'll see!
         | 
         | Unicode considers them semantically different characters, and
         | not normalized.
         | 
          | The default normalization forms NFC and NFD, which are probably
          | the defaults for a "unicode normalize" function, should always
          | result in exactly equivalent glyphs (displayed the same by a
         | given font modulo bugs), just expressed differently in unicode.
         | Like single code point "Latin Small Letter E with Acute"
         | (composed, NFC form); vs two code points "latin small letter e"
         | plus "combining acute accent" (decomposed, NFD form). I would
         | not expect them to change the hyphen characters here -- and
         | they do not.
         | 
         | The "compatibility" normalizations, abbreviated by "K" since
         | "C" was already taken for "composed", WILL change glyphs. For
          | instance, they will normalize a "Superscript One" `¹` or a
          | "Circled Digit 1" `①` to an ordinary "Digit 1" (ascii 49).
         | (which could also be relevant to this problem, and it's
         | important all platforms expose compatibility normalization
         | too!) NFKC for compatibility plus composed, or NFKD for
         | compatibility plus decomposed. I expected/hoped they would
         | change the unicode `HYPHEN` to the ascii `HYPHEN-MINUS` here.
         | 
          | But they don't seem to; the Unicode consortium decided these
          | were not semantically equivalent even at the "compatibility"
          | level.
         | 
         | Unfortunately! I was hoping compatibility normalization would
         | solve it too! The standard unicode normalization forms will not
         | resolve this problem though.
         | 
         | (I forget if there are some _locale-specific_ compatibility
         | normalizations? And if so, maybe they would normalize this? I
         | think of compat normalization as usually being like  "for
         | search results should it match" (sure you want `1` to match
         | `1`), which can definitely be locale specific)
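          | 
          | You can verify this in a REPL (Python 3 shown):
          | 
          |     >>> import unicodedata
          |     >>> unicodedata.normalize("NFKC", "\u00b9")  # SUPERSCRIPT ONE
          |     '1'
          |     >>> unicodedata.normalize("NFKC", "\u2460")  # CIRCLED DIGIT ONE
          |     '1'
          |     >>> unicodedata.normalize("NFKC", "\u2010") == "\u2010"  # HYPHEN unchanged
          |     True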
        
       | wodenokoto wrote:
        | If your source is not consistent enough to give you a consistent
        | hyphen, there are probably a lot of other weird things that are
       | slipping through the cracks.
        
         | tomcam wrote:
         | That's pretty much all real-world datasets
        
           | tracker1 wrote:
           | Considering how many real world data sets are based on hand
           | crafted spreadsheets, absolutely. Especially with copy pasta.
           | 
           | Edit: pasta above was actually meant to be paste, but gesture
           | input is fun. Ironically it's better this way.
        
         | advisedwang wrote:
         | Probably true, but unless you are suggesting the author should
         | abandon the product/feature, the author needs to achieve the
            | best they can given the constraints. Stuff like fixing hyphens
            | gets closer. There's probably a lot more such things their code
         | will end up doing.
        
           | account42 wrote:
           | The author should use a validating parser instead of a simple
           | regular expression and hoping that the result is correct.
           | I.e. the start of the post should have been that the parser
           | errored out rather than that the result was positive.
        
       | ReleaseCandidat wrote:
       | That's why you don't want to use regexes to parse something, but
       | an actual tokenizer which uses Unicode _k_ompatibility
       | normalisation (NFKC or NFKD) for comparisons. Although I'm not
       | sure if that works with the Dollar emoji (which HN doesn't like
       | to display)
        
         | riffraff wrote:
         | Regexes can handle Unicode categories just fine, if he'd
         | written a tokenizer it would still have failed. Which in fact
         | it did when he removed the regex.
        
           | ReleaseCandidat wrote:
            | It's not about categories, which don't help with such
            | problems, but comparison using compatibility normalisation
            | (either NFKC or NFKD). Using that, e.g. Ⅰ and ⅰ (the Roman
            | numeral literals) compare equal to I and i, ① (the 1 in a
            | circle) to 1, and so on for all the other Unicode code
            | points which have the same meaning.
        
             | riffraff wrote:
             | but that's not about using a tokenizer vs a regex, it's
             | about using a normalization step, which would also work
             | with the regex.
        
               | ReleaseCandidat wrote:
               | Yes, that's true. Except it isn't (well, doesn't have to
               | be) an extra step in the tokenizer. Most of the time you
               | do not want to run the whole string or part of it through
               | the kompatibility normalization but just some code points
               | (like the sign of a float). Which could of course be done
                | with a match group of a regexp too. I have just made the
                | observation through the last two decades that it's easier
                | not to forget about such cases when not using regexps.
        
       | kccqzy wrote:
        | Run this:
        | 
        |     >>> import unicodedata
        |     >>> unicodedata.category('\N{MINUS SIGN}')
        |     'Sm'
       | 
       | There you go. No need to thank me for breaking your code.
       | 
       | Also, nobody has yet commented on the fact that the author is
       | also doing PDF text extraction. That's yet another area where a
       | lot of fuzziness needs to be applied. My confidence in the
       | author's product greatly decreased after reading this post.
        
       | riffraff wrote:
        | FWIW, Python's third-party regex module supports checking for
        | Unicode properties via the \p{SOME NAME} syntax (the stdlib re
        | module does not), but as people said there are a lot more weird
        | edge cases. Btw, it looks like the code may also have a couple
        | of lurking bugs (parsing floats vs decimals, implicit locale
        | number formatting).
       | 
       | I feel all "import data from multiple sources" I've seen in my
       | life grew through repeated application of edge case handling.
        
       | bobbylarrybobby wrote:
       | I'd be concerned about `value[0]` -- how does that work in the
       | face of multi byte characters? Is all string indexing in Python
       | O(n)? Does it store whether a given string is ascii-only and
       | switch to constant time lookup if it is?
        
         | Sniffnoy wrote:
         | Python 3 actually uses UTF-32, so it's all constant-time. A
         | tradeoff few make, certainly!
        
           | lifthrasiir wrote:
           | Or more accurately, behaves as if it is UTF-32. The actual
           | implementation uses multiple internal representations, just
           | like JS engines emulating UCS-2.
        
             | Sniffnoy wrote:
             | Huh! I was unaware (of both of those), thanks.
        
         | jerf wrote:
         | The cost of string indexing isn't relevant for a hard-coded
         | zero index. It affects what you might get back but that's O(1)
         | regardless of implementation.
        
       | wonnage wrote:
       | I feel like the responsible thing to do here is throw an error if
       | you encounter an unexpected character. Others have already
       | pointed out that there's an actual minus sign character that
       | would break this. This code is dealing with like four different
       | tricky/unpredictable things (parsing PDFs, parsing strings,
       | unicode, money) and the lack of basic exception handling should
       | raise alarm bells.
        
       | stouset wrote:
       | This highlights a way I constantly see people misuse regex: they
       | aren't specific enough. You weren't bitten by Unicode, you were
       | bitten by lazy and unprincipled parsing. Explicitly and strictly
       | parse _every_ character.
       | 
        | For here, assuming you already have _only_ the numeric value
        | as a token, the regex should look like
        | 
        |     / ^ -? [0-9]+ ( \. [0-9]+ )? $ /x
        | 
        | or something similar. Match the beginning and end of the string
        | and everything in between: an optional hyphen, any number of
        | digits, and an optional decimal component. Feel free to adjust
        | the details to match your spec, but _any_ unexpected character
        | will fail to parse.
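        | 
        | A rough Python rendering of the same idea (assuming the amount
        | token has already been isolated):
        | 
        |     import re
        | 
        |     AMOUNT = re.compile(r"""
        |         -?              # optional ASCII hyphen-minus sign
        |         [0-9]+          # integer part
        |         ( \. [0-9]+ )?  # optional decimal component
        |     """, re.VERBOSE)
        | 
        |     def parse_amount(token):
        |         if AMOUNT.fullmatch(token) is None:
        |             # Any unexpected character (e.g. U+2010 HYPHEN)
        |             # fails loudly instead of being silently misparsed.
        |             raise ValueError(f"not a plain numeric token: {token!r}")
        |         return float(token)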
        
         | bregma wrote:
         | That should be an optional minus sign, not an optional hyphen.
          | Also, the radix character is locale-dependent so you should
          | use a character class for it.
        
           | nine_k wrote:
           | Locale-dependent parsing is a bit more complicated.
           | 
            | For instance, you likely want to accept locale-specific
            | numerals: ٧, ৭, ７, and 七 all mean "seven" (and the first
            | three match the \d character class), but you likely don't
            | want to accept a string as a valid number if different
            | types of digits are mixed together.
            | 
            | Also, 1,23,456.78 is fine in an Indian locale, but likely is
            | a typo in the en_US or en_GB locales.
        
           | IsTom wrote:
            | > you should use a character class
           | 
           | That depends on locale. Is "1,222" 1222 or 1.222?
        
             | account42 wrote:
             | But it definitely should not be the global process locale
             | if you are parsing something that doesn't originate from
             | the user's environment (and even then using something fixed
             | like en_US or the saner en_DK unless a locale is explicitly
              | requested for the invocation makes sense).
        
           | stouset wrote:
           | Sure, the details depend on the exact format you're trying to
           | parse. But the point is that you should strictly and
           | explicitly match every component of the string.
        
       | pino82 wrote:
       | Bitten by MS Outlook... ^^
        
         | emmelaich wrote:
         | Indeed, very familiar to those who copy code from Outlook,
         | Word, and some websites.
         | 
            | Even some manpages had (or still do have) a hyphen instead
            | of a minus for the option character. Argh!
        
           | blueflow wrote:
            | Known bug: https://lists.debian.org/debian-
            | devel/2023/10/msg00085.html
            | 
            |     This issue does indeed have a history of provoking
            |     unhinged lunacy.
        
       | SomewhatLikely wrote:
       | Where I thought this might be going from the first paragraph:
       | 
       | Negative numbers are sometimes represented with parentheses:
       | (234.58)
       | 
        | Tables sometimes tell you in the description that all numbers
        | are in 1000's or millions.
       | 
       | The dollar sign is used by many currencies, including in
       | Australia and Canada.
       | 
       | I'd probably look around for some other gotchas. Here's one page
       | on prices in general: https://gist.github.com/rgs/6509585 but
       | interestingly doesn't quite cover the OP's problem or the ones I
       | brought up, though the use cases are slightly different.
        
         | oneeyedpigeon wrote:
         | I was certain that it was going to be a range of numbers that
          | didn't use an en dash.
        
       | eviks wrote:
        | > Inspecting the hyphen.
        | 
        | > I pulled in the standard library module unicodedata and
        | starting checking things.
       | 
       | Or you could extend your editor to show Unicode character name in
       | the status bar and do the inspection in a more immediate way
        
         | wonger_ wrote:
         | Or in vim, `ga` when hovered over a character
        
       | bluecalm wrote:
       | Reading my code from some years ago I can see I was very
       | frustrated by a similar problem when parsing .csv files from some
        | financial institutions:
        | 
        |     # converts idiotic number format containing random junk
        |     # into a normal representation of a number
        |     def junkno_to_normal(s):
        |         (...)
       | 
       | There are so many random characters you can insert into dates or
       | numbers. Not only hyphens but also all kind of white spaces or
       | invisible characters. It's always a warm feeling when you import
        | a text document and not only is it encoded in UTF-8 but you see
        | the YYYY-MM-DD date format. You know it's going to be safe from
        | there. Unfortunately it's still very rare in my experience (even
        | the UTF-8 part).
        
       | devit wrote:
       | This fix makes no sense:
       | 
        |     if is_hyphen(value[0]) and value[1] == "$":
        |         converted_value = float(re.sub(r"[^.0-9]", "", value)) * -1
       | 
        | If the strategy is to delete all non-numeric characters in
        | re.sub, you should instead replace _all_ characters that could
        | be a minus with '-' before doing the float(re.sub(...)), and
        | keep '-' in the allowed set, instead of this bizarre ad-hoc
        | code.
       | 
       | Also "is_hyphen" is wrong since it doesn't handle the Unicode
       | minus sign.
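        | 
        | A runnable sketch of that replace-first approach (the
        | Pd-category test and the sample input are my own illustration,
        | not the article's code):
        | 
        |     import re
        |     import unicodedata
        | 
        |     def normalize_minus(value):
        |         # Fold every dash-like character (category Pd) and the
        |         # real U+2212 MINUS SIGN to an ASCII '-'.
        |         return "".join(
        |             "-" if unicodedata.category(ch) == "Pd" or ch == "\u2212"
        |             else ch
        |             for ch in value
        |         )
        | 
        |     value = "\u2010$1,234.56"   # hypothetical input from the PDF
        |     # keep '-' in the character class so the sign survives
        |     converted_value = float(re.sub(r"[^-.0-9]", "", normalize_minus(value)))
        |     print(converted_value)      # -1234.56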
        
       | jstanley wrote:
       | But if they explicitly wrote a HYPHEN instead of a HYPHEN-MINUS
       | or some other type of minus sign, doesn't that suggest it's
       | actually not a minus sign and the number shouldn't be negative?
        
         | pornel wrote:
         | Unicode is not that semantic. It inherited ASCII (with no
         | minus) and a ton of presentational (mis)uses of code points.
         | 
         | It's so messy that Unicode discourages use of APOSTROPHE for
         | apostrophes, and instead recommends using RIGHT SINGLE
         | QUOTATION MARK for _apostrophes_.
        
           | oneeyedpigeon wrote:
           | > Unicode discourages use of APOSTROPHE
           | 
           | Blame fonts that render APOSTROPHE as a disgusting straight
           | character.
        
             | account42 wrote:
             | Surely you mean a pretty straight and symmetric character,
             | the ideal all characters should aspire to.
        
             | pornel wrote:
              | Because in ASCII it also plays the role of the left single
             | quote, so you get a geometric compromise.
        
         | chatmasta wrote:
         | Sure, misinterpreting user intent could cost a lot of money --
         | $100 or more, if you're not careful.
        
       | lynx23 wrote:
        | And don't forget to check if your .startswith takes a regex,
       | because -$ will give you unexpected headaches even without the
       | multitude of hyphens.
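        | 
        | For instance, in Python str.startswith is a plain prefix test,
        | while the same two characters as a regex mean something else
        | entirely:
        | 
        |     >>> "-$100".startswith("-$")       # literal prefix: fine
        |     True
        |     >>> import re
        |     >>> bool(re.match("-$", "-$100"))  # '$' is an anchor here
        |     False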
        
       | langsoul-com wrote:
        | If anyone has worked with spreadsheets across Mac, Windows,
        | Linux, and various online ones, they'll know those are also a
        | nightmare.
        | 
        | Some characters are encoded differently based on what system
        | set them. So an if-statement character comparison runs into the
        | same misery as the author's :(
        
       | evOve wrote:
        | They should simply drop HYPHEN (U+2010). I don't see the purpose
        | of having an extra identity.
        
         | lifthrasiir wrote:
         | U+2010 was a very early addition to Unicode (1.1) and Unicode
         | characters are never removed once encoded, even when there are
         | glaring errors [1].
         | 
         | [1] https://www.unicode.org/policies/stability_policy.html
        
         | keybored wrote:
         | The purpose is to have an unambiguous character for where that
         | matters. This has been covered.
        
       | amiga386 wrote:
       | What's old is new again. People who use the wrong tools produce
       | data in the wrong format.
       | 
       | You used to get people writing web pages in Microsoft Word, a
       | tool designed for human prose, and so has "smart quotes" on by
        | default, hence they write:
        | 
        |     <div class=“a b c d”>
        | 
        | which is parsed as:
        | 
        |     <div class="“a" b="" c="" d”="">
       | 
        | because _smart quotes aren't quotes_. The author used the wrong
       | tool for composing text. They should have used a text editor.
       | 
       | I also find that even people in text editors sometimes
       | accidentally type some combination that is invisibly wrong, for
       | example Option+Space on macOS is a non-breaking space (U+00A0)
       | rather than regular space (U+0020) and that's quite easy to type
       | accidentally, especially If You're Adding Capitals because shift
       | and option are pretty near each other.
       | 
       | Sometimes people also manage to insert carriage returns and/or
       | linefeeds in what's supposed to be a single-line input value, so
       | regular expressions using "." to match anything don't go beyond
       | the first newline unless you turn on the "multiline" flag.
       | 
       | None of this is unicode specifically, it's just the age old
       | problem of human ingenuity in providing nonstandard data, and
       | whether _you_ do workarounds to fix it, or you make the supplier
       | fix it.
        
         | oneeyedpigeon wrote:
         | > The author used the wrong tool for composing text. They
         | should have used a text editor.
         | 
         | Then you have the opposite problem: most text editors make it
         | non-trivial to work with unicode. I mean, I've taken the time
         | to learn how to type curly quotation marks vs. straight ones,
         | but not everyone has and keyboards don't make it easy.
        
           | pavel_lishin wrote:
           | May I ask why you use curly quotation marks instead of the
           | straight ascii ones?
        
             | oneeyedpigeon wrote:
             | In written text, I think they're far more attractive. If I
             | need to put forward some kind of 'objective' argument, then
             | differentiating between open and closed seems to make
             | logical sense. Check out any printed material: 99.9% of the
             | time, it uses curly quotes.
        
           | verandaguy wrote:
           | My mental framework has been:
           | 
           | - Curly quotes are a typographic sugar that's easier on the
           | human eye when reading normal, human-language text. It's
           | reasonable for them to be automatically inserted into your
           | typing in something like a word processor, and depending on
           | which language you're writing in, there may be strong
            | orthographic rules about the use of curly quotes (or their
            | cognates, like « guillemets », etc).
           | 
           | - Straight quotes belong in code by a combination of
           | convention and practicality; unicode characters should be
            | escaped wherever it's practical to do so (for example, if you
            | must use "→" in your code, prefer to write "\u2192" instead
            | -- it's clearer for future users which exact unicode arrow
            | you were using there).
        
         | kragen wrote:
          | i have this problem a lot with markdown, because i very much do
          | want my “” smart quotes in the formatted output, but markdown
          | also (optionally) uses "" for link titles. i recently switched
          | to using () for link titles, which i had forgotten was an
          | option
          | 
          | also i sometimes accidentally replace a " with “ or ”, or a '
          | with a ‘ or ’, inside of `` or an indented code block
        
         | euroderf wrote:
         | Smart quotes are the work of the Devil.
        
         | TheRealPomax wrote:
         | nit: at the time they should have used an HTML editor. Those
         | still existed back then.
        
       | Retr0id wrote:
       | Fun fact, ISO 8601 says you _should_ use U+2212 MINUS to express
       | timestamps with negative timezone offsets. At least, I think it
        | does, I'm going off the Wikipedia description:
       | https://en.wikipedia.org/wiki/ISO_8601#Other_time_offset_spe...
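        | 
        | Whatever the standard says, most parsers only take the ASCII
        | sign; e.g. in CPython 3.11:
        | 
        |     >>> from datetime import datetime
        |     >>> datetime.fromisoformat("2024-09-09T02:38:00-05:00")   # ASCII hyphen-minus
        |     datetime.datetime(2024, 9, 9, 2, 38, tzinfo=...)          # (tzinfo elided)
        |     >>> datetime.fromisoformat("2024-09-09T02:38:00\u221205:00")  # U+2212 MINUS
        |     Traceback (most recent call last):
        |       ...
        |     ValueError: Invalid isoformat string: '2024-09-09T02:38:00−05:00'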
        
         | lifthrasiir wrote:
         | In my understanding that is a misunderstanding. I previously
         | commented about that [1], but in short: a combined hyphen-minus
         | character should be used for any charset based on ISO/IEC 646,
         | which includes Unicode.
         | 
         | [1] https://news.ycombinator.com/item?id=37346702
        
       | hgs3 wrote:
       | Unicode conforming regular expression engines are supposed to
       | support the \p or \P property syntax [1] so you should be able to
       | match hyphen characters with \p{Hyphen} or \p{Dash}.
       | 
       | [1] https://www.unicode.org/reports/tr18/#property_syntax
        
         | account42 wrote:
         | Very nice for Unicode to provide a solution to the problem
         | Unicode created.
        
           | samatman wrote:
           | Unicode did not create the problem of many similar-looking
           | dash-like characters with different meanings and widths.
           | 
           | It documented it, at most.
        
         | gknoy wrote:
          | Thanks for linking this! I also learned that the `\p{}`
          | syntax isn't supported in the Python `re` library, and they
         | recommend the api-compatible `regex` library, which does have
         | support for that.
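          | 
          | A quick sketch with the third-party regex module (pip install
          | regex):
          | 
          |     import regex
          | 
          |     # \p{Pd} is the "Punctuation, dash" general category:
          |     # U+2010 HYPHEN, U+2013 EN DASH, U+2014 EM DASH, etc.
          |     dash_like = regex.compile(r"\p{Pd}")
          | 
          |     print(bool(dash_like.match("\u2010")))  # True  (HYPHEN)
          |     print(bool(dash_like.match("\u2212")))  # False (MINUS SIGN is Sm)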
        
       | rzwitserloot wrote:
       | Isn't "turns out there are _lots_ of look-alikes, often literally
       | pixel-for-pixel identical in most fonts at all sizes, in the
       | unicode tables and that might cause some confusion when parsing
       | text" like.. lesson 101 for unicode?
       | 
       | At any rate, I find the conclusion a bit hilarious: "Ah, the
       | input text uses that symbol that very explicitly DOES NOT MEAN
       | 'minus', it _ONLY_ means hyphen, and would be _the_ unicode
       | solution if for whatever reason you want to render the notion:
       | Hyphen followed by a _positive_ cash amount".. and.. I will
       | completely mess up the whole point and just treat it as a minus
       | sign after all.
       | 
       | What, pray tell, is the point of having all those semantically
       | different but visually identical things in unicode when folks
       | don't even acknowledge that what they are doing is fundamentally
       | at odds with the very text they are reading?
        
         | HelloNurse wrote:
         | There might be a social angle: the input-shitters are assumed
         | to be right, and the IT peons have to understand user intent
         | and make the system work. If the boss says so, hyphen means
         | minus.
        
       | jay-barronville wrote:
       | I don't think fully relying on the Pd Unicode category is ideal
       | though. For example, I don't think you'd want U+2E17 to be
        | matched either.
       | 
       | I think the best solution would be to match by some specific code
       | points, and then throw an error when a strange code point is
       | encountered.
       | 
       | I think it's a mistake to try to handle every edge case in this
       | particular case.
        
       | red_admiral wrote:
       | A safer way to approach any parsing task is to complain if you
       | see a character you don't expect there. If there is a character
       | in front of the dollar sign that is not whitespace, then
       | something is going on and you need to take a look.
        
       | numpad0 wrote:
        | Wow. That's basically what I've heard of as the Kangxi
        | radicals problem. From what I could gather from a 5-minute search,
       | the mechanism is:
       | 
        | PDFs don't use Unicode or ASCII codepoints, but glyph IDs used
        | by fonts. Therefore all strings are converted to sequences of
        | those glyph IDs. The original Unicode or ASCII text is dropped,
        | or _can_ be linked and embedded for convenience. In many cases,
        | a reverse conversion from ID to Unicode is silently done when
        | text is copy-pasted or extracted from a PDF.
       | 
        | That silent automatic reverse conversion tends to pick the
        | numerically smallest Unicode codepoint assigned to the glyph
        | (letter shape), and many fonts reuse close-enough glyphs for
        | obscure Unicode characters like ancient Chinese dictionary
        | header symbols and the dozen Unicode siblings of hyphens.
        | Unicode also tends to have those esoteric symbols higher up in
        | the table than commonly used ones.
       | 
        | Therefore, through conversion into glyph ID and back into
        | Unicode, simple characters like `角` or `-`, whose glyphs tend
        | to get reused to cover those technicalities, sometimes get
        | converted into those technicalities at the remote end.
       | 
       | 1: https://en.wikipedia.org/wiki/Kangxi_radical
       | 
       | 2: use TL:
       | https://espresso3389.hatenablog.com/entry/20090526/124332747...
       | 
       | 3: use TL: https://github.com/trueroad/tr-NTTtech05
       | 
       | 4: use TL: https://anti-
       | rugby.blogspot.com/2020/08/Computer001.html
        
       | cooolbear wrote:
       | > One product of mine takes reports that come in as a table
       | that's been exported to PDF
       | 
       | Here's the first problem!
       | 
       | I can't believe actual businesses think that a report is anything
       | other than for human eyes to look at. Typesetting (and report
       | generation) is for presentation, and otherwise data should be
       | treated like data.
       | 
       | I mean it's a different story if the product is like "we can help
       | process your historical, improperly-formatted documents", but if
       | it's from someone continually generating reports, somebody really
       | should step in and make things more... computational.
        
       | l72 wrote:
       | I wrote a web scraper to scrape products from some Vinyl Record
       | Distributors. It is amazing to me how careless (or clueless)
       | people are with various unicode characters.
       | 
       | I had huge amounts of rules for "unifying" unicode, so I could
       | then run the result through various regular expressions. It
       | wasn't just hyphens, but I'd run into all sorts of weird
       | characters.
       | 
       | It all worked, but was very brittle, and constantly had to be
       | tweaked.
       | 
       | In the end, I used a machine learning model, which I wrote about
       | here[1]
       | 
       | [1] https://blog.line72.net/2024/07/31/the-joys-of-parsing-
       | using...
        
       | TristanBall wrote:
       | So, I guess it's only me who learned from the comments here that
        | there was a difference between em dash and en dash? Or that they
       | might be different from a hyphen or a minus?
       | 
       | (In my defence, I don't work in any of the specialized areas
       | where it matters, and was raised in a poor, ascii only, western
       | household.)
       | 
       | I will point out that spammers and scammers have been having a
       | field day with this kind of character confusion for years now,
       | and a lot of software still hasn't caught up to it.
       | 
       | On the bright side, the very old school database I babysit for
       | work can be convinced to output utf8, including emoji, many of
       | which render quite well in a terminal, allowing me to create bar
       | graphs of poo or love heart characters, which honestly makes it
       | all worth it for me.
        
       ___________________________________________________________________
       (page generated 2024-09-09 23:02 UTC)