[HN Gopher] Bitten by Unicode
___________________________________________________________________
Bitten by Unicode
Author : pryelluw
Score : 117 points
Date : 2024-09-09 02:38 UTC (20 hours ago)
(HTM) web link (pyatl.dev)
(TXT) w3m dump (pyatl.dev)
| ks2048 wrote:
| That's only the tip of the iceberg of hyphen-looking characters.
|
| Here's some more:
|
|     2010 ; 002D ; MA #* ( ‐ → - ) HYPHEN → HYPHEN-MINUS #
|     2011 ; 002D ; MA #* ( ‑ → - ) NON-BREAKING HYPHEN → HYPHEN-MINUS #
|     2012 ; 002D ; MA #* ( ‒ → - ) FIGURE DASH → HYPHEN-MINUS #
|     2013 ; 002D ; MA #* ( – → - ) EN DASH → HYPHEN-MINUS #
|     FE58 ; 002D ; MA #* ( ﹘ → - ) SMALL EM DASH → HYPHEN-MINUS #
|     06D4 ; 002D ; MA #* ( ۔ → - ) ARABIC FULL STOP → HYPHEN-MINUS #
|     2043 ; 002D ; MA #* ( ⁃ → - ) HYPHEN BULLET → HYPHEN-MINUS #
|     02D7 ; 002D ; MA #* ( ˗ → - ) MODIFIER LETTER MINUS SIGN → HYPHEN-MINUS #
|     2212 ; 002D ; MA #* ( − → - ) MINUS SIGN → HYPHEN-MINUS #
|     2796 ; 002D ; MA #* ( ➖ → - ) HEAVY MINUS SIGN → HYPHEN-MINUS #
|     2CBA ; 002D ; MA #  ( Ⲻ → - ) COPTIC CAPITAL LETTER DIALECT-P NI → HYPHEN-MINUS #
|
| copied from
| https://www.unicode.org/Public/security/8.0.0/confusables.tx...
| markus_zhang wrote:
| I think it's a good idea to write a plugin for any IDE to
| highlight those confusing characters.
| samatman wrote:
| VSCode does this out of the box, actually (it puts a yellow
| box around them). Ended up putting a few on a whitelist while
| writing Julia, where it can get kind of ugly.
| userbinator wrote:
| Using an ASCII-only font automatically shows all characters
| that IMHO should not be present in source code.
| metadat wrote:
| Some platforms, such as python3, have full UTF-8 support
| already, so what is the problem?
| userbinator wrote:
| The one shown very clearly by this article.
| metadat wrote:
| Thanks usrbinator.. _guilty grimace smile_
|
| Maybe highlighting isn't such a bad idea :)
| keybored wrote:
| The wrong values are from PDF files. Maybe you mean using
| a system-wide ASCII-only font but you finished your point
| with "should not be present in source code". Source code
| wasn't the problem here.
| foobarchu wrote:
| It very much is a problem in source code too though. It's
| unfortunately common in college courses (particularly
| non-CS courses with programming like bioinformatics) for
| instructors to distribute sample code as word docs. Cue
| students who can't run the code and don't know why
| because Word helpfully converted all double quotes to a
| "prettier" Unicode equivalent.
| keybored wrote:
| Bizarrely I have experienced the same thing from Latex
| with its purpose-made code/literal blocks.
|
| But the most shocking thing are printed learning
| resources on things like Haskell where the code examples
| _on purpose_ are some kind of typographic printout rather
| than just the symbols themselves!
| lifthrasiir wrote:
| String literals frequently have non-ASCII characters to say
| the least.
| powersnail wrote:
| That would make it impossible to edit non-ascii strings,
| like texts in foreign languages. As far as I know, most
| editors/IDE don't support switching fonts for string
| literals. It is more feasible for a syntax highlighter to
| highlight non-ascii characters outside of literals.
| Someone wrote:
| > As far as I know, most editors/IDE don't support
| switching fonts for string literals
|
| When asked to render a Unicode character that isn't
| present in the font, modern OSes will automatically pick a
| font that has it.
|
| https://en.wikipedia.org/wiki/Fallback_font: _"A fallback
| font is a reserve typeface containing symbols for as many
| Unicode characters as possible. When a display system
| encounters a character that is not part of the repertoire
| of any of the other available fonts, a symbol from a
| fallback font is used instead. Typically, a fallback font
| will contain symbols representative of the various types
| of Unicode characters."_
|
| That can be avoided, for example by storing text as "one
| character per byte", but I don't think many editors do
| that nowadays.
| powersnail wrote:
| But that would not distinguish between chars inside a
| string literal and chars outside of a string literal.
| keybored wrote:
| For every such Unicode problem (which is a data input^W
| source problem, not a programming source code error) there
| are fifty problems caused by the anemic ASCII character set
| like Unix toothpicks and three layers of escaping due to
| using too uniform delimiters.
|
| (Granted this is heavily biased since so much source code
| is ASCII-only so you don't get many Unicode problems in the
| first place...)
| makeitdouble wrote:
| A note on non-ascii in code: I thought of it as an
| abomination, until hitting test pattern descriptors.
|
| On a project targeted at non-English-speaking devs with a
| strong domain knowledge requirement, writing the test
| patterns (endless arrays of input -> expected output
| sequences, interspersed with adjustment code) in the native
| language saves an incredible amount of time and effort, in
| particular as we don't need to translate obscure notions
| into even more obscure English.
|
| And that had very little downsides as it's not production
| running code, lining will still raise anything problematic,
| and the whole thing is easier to get reviewed by non-domain
| experts.
|
| We could have made a translation layer to have the content
| in a spreadsheet and convert it to test code, but that's
| not any more stable than having unicode names straight into
| the code.
| nine_k wrote:
| String constants / symbols are one domain; keywords and
| reserved characters, another. They should be checked for
| different things. E.g. spell-checking string constants as
| plain text if they look as plain text is helpful.
| Checking for non-ASCII quotes / dashes / other
| punctuation outside quoted strings, where they can only
| occur by mistake, is _also_ helpful.
| makeitdouble wrote:
| My comment got mistakenly autocorrected (meant "linting"
| instead of "lining"), which is so on point given the
| subject.
|
| I agree, and think a decent linter can deal with these
| issues, and syntax highlighting as well.
|
| In particular these kinds of rules tend to get complicated
| with many exceptions (down to specific folders needing
| dedicated rules), so doing it as lint and not at the
| language level gives a lot of freedom on where and how to
| apply the rules and raise warnings.
| oneeyedpigeon wrote:
| It depends on whether you count html as "source code", but
| if so, then non-ASCII characters absolutely _should_ be
| present!
| PaulHoule wrote:
| It's a very unpopular opinion but I use as much Unicode as
| I can in source code. In comments, for instance, I can write
| x²
|
| as well as italic and bold characters (would have demoed
| but HN filters out Unicode bold & italics) and I can write
| a test named processes中文Characters()
|
| and also write Java that looks like APL, add sigil
| characters in code generated stubs that will never conflict
| with other people's code because they're too afraid to use
| these characters, etc.
|
| https://github.com/paulhoule/ferocity/blob/main/ferocity-std...
|
| People will ask "how do you enter those characters?" and I
| say "I don't know but I can cut and paste them, they get
| offered by the autocomplete, etc."
| Arnt wrote:
| Hardly unpopular where I live. Lots of source code
| contains € and much else. Grepping for it in the code I
| worked on last week, I find non-ASCII characters in dozens
| of tests, in some scripts that seem to be part of CI, in
| a comment about a locale-specific bug, and I stopped
| looking there.
|
| How to enter them? Well, the keys are on the keyboard.
| PaulHoule wrote:
| If you're in Euro land.
|
| I have a lot of personal interest in Chinese language
| content these days, I have no idea how to set up and use
| an "input method" but I either see the text I want in
| front of me or ask an LLM "How do I write X in Chinese?"
| and either way cut and paste.
| sigseg1v wrote:
| Chinese speakers enter words using the same type of keyboard
| you would use in North America. The characters are entered as
| "pinyin", a romanized phonetic method of describing Chinese
| words. You should
| be able to enter it into your keyboard on Windows for
| example by enabling Simplified Chinese / pinyin in the
| language input settings.
| MrJohz wrote:
| I know vscode had this feature built in, and it's come in
| handy a couple of times for me.
| mjevans wrote:
| Also remember to squash 'wide' characters back to the ASCII
| table where possible, if the data is being processed by normal
| tools.
|
| There are honestly so many data-cleaning steps a pipeline could
| need to produce programmatically well-formatted data.
| toastal wrote:
| And yet all of these serve a different, useful purpose for
| semantics.
| account42 wrote:
| As TFA shows, no they don't. They may have been _intended_
| for different semantics, but once humans come into play, if
| it looks vaguely correct then it's getting used.
| renhanxue wrote:
| Three Minus Signs for the Mathematicians under the π,
|     2212 MINUS SIGN
|     2796 HEAVY MINUS SIGN
|     02D7 MODIFIER LETTER MINUS SIGN
|
| Seven Dashes for the Dash-lords in their quotes as shown,
|     2012 FIGURE DASH
|     2013 EN DASH
|     2014 EM DASH
|     2015 QUOTATION DASH
|     2E3A TWO-EM DASH
|     2E3B THREE-EM DASH
|     FE58 SMALL EM DASH
|
| Nine Hyphens for Word Breakers, one of them ­,
|     00AD SOFT HYPHEN
|     058A ARMENIAN HYPHEN
|     1400 CANADIAN SYLLABICS HYPHEN
|     1806 MONGOLIAN TODO SOFT HYPHEN
|     2010 HYPHEN
|     2011 NON-BREAKING HYPHEN
|     2E17 DOUBLE OBLIQUE HYPHEN
|     2E40 DOUBLE HYPHEN
|     30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
|
| One for the Dark Word in the QWERTY zone
|
| In the land of ASCII where Basic Latin lie.
|
| One String to rule them all, One String to find them,
|
| One String to bring them all and in the plain-text, bind them
|
| In the land of ASCII where Basic Latin lie.
|     002D HYPHEN-MINUS
|
| - @FakeUnicode on Twitter, with apologies to J. R. R. Tolkien
| tracker1 wrote:
| Yeah, quotes and magic quotes are another set... Nothing like
| discovering MySQL treats magic quotes as ANSI quotes for
| purposes of SQL (injection)... AddSlashes wasn't enough.
|
| For what it's worth TFA could still use a regexp, it would just
| be slightly more complex. But the conditional statement may or
| may not be faster or easier to reason with.
| LegionMammal978 wrote:
| Does anyone here know of any actual Unicode-encoded documents
| that consistently use U+2010 HYPHEN for their hyphens? Among
| those documents that do distinguish between dash-like characters,
| the most common usage I've seen is to use U+002D HYPHEN-MINUS for
| hyphens and U+2212 MINUS SIGN for minus signs, alongside the rest
| of U+2013 EN-DASH, U+2014 EM-DASH, etc. U+2010 seems
| conspicuously absent from everything, even when the 'proper'
| character usage is otherwise adhered to.
| mjevans wrote:
| If you ever find any, it might be time to ask if a true General
| AI has been developed. I really doubt most humans bother, and
| LLMs will copy our mistakes.
| LegionMammal978 wrote:
| My point is, there _are_ plenty of documents which bother
| with minus signs, en-dashes, and em-dashes, including
| Wikipedia, the Unicode Standard itself, and well-edited
| online news articles. Yet they still don't bother with
| U+2010 in particular, which makes me question the
| character's usefulness.
| adrian_b wrote:
| U+2010 has the advantage that it is not ambiguous and its
| appearance is predictable. You can never know whether a
| given typeface will display U+002D as a hyphen or as a
| minus or en-dash.
|
| The reason why it is seldom used is that all keyboards by
| default provide only a way to easily type U+002D and the
| other ASCII characters. The input methods may provide some
| combination of keys that allows you to enter a minus or an
| en-dash, but nobody bothers to add an additional key
| combination for U+2010. The U+002D key could be
| reconfigured to output U+2010, but this would annoy the
| programmers who use programming languages where U+002D is
| used for minus.
|
| So there is no way out of this mess. In programming
| languages or spreadsheets U+002D is used for minus, while
| in documents intended for reading, U+002D is used for
| hyphen, and the appropriate Unicode characters are used for
| minus and en-dash.
|
| An exception among programming languages was COBOL.
| Originally it used only a hyphen character, which was used
| to improve the readability of long identifiers. This was
| possible because the arithmetic operations were written
| with words, i.e. SUBTRACT, so there was no need for a minus
| character.
|
| A few years later (1964-12), when the PL/I language was
| developed to replace both FORTRAN and COBOL, they introduced
| the underscore character, replacing the hyphen in long
| identifiers, where it was used to improve their readability
| like in COBOL, so that the hyphen/minus character could be
| used with the meaning of minus, like in FORTRAN. This
| convention has been inherited by most later programming
| languages, except by most dialects of LISP, which typically
| use a hyphen character in identifiers and do not use a minus
| character, except for the sign of numbers.
| LegionMammal978 wrote:
| > U+2010 has the advantage that it is not ambiguous and
| its appearance is predictable. You can never know whether
| a given typeface will display U+002D as a hyphen or as a
| minus or en-dash.
|
| The thing is, I've never found a single non-monospace
| typeface that displays U+002D as a minus sign or en-dash:
| it seems to be universally rendered shorter than a U+2212
| or U+2013, whenever the latter have their own glyphs in
| the first place. I also did some testing on my system
| some time back, and 99% or more of typefaces treated a
| U+2010 identically to a U+002D. Only one or two displayed
| it a smidgeon shorter than a U+002D.
|
| Hence my original question about whether it really is
| used for that purpose (or any other purpose) in practice.
|
| Meanwhile, you do make a good point regarding programming
| languages. Though it would seem mostly coincidental to me
| that their use cases are almost always 'hyphen' or
| 'minus', as opposed to any of the other meanings of a
| 'typewriter dash'.
| lispm wrote:
| > except by most dialects of LISP, which typically use a
| hyphen character in identifiers and they do not use a
| minus character, except for the sign of numbers.
|
| In Common Lisp, there is one character SP10 for Hyphen
| and Minus:
| https://www.lispworks.com/documentation/HyperSpec/Body/02_ac...
|
| It is used
|
| * in numbers as a sign -> -42
|
| * in a conditional read macro as a minus operator ->
|   #-ARM64 (error "This is no 64bit ARM platform")
|
| * in functions as a numeric minus operator -> (- 100 2/3)
|   or (1- 100)
|
| * as a global variable for the currently evaluated REPL
|   expression -> -
|
| * as a hyphen for symbols (incl. identifiers) ->
|   UPDATE-INSTANCE-FOR-DIFFERENT-CLASS
| keybored wrote:
| For people/text authors who care, hyphen-minus is already
| hyphen-biased: most hyphen-minuses you encounter from
| average text authors (who don't care) are meant to be
| hyphens. And for people who care it is even more slanted:
|
| - They will either use `--` or `---` as poor man's en/em-
| dash or use the proper symbols
|
| - They might use the proper minus sign but even if they
| don't: post-processing can guess what is meant as "minus"
| in basic contexts (and even for math-heavy contexts:
| hyphens aren't that common)
|
| Furthermore hyphen-minus is rendered as a hyphen already.
| Not as minus or a dash.
|
| It's like a process of elimination: people who care already
| treat non-hyphens sufficiently different such that the
| usage of hyphen-minus is clear: it is just hyphen.
|
| For me these things are mostly about looks and author
| intent. Dashes look better than poor man's dashes.
| Hyphen-minus already looks like a hyphen. And if I use
| hyphen-minus then I mean hyphen.
|
| And for me it is less about using the correct character at
| the expense of possible inter-operation: the hyphen-minus
| is so widespread that I have no idea if 95% of software
| will even cope with using the real HYPHEN Unicode scalar.
| (I very much _doubt_ that!)
|
| The last thing is keyboard usability economics. I use en-
| dash/em-dash a few times per paragraph at most. Hyphens can
| occur several times a sentence. And since I need hyphen-
| minus as well (see previous point about interoperability)
| most keyboard setups will probably need to relegate it to
| some modifier keybind like AltGr-something... and no one
| has the patience for typing such a common symbol with a
| modifier combo.
| lifthrasiir wrote:
| I too haven't seen any natural use of U+2010. But some Unicode
| characters are equally underused, often because they are
| historical or designed for specialized or internal uses. Here
| U+2010 can be thought of as a normalized form for U+002D after
| some processing to filter non-hyphens, which justifies its
| inclusion even when the character itself might not be used
| much.
| red_admiral wrote:
| TeX definitely distinguishes between -, -- and --- in text mode
| (hyphen, en dash, em dash); there are packages for language-
| specific quotes and hyphenation rules so there may be something
| out there that does this - ctan/smartmn specifically seems to
| be dealing with this kind of thing. Mind you, TeX also allows
| pretty arbitrary remapping of symbols.
| LegionMammal978 wrote:
| Of course TeX also distinguishes between its dash-like
| characters. But I'm not talking about TeX but about Unicode,
| which is the one with the apparently-unused U+2010 HYPHEN.
| tracker1 wrote:
| It will depend on the source for the input. Odds are every
| variation of minus and hyphen has appeared in every context at
| some point.
|
| From a stylistic perspective, it may have been desired for a
| given appearance even if technically wrong. Just because of a
| given typeface. I say this as someone who was an artist before
| learning software programming.
| chithanh wrote:
| Seems not a good idea to roll your own. What if your software
| encounters U+2212 "MINUS SIGN" next?
|
| Probably best to just transliterate to ASCII using gettext or
| unidecode or similar.
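|
| (A quick sketch of that with the third-party unidecode
| package; the exact output depends on its mapping tables:)
|
|     from unidecode import unidecode
|     # HYPHEN, MINUS SIGN and EN DASH all transliterate to ASCII '-'
|     print(unidecode("\u2010 \u2212 \u2013"))  # '- - -'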
| samatman wrote:
| Still broken, alas. '−', named MINUS SIGN, U+2212, is an Sm:
| Symbol, math. Arguably the one which _should_ be used,
| meaning the risk of actually encountering it, while ε, is
| never 0.
|
| As ks2048 points out, the only thing for it is to collect 'em
| all.
|
| Which is why (shameless plug) I wrote this:
| https://github.com/mnemnion/runeset
| lifthrasiir wrote:
| To be clear, you weren't bitten by Unicode but bitten by bad
| Unicode usages. Which are prevalent enough that any text
| processing pipeline has to be fuzzy enough to recognize them. I
| have seen, for example, many uses of the archaic Hangul jamo
| ㆍ (U+318D) in place of middle dots (U+00B7) or bullets
| (U+2022), while middle dots and bullets themselves are often
| confused with each other.
| riffraff wrote:
| Why bad? This is the intended use for this character
| lifthrasiir wrote:
| Hyphen is a distinct (but of course very commonly confused)
| character from minus, which Unicode separately encodes as
| U+2212. Though it is also possible that an OCR somehow
| produced a hyphen out of nowhere.
| keybored wrote:
| Hyphens in front of numbers is not the intended use of
| hyphen. The PDFs have mangled the symbols.
| rdtsc wrote:
| A bit off-topic but a thing that jumps out is using floats for
| currency. Good for examples and small demos but beware using it
| for anything serious.
| lordmauve wrote:
| The finance industry mostly uses floats for currency, up until
| settlement etc.
|
| "What would I get for this share?" can be answered with a
| float.
|
| "What did I get for selling this share?" should probably be a
| fixed point value.
| dotancohen wrote:
| Floats are fine for speculation. But they should not be used
| to record actual transactions.
|
| I typically use the smallest unit of a currency to store
| transaction amounts. E.g., for a US transaction of $10, I
| would store the integer 1000 because that is 1000 cents.
| zie wrote:
| Or just use decimal numbers instead. Decimal libraries
| abound. Then you can do rounding however your
| jurisdiction/bank/etc does it too.
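|
| (A minimal sketch with Python's stdlib decimal module, making
| the rounding mode explicit; the figures are made up:)
|
|     from decimal import Decimal, ROUND_HALF_UP
|
|     price = Decimal("-100.005")
|     print(price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
|     # -100.01 (ROUND_HALF_UP rounds ties away from zero)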
| zie wrote:
| I would argue it's not even good for demos or examples :)
| userbinator wrote:
| In my current font, that hyphen looks very slightly different
| from the normal ASCII one - it's just a pixel shorter and
| located a pixel lower. If I force the charset to CP1252 then
| I get â€ which is very obviously not a hyphen.
| mwkaufma wrote:
| "For dollar figures I find a prefixed dollar symbol and convert
| the number following it into a float."
|
| Bloombug red flag!!
| makach wrote:
| *Bitten by regex
| Toxygene wrote:
| Another option would be to detect and/or normalize Unicode input
| using the recommendations from the Unicode consortium.
|
| https://www.unicode.org/reports/tr39/
|
| Here's the relevant bit from the doc:
|
| > For an input string X, define skeleton(X) to be the
| following transformation on the string:
|
| > 1. Convert X to NFD format, as described in [UAX15].
| > 2. Remove any characters in X that have the property
|      Default_Ignorable_Code_Point.
| > 3. Concatenate the prototypes for each character in X
|      according to the specified data, producing a string of
|      exemplar characters.
| > 4. Reapply NFD.
|
| > The strings X and Y are defined to be confusable if and
| only if skeleton(X) = skeleton(Y). This is abbreviated as
| X ≅ Y.
|
| This is obviously talking about comparing two strings to see
| if they are "confusable", but if you just run the skeleton
| function on a string, you get a "normalized" version of it.
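|
| (A rough Python sketch of skeleton(), assuming a local copy
| of confusables.txt and skipping the
| Default_Ignorable_Code_Point step; the names are mine:)
|
|     import unicodedata
|
|     def load_prototypes(path="confusables.txt"):
|         # data lines look like "2010 ; 002D ; MA # comment"
|         table = {}
|         for line in open(path, encoding="utf-8-sig"):
|             line = line.split("#", 1)[0].strip()
|             if not line:
|                 continue
|             src, dst, _kind = (f.strip() for f in line.split(";"))
|             table[chr(int(src, 16))] = "".join(
|                 chr(int(cp, 16)) for cp in dst.split())
|         return table
|
|     PROTOTYPES = load_prototypes()
|
|     def skeleton(x):
|         x = unicodedata.normalize("NFD", x)
|         x = "".join(PROTOTYPES.get(ch, ch) for ch in x)  # prototypes
|         return unicodedata.normalize("NFD", x)
|
|     # X and Y are confusable iff their skeletons are equal:
|     print(skeleton("-$100") == skeleton("\u2010$100"))  # True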
| lexicality wrote:
| Python even has a handy function for this:
| https://docs.python.org/3/library/unicodedata.html#unicodeda...
| jrochkind1 wrote:
| This was my first thought -- I was specifically thinking the
| less typically used [K] "compatibility" normalization forms
| would do it.
|
| But in fact, none of the unicode normalization forms seem to
| convert a `HYPHEN` to a `HYPHEN-MINUS`. Try it, you'll see!
|
| Unicode considers them semantically different characters, and
| not normalized.
|
| The default normalization forms NFC and NFD, which are
| probably the defaults for a "unicode normalize" function,
| should always result in exactly equivalent glyphs (displayed
| the same by a given font, modulo bugs), just expressed
| differently in unicode.
| Like single code point "Latin Small Letter E with Acute"
| (composed, NFC form); vs two code points "latin small letter e"
| plus "combining acute accent" (decomposed, NFD form). I would
| not expect them to change the hyphen characters here -- and
| they do not.
|
| The "compatibility" normalizations, abbreviated by "K" since
| "C" was already taken for "composed", WILL change glyphs. For
| instance, they will normalize a "Superscript One" `1` or a
| "Circled Digit 1" `1` to an ordinary "Digit 1" (ascii 49).
| (which could also be relevant to this problem, and it's
| important all platforms expose compatibility normalization
| too!) NFKC for compatibility plus composed, or NFKD for
| compatibility plus decomposed. I expected/hoped they would
| change the unicode `HYPHEN` to the ascii `HYPHEN-MINUS` here.
|
| But they don't seem to; the Unicode consortium decided these
| were not semantically equivalent even at "compatibility" level.
|
| Unfortunately! I was hoping compatibility normalization would
| solve it too! The standard unicode normalization forms will not
| resolve this problem though.
|
| (I forget if there are some _locale-specific_ compatibility
| normalizations? And if so, maybe they would normalize this? I
| think of compat normalization as usually being like "for
| search results, should it match" (sure you want `①` to match
| `1`), which can definitely be locale specific)
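|
| (A quick stdlib check of the point above:)
|
|     import unicodedata
|
|     for form in ("NFC", "NFD", "NFKC", "NFKD"):
|         # U+2010 HYPHEN survives every normalization form
|         print(form, unicodedata.normalize(form, "\u2010") == "-")
|     # all four print False
|
|     print(unicodedata.normalize("NFKC", "\u2460"))  # '1': CIRCLED DIGIT ONE folds
|     print(unicodedata.normalize("NFKC", "\u00b9"))  # '1': SUPERSCRIPT ONE folds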
| wodenokoto wrote:
| If your source is not consistent enough to give you
| consistent hyphens, there are probably a lot of other weird
| things that are
| slipping through the cracks.
| tomcam wrote:
| That's pretty much all real-world datasets
| tracker1 wrote:
| Considering how many real world data sets are based on hand
| crafted spreadsheets, absolutely. Especially with copy pasta.
|
| Edit: pasta above was actually meant to be paste, but gesture
| input is fun. Ironically it's better this way.
| advisedwang wrote:
| Probably true, but unless you are suggesting the author should
| abandon the product/feature, the author needs to achieve the
| best they can given the constraints. Stuff like fixing hyphens
| gets closer. There's probably a lot more such things their code
| will end up doing.
| account42 wrote:
| The author should use a validating parser instead of a simple
| regular expression and hope that the result is correct.
| I.e. the start of the post should have been that the parser
| errored out rather than that the result was positive.
| ReleaseCandidat wrote:
| That's why you don't want to use regexes to parse something, but
| an actual tokenizer which uses Unicode _k_ompatibility
| normalisation (NFKC or NFKD) for comparisons. Although I'm not
| sure if that works with the Dollar emoji (which HN doesn't like
| to display)
| riffraff wrote:
| Regexes can handle Unicode categories just fine, if he'd
| written a tokenizer it would still have failed. Which in fact
| it did when he removed the regex.
| ReleaseCandidat wrote:
| It's not about categories, they don't help with such
| problems, but about comparison using compatibility
| normalisation (either NFKC or NFKD). Using that, e.g. Ⅰ and
| ⅰ (the Roman numeral literals) compare equal to I and i, ①
| (the 1 in a circle) to 1, and so on for all the other Unicode
| code points which have the same meaning.
| riffraff wrote:
| but that's not about using a tokenizer vs a regex, it's
| about using a normalization step, which would also work
| with the regex.
| ReleaseCandidat wrote:
| Yes, that's true. Except it isn't (well, doesn't have to
| be) an extra step in the tokenizer. Most of the time you
| do not want to run the whole string or part of it through
| the kompatibility normalization, but just some code points
| (like the sign of a float). Which could of course be done
| with a match group of a regexp too. I have just made the
| observation through the last two decades that it's easier
| to not forget about such cases when not using regexps.
| kccqzy wrote:
| Run this:
|
|     >>> unicodedata.category('\N{MINUS SIGN}')
|     'Sm'
|
| There you go. No need to thank me for breaking your code.
|
| Also, nobody has yet commented on the fact that the author is
| also doing PDF text extraction. That's yet another area where a
| lot of fuzziness needs to be applied. My confidence in the
| author's product greatly decreased after reading this post.
| riffraff wrote:
| FWIW, Python regexes support checking for Unicode properties
| via the \p{SOME NAME} syntax, but as people said there are a
| lot more weird edge cases. Btw, it looks like the code may
| also have a couple of lurking bugs (parsing floats vs
| decimals, implicit locale number formatting).
|
| I feel all "import data from multiple sources" I've seen in my
| life grew through repeated application of edge case handling.
| bobbylarrybobby wrote:
| I'd be concerned about `value[0]` -- how does that work in the
| face of multi byte characters? Is all string indexing in Python
| O(n)? Does it store whether a given string is ascii-only and
| switch to constant time lookup if it is?
| Sniffnoy wrote:
| Python 3 actually uses UTF-32, so it's all constant-time. A
| tradeoff few make, certainly!
| lifthrasiir wrote:
| Or more accurately, behaves as if it is UTF-32. The actual
| implementation uses multiple internal representations, just
| like JS engines emulating UCS-2.
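|
| (CPython's flexible storage is easy to observe; the byte
| counts below are indicative, not guaranteed:)
|
|     import sys
|
|     print(sys.getsizeof("a" * 100))           # 1 byte per char
|     print(sys.getsizeof("\u0101" * 100))      # 2 bytes per char
|     print(sys.getsizeof("\U0001d11e" * 100))  # 4 bytes per char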
| Sniffnoy wrote:
| Huh! I was unaware (of both of those), thanks.
| jerf wrote:
| The cost of string indexing isn't relevant for a hard-coded
| zero index. It affects what you might get back but that's O(1)
| regardless of implementation.
| wonnage wrote:
| I feel like the responsible thing to do here is throw an error if
| you encounter an unexpected character. Others have already
| pointed out that there's an actual minus sign character that
| would break this. This code is dealing with like four different
| tricky/unpredictable things (parsing PDFs, parsing strings,
| unicode, money) and the lack of basic exception handling should
| raise alarm bells.
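|
| (One way to fail loudly, sketched in Python; the whitelist is
| mine and deliberately strict:)
|
|     import unicodedata
|
|     ALLOWED = set("0123456789.,$-")
|
|     def check_amount(value):
|         for ch in value:
|             if ch not in ALLOWED:
|                 name = unicodedata.name(ch, "UNNAMED")
|                 raise ValueError(
|                     f"unexpected character {ch!r} ({name}) in {value!r}")
|         return value
|
|     check_amount("\u2010$100")  # ValueError: ... (HYPHEN) ...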
| stouset wrote:
| This highlights a way I constantly see people misuse regex: they
| aren't specific enough. You weren't bitten by Unicode, you were
| bitten by lazy and unprincipled parsing. Explicitly and strictly
| parse _every_ character.
|
| For here, assuming you already have _only_ the numeric value
| as a token, the regex should look like
|
|     / ^ -? [0-9]+ ( \. [0-9]+ )? $ /x
|
| or something similar. Match the beginning and end of the
| string and everything in between: an optional hyphen, any
| number of digits, and an optional decimal component. Feel
| free to adjust the details to match your spec, but _any_
| unexpected character will fail to parse.
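|
| (A minimal Python rendering of that idea; the names are mine:)
|
|     import re
|
|     AMOUNT = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")
|
|     def parse_amount(token):
|         if AMOUNT.fullmatch(token) is None:  # anchors both ends
|             raise ValueError(f"unparseable amount: {token!r}")
|         return float(token)
|
|     parse_amount("-234.58")       # -234.58
|     parse_amount("\u2010234.58")  # raises instead of parsing silently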
| bregma wrote:
| That should be an optional minus sign, not an optional hyphen.
| Also, the radix character is locale-dependent so you should
| use a character class for it.
| nine_k wrote:
| Locale-dependent parsing is a bit more complicated.
|
| For instance, you likely want to accept locale-specific
| numerals, and any of ٧ ۷ ७ 七 match the \d character class
| and mean "seven", but you likely don't want to accept a
| string as a valid number if different types of digits are
| mixed together.
|
| Also, 1,23,456.78 is fine in an Indian locale, but likely is
| a typo in the en_US or en_UK locales.
| IsTom wrote:
| > you should use a character class
|
| That depends on locale. Is "1,222" 1222 or 1.222?
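|
| (One way to make that choice explicit, sketched with the
| third-party Babel package:)
|
|     from babel.numbers import parse_decimal
|
|     print(parse_decimal("1,222", locale="en_US"))  # Decimal('1222')
|     print(parse_decimal("1,222", locale="de_DE"))  # Decimal('1.222')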
| account42 wrote:
| But it definitely should not be the global process locale
| if you are parsing something that doesn't originate from
| the user's environment (and even then, using something fixed
| like en_US or the saner en_DK, unless a locale is explicitly
| requested for the invocation, makes sense).
| stouset wrote:
| Sure, the details depend on the exact format you're trying to
| parse. But the point is that you should strictly and
| explicitly match every component of the string.
| pino82 wrote:
| Bitten by MS Outlook... ^^
| emmelaich wrote:
| Indeed, very familiar to those who copy code from Outlook,
| Word, and some websites.
|
| Even some manpages had (or still have) a hyphen instead of
| a minus for the option character. Argh!
| blueflow wrote:
| Known bug:
| https://lists.debian.org/debian-devel/2023/10/msg00085.html
|
| This issue does indeed have a history of provoking unhinged
| lunacy.
| SomewhatLikely wrote:
| Where I thought this might be going from the first paragraph:
|
| Negative numbers are sometimes represented with parentheses:
| (234.58)
|
| Tables sometimes tell you in the description that all numbers
| are in 1000's or millions.
|
| The dollar sign is used by many currencies, including in
| Australia and Canada.
|
| I'd probably look around for some other gotchas. Here's one page
| on prices in general: https://gist.github.com/rgs/6509585 but
| interestingly doesn't quite cover the OP's problem or the ones I
| brought up, though the use cases are slightly different.
| oneeyedpigeon wrote:
| I was certain that it was going to be a range of numbers that
| didn't use an en dash.
| eviks wrote:
| > Inspecting the hyphen.
|
| > I pulled in the standard library module unicodedata and
| started checking things.
|
| Or you could extend your editor to show the Unicode character
| name in the status bar and do the inspection in a more
| immediate way.
| wonger_ wrote:
| Or in vim, `ga` when hovered over a character
| bluecalm wrote:
| Reading my code from some years ago I can see I was very
| frustrated by a similar problem when parsing .csv files from some
| financial institutions:
|
|     # converts idiotic number format containing random junk
|     # into normal represantion of a number
|     def junkno_to_normal(s):
|         (...)
|
| There are so many random characters you can insert into dates or
| numbers. Not only hyphens but also all kinds of white space or
| invisible characters. It's always a warm feeling when you import
| a text document and not only it's encoded in UTF-8 but you see
| YYYY-MM-DD date format. You know it's going to be safe from
| there. Unfortunately it's still very rare in my experience (even
| the UTF-8 bit).
| devit wrote:
| This fix makes no sense:
|
|     if is_hyphen(value[0]) and value[1] == "$":
|         converted_value = float(re.sub(r"[^.0-9]", "", value)) * -1
|
| If the strategy is to delete all non-numeric characters in
| re.sub, you should instead replace _all_ characters that
| could be a minus with '-' before doing the float(re.sub(...)),
| keeping '-' in the character class, instead of this bizarre
| ad-hoc code.
|
| Also "is_hyphen" is wrong since it doesn't handle the Unicode
| minus sign.
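|
| (A sketch of that replace-then-strip approach; the set of
| dash-likes below is illustrative, not exhaustive:)
|
|     import re
|
|     DASH_LIKES = "\u2010\u2011\u2012\u2013\u2014\u02d7\u2212\u2796"
|
|     def parse_money(value):
|         value = re.sub("[%s]" % DASH_LIKES, "-", value)  # fold to '-'
|         return float(re.sub(r"[^-.0-9]", "", value))
|
|     print(parse_money("\u2010$100"))  # -100.0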
| jstanley wrote:
| But if they explicitly wrote a HYPHEN instead of a HYPHEN-MINUS
| or some other type of minus sign, doesn't that suggest it's
| actually not a minus sign and the number shouldn't be negative?
| pornel wrote:
| Unicode is not that semantic. It inherited ASCII (with no
| minus) and a ton of presentational (mis)uses of code points.
|
| It's so messy that Unicode discourages use of APOSTROPHE for
| apostrophes, and instead recommends using RIGHT SINGLE
| QUOTATION MARK for _apostrophes_.
| oneeyedpigeon wrote:
| > Unicode discourages use of APOSTROPHE
|
| Blame fonts that render APOSTROPHE as a disgusting straight
| character.
| account42 wrote:
| Surely you mean a pretty straight and symmetric character,
| the ideal all characters should aspire to.
| pornel wrote:
| Because in ASCII it also plays the role of the left single
| quote, so you get a geometric compromise.
| chatmasta wrote:
| Sure, misinterpreting user intent could cost a lot of money --
| $100 or more, if you're not careful.
| lynx23 wrote:
| And don't forget to check whether your .startswith takes a regex,
| because -$ will give you unexpected headaches even without the
| multitude of hyphens.
| langsoul-com wrote:
| If anyone has worked with spreadsheets across Mac, Windows,
| Linux, and various online ones, they'll know those are also a
| nightmare.
|
| Some characters are encoded differently based on which system
| set them, so an if-statement character comparison runs into
| the same misery as the author's :(
| evOve wrote:
| They should simply drop HYPHEN (U+2010). I don't see the
| purpose of having an extra identity.
| lifthrasiir wrote:
| U+2010 was a very early addition to Unicode (1.1) and Unicode
| characters are never removed once encoded, even when there are
| glaring errors [1].
|
| [1] https://www.unicode.org/policies/stability_policy.html
| keybored wrote:
| The purpose is to have an unambiguous character for where that
| matters. This has been covered.
| amiga386 wrote:
| What's old is new again. People who use the wrong tools produce
| data in the wrong format.
|
| You used to get people writing web pages in Microsoft Word, a
| tool designed for human prose, which has "smart quotes" on by
| default, hence they write:
|
|     <div class=”a b c d”>
|
| which is parsed as:
|
|     <div class="”a" b="" c="" d”="">
|
| because _smart quotes aren 't quotes_. The author used the wrong
| tool for composing text. They should have used a text editor.
|
| I also find that even people in text editors sometimes
| accidentally type some combination that is invisibly wrong, for
| example Option+Space on macOS is a non-breaking space (U+00A0)
| rather than regular space (U+0020) and that's quite easy to type
| accidentally, especially If You're Adding Capitals because shift
| and option are pretty near each other.
|
| Sometimes people also manage to insert carriage returns and/or
| linefeeds in what's supposed to be a single-line input value, so
| regular expressions using "." to match anything don't go beyond
| the first newline unless you turn on the "multiline" flag.
|
| None of this is unicode specifically, it's just the age-old
| problem of human ingenuity in providing nonstandard data, and
| whether _you_ do workarounds to fix it, or you make the supplier
| fix it.
| oneeyedpigeon wrote:
| > The author used the wrong tool for composing text. They
| should have used a text editor.
|
| Then you have the opposite problem: most text editors make it
| non-trivial to work with unicode. I mean, I've taken the time
| to learn how to type curly quotation marks vs. straight ones,
| but not everyone has and keyboards don't make it easy.
| pavel_lishin wrote:
| May I ask why you use curly quotation marks instead of the
| straight ascii ones?
| oneeyedpigeon wrote:
| In written text, I think they're far more attractive. If I
| need to put forward some kind of 'objective' argument, then
| differentiating between open and closed seems to make
| logical sense. Check out any printed material: 99.9% of the
| time, it uses curly quotes.
| verandaguy wrote:
| My mental framework has been:
|
| - Curly quotes are typographic sugar that's easier on the
| human eye when reading normal, human-language text. It's
| reasonable for them to be automatically inserted into your
| typing in something like a word processor, and depending on
| which language you're writing in, there may be strong
| orthographic rules about the use of curly quotes (or their
| cognates, like « guillemets », etc).
|
| - Straight quotes belong in code by a combination of
| convention and practicality; unicode characters should be
| escaped wherever it's practical to do so (for example, if you
| must use "→" in your code, prefer to write "\u2192" instead
| -- it's clearer for future users which exact unicode arrow
| you were using there).
| kragen wrote:
| i have this problem a lot with markdown, because i very much do
| want my “” smart quotes in the formatted output, but markdown
| also (optionally) uses "" for link titles. i recently switched
| to using () for link titles, which i had forgotten was an
| option
|
| also i sometimes accidentally replace a " with “ or ”, or a '
| with a ‘ or ’, inside of `` or an indented code block
| euroderf wrote:
| Smart quotes are the work of the Devil.
| TheRealPomax wrote:
| nit: at the time they should have used an HTML editor. Those
| still existed back then.
| Retr0id wrote:
| Fun fact, ISO 8601 says you _should_ use U+2212 MINUS to express
| timestamps with negative timezone offsets. At least, I think it
| does; I'm going off the Wikipedia description:
| https://en.wikipedia.org/wiki/ISO_8601#Other_time_offset_spe...
| lifthrasiir wrote:
| In my understanding that is a misunderstanding. I previously
| commented about that [1], but in short: a combined hyphen-minus
| character should be used for any charset based on ISO/IEC 646,
| which includes Unicode.
|
| [1] https://news.ycombinator.com/item?id=37346702
| hgs3 wrote:
| Unicode conforming regular expression engines are supposed to
| support the \p or \P property syntax [1] so you should be able to
| match hyphen characters with \p{Hyphen} or \p{Dash}.
|
| [1] https://www.unicode.org/reports/tr18/#property_syntax
| account42 wrote:
| Very nice for Unicode to provide a solution to the problem
| Unicode created.
| samatman wrote:
| Unicode did not create the problem of many similar-looking
| dash-like characters with different meanings and widths.
|
| It documented it, at most.
| gknoy wrote:
| Thanks for linking this! I also learned that the `\p{}`
| syntax isn't supported in the Python `re` library, and they
| recommend the api-compatible `regex` library, which does have
| support for that.
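|
| (For example, with that third-party regex module installed:)
|
|     import regex  # pip install regex
|
|     for ch in "-\u2010\u2212\u2013":
|         print(f"U+{ord(ch):04X}", bool(regex.fullmatch(r"\p{Dash}", ch)))
|     # all four match: they carry the Dash property, including
|     # U+2212 MINUS SIGN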
| rzwitserloot wrote:
| Isn't "turns out there are _lots_ of look-alikes, often literally
| pixel-for-pixel identical in most fonts at all sizes, in the
| unicode tables and that might cause some confusion when parsing
| text" like.. lesson 101 for unicode?
|
| At any rate, I find the conclusion a bit hilarious: "Ah, the
| input text uses that symbol that very explicitly DOES NOT MEAN
| 'minus', it _ONLY_ means hyphen, and would be _the_ unicode
| solution if for whatever reason you want to render the notion:
| Hyphen followed by a _positive_ cash amount".. and.. I will
| completely mess up the whole point and just treat it as a minus
| sign after all.
|
| What, pray tell, is the point of having all those semantically
| different but visually identical things in unicode when folks
| don't even acknowledge that what they are doing is fundamentally
| at odds with the very text they are reading?
| HelloNurse wrote:
| There might be a social angle: the input-shitters are assumed
| to be right, and the IT peons have to understand user intent
| and make the system work. If the boss says so, hyphen means
| minus.
| jay-barronville wrote:
| I don't think fully relying on the Pd Unicode category is ideal
| though. For example, I don't think you'd want U+2E17 to be
| matched either.
|
| I think the best solution would be to match by some specific code
| points, and then throw an error when a strange code point is
| encountered.
|
| I think it's a mistake to try to handle every edge case in this
| particular case.
| red_admiral wrote:
| A safer way to approach any parsing task is to complain if you
| see a character you don't expect there. If there is a character
| in front of the dollar sign that is not whitespace, then
| something is going on and you need to take a look.
| numpad0 wrote:
| Wow. That's basically what I've heard of as the Kangxi
| radicals problem. From what I could gather from a 5-minute
| search, the mechanism is:
|
| PDFs don't use Unicode or ASCII code points but the Glyph IDs
| used by fonts, so all strings are converted to sequences of
| Glyph IDs. The original Unicode or ASCII text is dropped, or
| _can_ be linked and embedded for convenience. In many cases,
| a reverse conversion from ID to Unicode is silently done when
| text is copy-pasted or extracted from a PDF.
|
| That silent automatic reverse conversion tends to pick the
| numerically smallest Unicode code point assigned to the glyph
| (letter shape), and many fonts reuse close-enough glyphs for
| obscure Unicode characters like ancient Chinese dictionary
| header symbols and the dozen Unicode siblings of hyphens.
| Unicode also tends to have those esoteric symbols higher up
| in the table than commonly used ones.
|
| Therefore, through conversion into Glyph ID and back into
| Unicode, simple characters like `角` or `-`, whose glyphs
| tend to get reused to cover those technicalities, sometimes
| get converted into those technicalities at remote ends.
|
| 1: https://en.wikipedia.org/wiki/Kangxi_radical
|
| 2: use TL:
| https://espresso3389.hatenablog.com/entry/20090526/124332747...
|
| 3: use TL: https://github.com/trueroad/tr-NTTtech05
|
| 4: use TL: https://anti-rugby.blogspot.com/2020/08/Computer001.html
| cooolbear wrote:
| > One product of mine takes reports that come in as a table
| that's been exported to PDF
|
| Here's the first problem!
|
| I can't believe actual businesses think that a report is anything
| other than for human eyes to look at. Typesetting (and report
| generation) is for presentation, and otherwise data should be
| treated like data.
|
| I mean it's a different story if the product is like "we can help
| process your historical, improperly-formatted documents", but if
| it's from someone continually generating reports, somebody really
| should step in and make things more... computational.
| l72 wrote:
| I wrote a web scraper to scrape products from some Vinyl Record
| Distributors. It is amazing to me how careless (or clueless)
| people are with various unicode characters.
|
| I had huge amounts of rules for "unifying" unicode, so I could
| then run the result through various regular expressions. It
| wasn't just hyphens, but I'd run into all sorts of weird
| characters.
|
| It all worked, but was very brittle, and constantly had to be
| tweaked.
|
| In the end, I used a machine learning model, which I wrote about
| here[1]
|
| [1] https://blog.line72.net/2024/07/31/the-joys-of-parsing-using...
| TristanBall wrote:
| So, I guess it's only me who learned from the comments here that
| there was a difference between an em dash and an en dash? Or
| that they might be different from a hyphen or a minus?
|
| (In my defence, I don't work in any of the specialized areas
| where it matters, and was raised in a poor, ascii only, western
| household.)
|
| I will point out that spammers and scammers have been having a
| field day with this kind of character confusion for years now,
| and a lot of software still hasn't caught up to it.
|
| On the bright side, the very old school database I babysit for
| work can be convinced to output utf8, including emoji, many of
| which render quite well in a terminal, allowing me to create bar
| graphs of poo or love heart characters, which honestly makes it
| all worth it for me.
___________________________________________________________________
(page generated 2024-09-09 23:02 UTC)