[HN Gopher] Charset="WTF-8"
___________________________________________________________________
Charset="WTF-8"
Author : edent
Score : 127 points
Date : 2024-11-24 09:38 UTC (13 hours ago)
(HTM) web link (wtf-8.xn--stpie-k0a81a.com)
(TXT) w3m dump (wtf-8.xn--stpie-k0a81a.com)
| jtvjan wrote:
| A coworker once implemented a name validation regex that would
| reject his own name. It still mystifies me how much convincing it
| took to get him to make it less strict.
| croes wrote:
| Is name validation even possible?
| armada651 wrote:
| Yes, it is essential when you want to avoid doing business
| with customers who have invalid names.
| Diti wrote:
| What are "invalid names" in this context? Because,
| depending on the country the person was born in, a name can
| be literally anything, so I'm not sure what an invalid name
| looks like (unless you allow an `eval` of sorts).
| dgoldstein0 wrote:
| Obligatory xkcd https://xkcd.com/327/
| Muromec wrote:
| The non-joke answer for Europe is extended Latin, dashes,
| spaces and the apostrophe, separated into two (or three)
| distinct ordered fields. Just because it's written in a
| different script originally doesn't mean it will be printed
| only in that script on your ID in the country of residence or
| on a travel document issued at home. My name isn't written in
| Latin characters and it's fine. I know you can't even try
| to pronounce them, so I have it spelled out in the
| above-mentioned Latin script.
| ryandrake wrote:
| You joke, but when a customer wants to give your company
| their money, it is our duty as developers to _make sure
| their names are valid_. That is so business critical!
| xtiansimon wrote:
| In legitimate retail, take the money, has always been the
| motto.
|
| That said, recently I learned about monetary policy in
| North Korea and sanctions on the import of luxury goods.
|
| Why Nations Fail (2012) by Daron Acemoglu and James
| Robinson
|
| https://en.wikipedia.org/wiki/United_Nations_Security_Cou
| nci...
| Muromec wrote:
| It's not just business necessary, it's also mandatory to
| do right under GDPR
| jandrese wrote:
| What if your customer is the artist formerly known as
| Prince or even X AE A-12 Musk?
| chungy wrote:
| Prince: "Get over yourself and just use your given name."
| (Shockingly, his given name actually is Prince; I first
| thought it was only a stage name)
|
| Musk: Tell Elon to get over his narcissism enough to not
| use his children as his own vanity projects. This isn't
| just an Elon problem, many people treat children as
| vanity projects to fuel their own narcissism. That's not
| what children are for. Give him a proper name. (and then
| proceed to enter "X AE A-12" into your database, it's
| just text...)
| majkinetor wrote:
| Sure it is. Context matters. For example, in clone wars.
| poizan42 wrote:
| Yes, it's easy:
|
|     bool ValidateName(string name) => true;
|
| (With the caveat that a name might not be representable in
| Unicode, in which case I dunno. Use an image format?)
| arsome wrote:
| name.Length > 0
|
| is probably pretty safe.
| tomxor wrote:
| What if my name is
| chuckadams wrote:
| Slim Shady?
| pridkett wrote:
| That only works if you're concatenating the first and
| last name fields. Some people have no last name and thus
| would fail this validation if the system had fields for
| first and last name.
| cluckindan wrote:
| _some people have no name at all_
| exitb wrote:
| Any notable examples apart from young children and
| Michael Scott that one time?
| ndsipa_pomu wrote:
| I've been compiling a list of them:
| dvfjsdhgfv wrote:
| You seem to have forgotten quite a few, like
| Macha wrote:
| Honestly I wish we could just abolish first and last name
| fields and replace them with a single free text name
| field since there's so many edge cases where first and
| last is an oversimplification that leads to errors.
| Unfortunately we have to interact with external systems
| that themselves insist on first and last name fields, and
| pushing it to the user to decide which is part of what
| name is wrong less often than string.split, so we're
| forced to become part of the problem.
| caseyohara wrote:
| I did this in the product where I work. We operate
| globally so having separate first and last name fields
| was making less sense. So I merged them into a singular
| full name field.
|
| The first and only people to complain about that change
| were our product marketing team, because now they
| couldn't "personalize" emails like `Hi <firstname>,`. I
| had the hardest time convincing them that while the
| concept of first and last names is common in the West,
| it is not a universal concept.
|
| So as a compromise, we added a "Preferred Name" field
| where users can enter their first name or whatever name
| they prefer to be called. Still better than separate
| first and last name fields.
| poizan42 wrote:
| See point 40 and 32-36 on Falsehoods programmers believe
| about names[1]
|
| [1] https://www.kalzumeus.com/2010/06/17/falsehoods-
| programmers-...
| from-nibly wrote:
| I know that this is trying to be helpful but the snark in
| this list detracts from the problem.
| i80and wrote:
| Whether it's healthy or not, programmers tend to love
| snark, and that snark has kept this list circulating and
| hopefully educating for a long time to this very day
| rsynnott wrote:
| No, but it doesn't stop people trying.
| gmuslera wrote:
| You may not want Bobby Tables in your system.
| malfist wrote:
| If you're prohibiting valid letters to protect your
| database because you didn't parametrize your queries,
| you're solving the problem from the wrong end
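
A minimal sketch of the parametrized-query approach mentioned above, assuming Python's built-in sqlite3 module and an in-memory database. The table name and the helper `store_and_fetch` are hypothetical and exist only for illustration.

```python
import sqlite3


def store_and_fetch(name: str) -> str:
    """Insert a name through a bound parameter and read it back unchanged."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE students (name TEXT NOT NULL)")
        # The "?" placeholder passes the value separately from the SQL text,
        # so quotes and semicolons in the name are stored as plain data.
        conn.execute("INSERT INTO students (name) VALUES (?)", (name,))
        row = conn.execute("SELECT name FROM students").fetchone()
        return row[0]
    finally:
        conn.close()


# The name from the xkcd strip, kept exactly as typed.
BOBBY = "Robert'); DROP TABLE Students;--"
print(store_and_fetch(BOBBY) == BOBBY)
```

With bound parameters no character has to be banned from the name field to protect the database.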
| crazygringo wrote:
| If you just use the {Alphabetic} Unicode character class
| (100K code points), together with a space, hyphen, and maybe
| comma, that might get you close. It includes diacritics.
|
| I'm curious if anyone can think of any other non-alphabetic
| characters used in legal names around the world, in other
| scripts?
|
| I wondered about numbers, but the most famous example of that
| has been overturned:
|
| "Originally named X AE A-12, the child (whom they call X) had
| to have his name officially changed to X AE A-Xii in order to
| align with California laws regarding birth certificates."
|
| (Of course I'm not saying you _should_ do this. It is fun to
| wonder though.)
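
A rough sketch of the check described above, in Python. The standard `re` module has no `\p{Alphabetic}` class, so this uses `str.isalpha()` as an approximation: it accepts characters in the letter categories only, which is narrower than the Unicode Alphabetic property (it rejects the combining vowel signs some scripts need). The function name and the inclusion of the apostrophe are assumptions.

```python
import unicodedata

# Separators from the comment above (space, hyphen, comma). The apostrophe is
# an extra assumption, added because names like O'Brien need it.
ALLOWED_PUNCTUATION = set(" -,'")


def is_plausible_name(name: str) -> bool:
    """Return True if every character is a letter or an allowed separator."""
    # NFC first, so "e" + combining acute is judged as the single letter.
    normalized = unicodedata.normalize("NFC", name)
    if not normalized.strip():
        return False
    return all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in normalized)


print(is_plausible_name("Jean-Luc Picard"))
print(is_plausible_name("X \u00c6 A-12"))  # contains digits
```

As the comment itself says, this is more a thought experiment than a recommendation: any such filter will reject some real names.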
| nicoburns wrote:
| Apostrophe is common in surnames in parts of the world.
| poizan42 wrote:
| You forgot apostrophe as is common in Irish names like
| O'Brien.
| bloak wrote:
| Yes, though O'Brien is Ó Briain in Irish, according to
| Wikipedia. I think the apostrophe in Irish names was
| added by English speakers, perhaps by analogy with
| "o'clock", perhaps to avoid writing something that would
| look like an initial.
|
| There are also English names of Norman origin that
| contain an apostrophe, though the only example I can
| think of immediately is the fictional d'Urberville.
| gus_massa wrote:
| Comma or apostrophe, like in d'Alembert ?
|
| (And I have 3 in my keyboard, I'm not sure everyone is
| using the same one.)
| ahazred8ta wrote:
| Mrs. Keihanaikukauakahihuliheekahaunaele only had a
| string length problem, but there are people with a
| Hawaiian `okina in their names. U+02BB
| Seb-C wrote:
| > I'm curious if anyone can think of any other non-
| alphabetic characters used in legal names around the world,
| in other scripts?
|
| Latin characters are NOT allowed in official names for
| Japanese citizens. It must be written in Japanese
| characters only.
|
| For foreigners living in Japan it's quite frequent to end
| up in a situation where their official name in Latin does
| not pass the validation rules of many forms online. Issues
| like forbidden characters, or because it's too long since
| Japanese names (family name + first name) are typically
| only 4 characters long.
|
| Also, when you get a visa to Japan, you have to bend and
| deform the pronunciation of your name to make it fit into
| the (limited) Japanese syllabary.
|
| Funnily, they even had to register a whole new Unicode
| range at some point, because old administrative documents
| sometimes contain characters that were deprecated
| more than a century ago.
|
| https://ccjktype.fonts.adobe.com/2016/11/hentaigana.html
| crazygringo wrote:
| Very interesting about Japan!
|
| To be clear, I wasn't thinking about within a specific
| country though.
|
| More like, what is the set of all characters that are
| allowed in legal names across the world?
|
| You know, to eliminate things like emoji, mathematical
| symbols, and so forth.
| Seb-C wrote:
| Ah, I see.
|
| I don't know, but I would bet that the sum of all corner
| cases and exceptions in the world would make it pretty
| hard to confidently eliminate any "obvious" characters.
|
| From a technical standpoint, unicode emojis are probably
| safe to exclude, but on the other hand, some scripts like
| Chinese characters are fundamentally pictograms, which is
| semantically not so different than an emoji.
|
| Maybe after centuries of evolution we will end up with a
| legit universal language based on emojis, and people
| named with it.
| crazygringo wrote:
| Chinese characters are nothing like emoji. They are more
| akin to syllables. There is no semantic similarity to
| emoji at all, even if they were originally derived from
| pictorial representations.
|
| And they belong to the {Alphabetic} Unicode class.
|
| I'm mostly curious if Unicode character classes have
| already done all the hard work.
| GolDDranks wrote:
| What if one's name is not in alphabetic script? Let's say,
| "Ling Mu Liang Tai ".
| crazygringo wrote:
| That's part of {Alphabetic} in Unicode. It validates.
| shash wrote:
| There's this individual's name which involves a clock
| sound: Nǃxau ǂToma[1]
|
| [1]
| https://en.m.wikipedia.org/wiki/N%25C7%2583xau_%C7%82Toma
| crazygringo wrote:
| Click characters are part of {Alphabetic}!
|
| https://en.wikipedia.org/wiki/Click_consonant
|
| https://www.compart.com/en/unicode/category/Lo
|
| https://stackoverflow.com/a/4843363
| kens wrote:
| > There's this individual's name which involves a clock
| sound: Nǃxau ǂToma
|
| I was extremely puzzled until I realized you meant a
| click sound, not a clock sound. Adding to my confusion,
| the vintage IBM 1401 computer uses ǂ as a record mark
| character.
| golergka wrote:
| dvyd Smith (concatenated) will have an LTR control
| character in the middle
| crazygringo wrote:
| Oh that's interesting.
|
| Is that a thing? I've never known of anyone whose legal
| name used two alphabets that didn't have any overlap in
| letters at all -- two completely different scripts.
|
| Would a birth certificate allow that? Wouldn't you be
| expected to transliterate one of them?
| zarzavat wrote:
| Presumably there aren't any people with control characters in
| their name, for example.
| kijin wrote:
| Challenge accepted, I'll try to put a backspace and a null
| byte in my firstborn's name. Hope I don't get swatted for
| crashing the government servers.
| cobbzilla wrote:
| Watch as someone names themselves the bell character, "^G"
| (ASCII code 7) [1]
|
| When they meet people, they tell them their name is
| unpronounceable, it's the sound of a PC speaker from the
| late 20th century, but you can call them by their preferred
| nickname "beep".
|
| In paper and online forms they are probably forced to go by
| the name "BEL".
|
| [1] https://en.wikipedia.org/wiki/Bell_character
| emmelaich wrote:
| Or Derek <wood dropping on desk>
|
| https://www.youtube.com/watch?v=hNoS2BU6bbQ
| pavel_lishin wrote:
| I thought this was going to be a link to the Key & Peele
| sketch: https://youtu.be/gODZzSOelss?t=180
| eyelidlessness wrote:
| That sounds like a reasonable assumption, but probably not
| strictly correct.
| ValentinA23 wrote:
| khun smchaay
|
| This name, "khunsmchaay" (Khun Somchai, a common Thai
| name), appears normal but has a Zero Width Space (U+200B)
| between "khun" (Khun, a title like Mr./Ms.) and "smchaay"
| (Somchai, a given name).
|
| In scripts like Thai, Chinese, and Arabic, where words are
| written without spaces, invisible characters can be
| inserted to signal word boundaries or provide a hint to
| text processing systems.
| pwdisswordfishz wrote:
| But C0 and C1 control codes are out, probably.
| pwdisswordfishz wrote:
| Or unpaired surrogates. Or unassigned code points. Or
| fullwidth characters. Or "mathematical bold" characters.
| Though the latter two should be probably solved with NFKC
| normalization instead.
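
A small sketch of the two comments above, assuming Python's `unicodedata` module: reject control codes, surrogates and unassigned code points by general category, and fold fullwidth and "mathematical bold" letters with NFKC instead of rejecting them. The function name `clean_name` is hypothetical, and what counts as unassigned depends on the Unicode version shipped with the interpreter.

```python
import unicodedata
from typing import Optional

# General categories to reject: Cc = C0/C1 controls, Cs = surrogates,
# Cn = unassigned code points and noncharacters.
REJECTED_CATEGORIES = {"Cc", "Cs", "Cn"}


def clean_name(name: str) -> Optional[str]:
    """Return the NFKC-normalized name, or None if it has a rejected character."""
    if any(unicodedata.category(ch) in REJECTED_CATEGORIES for ch in name):
        return None
    # NFKC folds compatibility forms such as fullwidth and mathematical bold
    # letters into their ordinary equivalents.
    return unicodedata.normalize("NFKC", name)


print(clean_name("\uff21\uff4c\uff49\uff43\uff45"))  # fullwidth letters
print(clean_name("BEL\x07"))                         # contains the bell character
```

Format characters (category Cf) such as the Zero Width Space are deliberately left alone here, since the thread notes they can be legitimate in some scripts.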
| baruchel wrote:
| Mandatory reference: https://xkcd.com/327/
| nkrisc wrote:
| It is if you first provide a complete specification of a
| "name". Then you can validate if a name is compliant with
| your specification.
| GrantMoyer wrote:
| Valid names are those which terminate when run as Python
| programs.
| Muromec wrote:
| It's super easy actually. Name consists of three parts --
| Family Name, Given Name and Patronymic, spelled using
| Ukrainian Cyrillic. You can have a dash in the Family name
| and apostrophe is part of Cyrillic for these purposes, but
| no spaces in any of the three. If you are unfortunate enough to
| not use Cyrillic (of our variety) or Patronymics in the
| country of your origin (why didn't you stay there, anyway),
| we will fix it for you, mister Nkrisk. If you belong to
| certain ethnic groups who by their custom insist on not
| using Patronymics, you can have a free pass, but life will
| be difficult, as not everybody got the memo really. No, you
| can not use a Matronymic instead of a Patronymic, but give us
| another 30 years of not having a nuclear war with country
| name starting with "R" and ending in "full of putin slaves
| si iiia" and we might see to that.
|
| Unless of course the name is not used for official
| purposes, in which case you can get away with First-Last
| combination.
|
| It's really a non issue and the answer is jurisdiction
| bound. In most of Europe the extended Latin set is used in
| place of Cyrillic (because they don't know better), so my
| name is transliterated for the purposes of being in the
| uncivilized realms by my own government. No, I can't just
| use L and Ia as part of my name anywhere here.
| ValentinA23 wrote:
| Don't validate names, use transliteration to make them safe
| for postal services (or whatever). In SQL this is COLLATE, in
| the command line you can use uconv:
|
| >echo "'Lodz'" | uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
|
| >'Lodz'
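
For comparison, a rough Python sketch of the same idea without ICU. This is not what `uconv -x "Latin-ASCII"` does internally; it is a swapped-in technique that decomposes characters with NFKD and drops the combining marks. Letters with no decomposition (such as L with stroke) need an explicit table, and the table below is a small illustrative assumption, not a complete list. The function name is hypothetical.

```python
import unicodedata

# Letters that have no Unicode decomposition and must be mapped by hand.
# Illustrative subset only.
NO_DECOMPOSITION = str.maketrans({
    "\u0141": "L", "\u0142": "l",    # L with stroke
    "\u00d8": "O", "\u00f8": "o",    # O with stroke
    "\u0110": "D", "\u0111": "d",    # D with stroke
    "\u00c6": "AE", "\u00e6": "ae",  # ae ligature
    "\u00df": "ss",                  # sharp s
})


def fold_to_base_letters(text: str) -> str:
    """Strip diacritics: map special letters, decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.translate(NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(fold_to_base_letters("\u0141\u00f3d\u017a"))
```

The folding is lossy, so a system using it would presumably keep the original spelling and use the folded form only where ASCII is unavoidable.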
| notanote wrote:
| The name of the city has the L with stroke (pronounced as a
| W), so it's Łódź.
| poincaredisk wrote:
| And the transliteration in this case is so far from the
| original that it's barely recognisable for me (three out
| of four characters are different and as a native I
| perceive L as a fully separate character, not as a funny
| variation of L)
| Muromec wrote:
| The fact that it's pronounced as Vuch and not Lodzh still
| triggers me.
| pavel_lishin wrote:
| I just looked up the Russian wikipedia entry for it, and
| it's spelled "Lodz'", but it sounds like it's pronounced
| "Vudzh'", and this fact irritates the hell out of me.
|
| Why would it be transliterated with an L? And an O? And a
| z? None of this makes sense.
| Muromec wrote:
| It's a general pattern of what russia does to names of
| places and people, which is aggressively imposing their
| own cultural paradigm (which follows the more general
| general pattern). You can look up your civil code
| provisions around names and ask a question or two of what
| historical problem they attempt to solve.
| notanote wrote:
| L with stroke is the english name for it according to
| wikipedia by the way, not my choice of naming. The
| transliterated version is not great, considering how far
| removed from the proper pronunciation it is, but I'm sort
| of used to it. The almost correct one above was jarring
| enough that I wanted to point it out.
| poincaredisk wrote:
| If I ever make my own customer facing product with
| registration, I'm rejecting names with 'v', 'x' and 'q'.
| After all, these characters don't exist in my language, and
| foreign people can always transliterate them to 'w', 'ks'
| or 'ku' if they have names with weird characters.
| ajsnigrutin wrote:
| Yeah, that'll work great..
|
| https://en.wikipedia.org/wiki/%C4%8Celje
|
| echo "Čelje" | uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
|
| > "Celje"
|
| https://en.wikipedia.org/wiki/Celje
|
| (i mean... we do have postal numbers just for problems like
| this, but both Štefan and Stefan are not-so-uncommon male
| names over here, so are Jožef and Jozef, etc.)
| Muromec wrote:
| Most places where telling Štefan from Stefan is a problem
| use postal numbers for people too, or/and ask for your
| DOB.
| ajsnigrutin wrote:
| I don't have a problem differentiating Štefan from
| Stefan, 'š' and 's' sound pretty different to everyone
| around here. But if someone runs that script above and
| transliterates "š" to "s" it can cause confusion.
|
| And no, we don't use "postal numbers for humans".
| perching_aix wrote:
| In certain cultures yes. Where I live, you can only select
| from a central, though frequently updated, list of names when
| naming your child. So theoretically only (given) names that
| are on that list can occur.
|
| Family names are not part of this, but maybe that exists too
| elsewhere. I don't know how people who were given their names
| before this list was established are handled, however.
|
| An alternative method, which is again culture dependent, is
| to use virtual governmental IDs for this purpose. Whether
| this is viable in practice I don't know, never implemented
| such a thing. But just on the surface, should be.
| bjackman wrote:
| I still don't see how any system in the real world can
| safely assume its users only have names from that list.
|
| Even if you try to imagine a system for a hospital to
| register newly born babies... What happens if a pregnant
| tourist is visiting?
| perching_aix wrote:
| With plenty of attitude of course :)
|
| I've only ever interacted with freeform textfields when
| inputting my name, so most regular systems clearly don't
| dare to attempt this.
|
| But if somebody was dead set on only serving local
| customers or having only local personnel, I can
| definitely imagine someone being brave(?) enough.
| Y_Y wrote:
| For example in Iceland you don't have to name the baby
| immediately, and the registration times are different for
| foreign parents.
| https://www.skra.is/english/people/registration-of-children/...
|
| Of course then you may fall foul of classic falsehood 40:
| People have names.
| throw310822 wrote:
| I know multiple developers who would just say "well it's their
| fault, they have to change name then".
| MrJohz wrote:
| I worked with an office of Germans who insisted that ASCII
| was sufficient. The German language uses letters that cannot
| be represented in ASCII.
|
| In fairness, they mostly wanted stuff to be in English, and
| when necessary, to transliterate German characters into their
| English counterparts (in German there is a standardised way
| of doing this), so I can understand why they didn't see it
| was necessary. I just never understood why I, as the non-
| German, was forever the one trying to convince them that
| Germans would probably prefer to use their software in
| German...
| sandreas wrote:
| You should have asked how they would encode the German
| currency sign (€ for euro) in ASCII or its German
| counterpart latin1/iso-8859-1...
|
| It's not possible. However, I bet they would argue for using
| iso-8859-15 (latin9 / latin0) with the international
| currency sign ($?) instead, or insist that char 128 of
| latin1 is almost always meant as €, so just ignore the
| standard in these cases and use a new font.
|
| This would only fail in older printers and who is still
| printing stuff these days? Nobody right?
|
| Using real utf-8 is just too complex... All these emojis
| are nuts
| richardwhiuk wrote:
| EUR is the common answer.
| asddubs wrote:
| or just double all the numbers and use DM
| Y_Y wrote:
| Weirdly the old Deutsche Mark doesn't seem to have its own
| code point in the block starting at U+20A0, whereas the Spanish
| equivalent (Peseta, Pts, not just Pt) does.
| bee_rider wrote:
| I've run into a similar-ish situation working with East-
| Asian students and East-Asian faculty. Me, an American who
| wants to be clear and make policies easy for everybody to
| understand: worried about name ordering a bit (Do we want
| to ask for their last name or their family name in this
| field, what's the stupid learning management system want,
| etc etc). Chinese co-worker: we can just ask them for their
| last names, everybody knows what Americans mean when they
| ask for that, and all the students are used to dealing with
| this.
|
| Hah, fair enough. I think it was an abstract question to
| me, so I was looking for the technically correct answer.
| Practical question for him, so he gave the practical
| answer.
| poizan42 wrote:
| I have an 'æ' in my middle name (formally secondary first name
| because history reasons). Usually I just don't use it, but it's
| always funny when a payment form instructs me to write my full
| name exactly as written on my credit card, and then goes on to
| tell me my name is invalid.
| pzduniak wrote:
| I live in Łódź.
|
| Love receiving packages addressed to ??d? :)
| troymc wrote:
| I wonder how many of those packages end up in Vada, Italy. Or
| Cody, Wyoming. Or Buda, Texas...
| jplrssn wrote:
| I imagine the "Poland" part of the address would narrow it
| down somewhat.
| mkotowski wrote:
| I got curious if I can get data to answer that, and it
| seems so.
|
| Based on xlsx from [0], we got the following ??d?
| localities in Poland:
|
| 1 x Bady, 1 x Brda, 5 x Buda, 120 x Budy, 4 x Dudy, 1 x
| Dydy, 1 x Gady, 1 x Judy, 1 x Kady, 1 x Kadz, 1 x Lada, 1
| x Lady, 4 x Lady, 2 x Lady, 1 x Leda, 1 x Lody, 4 x Lodz,
| 1 x Nida, 1 x Reda, 1 x Redy, 1 x Redz, 74 x Ruda, 8 x
| Rudy, 12 x Sady, 2 x Zady, 2 x Zydy
|
| Certainly quite a lot to search for a lost package.
|
| [0]: https://dane.gov.pl/pl/dataset/188,wykaz-urzedowych-
| nazw-mie...
| jplrssn wrote:
| Interesting! However, assuming that ASCII characters are
| always rendered correctly and never as "?", it seems like
| the only solution for "??d?" would be one of the four
| Lodzs?
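
The reasoning in this sub-thread can be sketched in Python: treat each "?" in the mangled label as one lost character and filter a list of locality names. With `ascii_is_reliable=True` a "?" may only stand for a non-ASCII character, which is the assumption made in the comment above. The function names and the short sample list are hypothetical; the real register is in the linked dataset.

```python
import unicodedata
from typing import Iterable, List


def _fits(mangled_ch: str, real_ch: str, ascii_is_reliable: bool) -> bool:
    """Check one position of the mangled label against the real name."""
    if mangled_ch == "?":
        # "?" marks a lost character. If ASCII always survives, the lost
        # character cannot have been an ASCII one.
        return not (ascii_is_reliable and real_ch.isascii())
    return mangled_ch == real_ch


def candidates(mangled: str, names: Iterable[str],
               ascii_is_reliable: bool = True) -> List[str]:
    """Return the names that could have been printed as `mangled`."""
    matches = []
    for name in names:
        # Compare in NFC so each accented letter counts as one character.
        composed = unicodedata.normalize("NFC", name)
        if len(composed) == len(mangled) and all(
            _fits(m, c, ascii_is_reliable) for m, c in zip(mangled, composed)
        ):
            matches.append(name)
    return matches


SAMPLE = ["Buda", "Ruda", "Lody", "\u0141\u00f3d\u017a", "Reda", "Warszawa"]
print(candidates("??d?", SAMPLE))
print(candidates("??d?", SAMPLE, ascii_is_reliable=False))
```

On this sample the strict mode leaves only the one name with three non-ASCII letters, while the loose mode keeps every four-letter name with a "d" in third position.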
| schubart wrote:
| Sounds like someone is getting ready for Advent of Code!
| poincaredisk wrote:
| Interestingly, Lady, Lady and Lady will end up the same
| after the usual transliteration.
| yreg wrote:
| Experienced postal workers most probably know well that
| ??d? represents a municipality with three non-ascii
| characters.
| ygra wrote:
| And the postal code.
| jowea wrote:
| And the packages get there? Don't you put "Łódź (Lodz)" in
| the city field? Or the postal code takes care of the issue?
| pzduniak wrote:
| Yep, postal code does all the work.
| epcoa wrote:
| As you may be aware, the name field for credit card
| transactions is rarely verified (perhaps limited to North
| America, not sure).
|
| Often I'll create a virtual credit card number and use a fake
| name, and virtually never have had a transaction declined. Even
| if they are more aggressively asking for a street address,
| giving just the house number often works. This isn't a deep
| cover but gives a little bit of anonymity for marketing.
| seba_dos1 wrote:
| It's for when things go wrong. Same as with wire transfers.
| Nobody checks it unless there's a dispute.
| epcoa wrote:
| The thing is though that payment networks do in fact do
| instant verification and it is interesting what gets
| verified and when. At gas stations it is very common to ask
| for a zip code (again US), and this is verified immediately
| to allow the transaction to proceed. I've found that when a
| street address is asked for there is some verification and
| often a match on the house number is sufficient. Zip codes
| are verified almost always, names pretty much never. This
| likely has something to do with complexities behind
| "authorized users".
| cruffle_duffle wrote:
| There are so many ways to write your address that I always
| assume it's just the house number as well. In fact I
| vaguely remember that being a specific field when
| interacting with some old payment gateway.
| jjmarr wrote:
| At American gas stations, if you have a Canadian credit
| card, you type in 00000 because Canadians don't have ZIP
| codes.
| poizan42 wrote:
| Are we sure they don't actually validate against a more
| generic postal code field? Then again some countries have
| letters in their postcodes (the UK comes to mind), so
| that might be a problem anyways.
| blahedo wrote:
| Funny thing about house numbers: they have their _own_
| validation problems. For a while I lived in a building
| whose house number was of the form 123½ and that was an
| ongoing source of problems. If it just truncated the ½
| that was basically fine (the house at 123 didn't have
| apartment numbers and the postal workers would deliver it
| correctly) but validating in online forms (twenty-ish
| years ago) was a challenge. If they ran any validation at
| all they'd reject the ½, but it was a crapshoot
| which of "123-1/2" or "123 1/2" would work, or sometimes
| neither one. The USPS's official recommendation at the
| time was to enter it as "123 1 2 N Streetname" which
| _usually_ validated but looked so odd it was my last
| choice (and some validators rejected the "three numbers"
| format too).
|
| I don't think I ever tried "123.5", actually.
| mkotowski wrote:
| Still much better when it fails at the first step. I once got
| myself in a bit of a struggle with Windows 10 by using "l" as
| part of Windows username. Amusingly/irritatingly large number
| of applications, even some of Microsoft's own ones, could not
| cope with that.
| ahazred8ta wrote:
| The government of Ireland has many IT systems that cannot
| handle accented letters. #headdesk
| arp242 wrote:
| I worked for an Irish company that didn't support ' in names.
| Did get fixed eventually, but sigh...
| lxgr wrote:
| Did you actually get banks to print that on your credit card?
|
| I'm impressed, most I know struggle with any kind of non-[A-Z]!
| Muromec wrote:
| "Write your name the way it's spelled in your government issued
| id" is my favorite. I have three ids issued by two governments
| and no two match letter by letter.
| card_zero wrote:
| Pfft, "Dein Name ist ungültig" (your name is invalid). Let's get
| straight to the point, it's the user's fault for having a bad
| name, user needs to fix this.
| cabirum wrote:
| How do I allow "stępień" while detecting Zalgo-isms?
| egypturnash wrote:
| Zalgo is largely the result of abusing combining modifiers.
| Declare that any string with more than _n_ combining modifiers
| in a row is invalid.
|
| n=1 is probably a reasonable falsehood to believe about names
| until someone points out that language X regularly has multiple
| combining modifiers in a row, at which point you can bump up N
| to somewhere around the maximum number of combining modifiers
| language X is likely to have, add a special case to say "this
| is probably language X so we don't look for Zalgos", or just
| give up and put some Zalgo in your test corpus, start looking
| for places where it breaks things, and fix whatever breaks in a
| way that isn't funny.
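
A minimal sketch of the "no more than n combining modifiers in a row" rule, assuming Python's `unicodedata` module. The text is normalized to NFC first so that accents with a precomposed form do not count; the function names and the default limit of 2 are assumptions, not an established threshold.

```python
import unicodedata


def max_combining_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks, after NFC."""
    longest = current = 0
    for ch in unicodedata.normalize("NFC", text):
        # Mn (nonspacing) and Me (enclosing) marks are what Zalgo text stacks.
        if unicodedata.category(ch) in ("Mn", "Me"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_like_zalgo(text: str, limit: int = 2) -> bool:
    """True if any run of combining marks is longer than `limit`."""
    return max_combining_run(text) > limit


print(looks_like_zalgo("st\u0119pie\u0144"))
print(looks_like_zalgo("Z\u0301\u0302\u0303\u0304\u0305"))
```

Spacing marks (category Mc) are not counted, on the assumption that they are ordinary parts of several scripts rather than decoration.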
| zvr wrote:
| I can point out that Greek needs n=2: for accent and
| breathing.
| ahazred8ta wrote:
| N=2 is common in Việt Nam. (vowel sound + tonal pitch)
| anttihaapala wrote:
| Yet Vietnamese can be written in Unicode without any
| combining characters whatsoever - in NFC normalization each
| character is one code point - just like the U+1EC7 LATIN
| SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your
| example.
| zootboy wrote:
| For the unaware (including myself):
| https://en.wikipedia.org/wiki/Zalgo_text
|
| If you really think you need to programmatically detect and
| reject these (I'm dubious), there is probably a reasonable
| limit on the number of diacritics per character.
|
| https://stackoverflow.com/a/11983435
| KPGv2 wrote:
| I could answer your question better if I knew why you need to
| detect Zalgo-isms.
| seba_dos1 wrote:
| There's nothing special about "Stępień", it has no combining
| characters, just the usual diacritics that have their own
| codepoints in the Basic Multilingual Plane (U+0119 and U+0144). I
| bet there are some names out there that would make it harder,
| but this isn't one.
| dpassens wrote:
| Why do you need to detect Zalgo-isms and why is it so important
| that you want to force people to misspell their names?
| tobyhinloopen wrote:
| We have a whitelist of allowed characters, which is a pretty
| big list.
|
| I think we based it on Lodash' deburr source code. If deburr's
| output is a-z and some common symbols, it passes (and we store
| the original value)
|
| https://www.geeksforgeeks.org/lodash-_-deburr-method/
| Diggsey wrote:
| I thought this was https://simonsapin.github.io/wtf-8/
| webstrand wrote:
| Yeah, this is just issues caused by ascii
| RadiozRadioz wrote:
| I've got a good feel now for which forms will accept my name and
| which won't, though mostly I default to an ASCII version for
| safety. Similarly, I've found a way to mangle my address to fit a
| US house/state/city/zip format.
|
| I don't feel unwelcome, I empathize with the developers. I'd
| certainly hate to figure out address entry for all countries. At
| least the US format is consistent across websites and I can have
| a high degree of confidence that it'll work in the software, and
| my local postal service know what to do because they see it all
| the time.
| Arch485 wrote:
| You can grab JSON data of all ISO recognized countries and
| their address formats on GitHub (apologies, I forget the repo
| name. IIRC there is more than one).
|
| I don't know if it's 100% accurate, but it's not very hard to
| implement it as part of an address entry form. I think the main
| issue is that most developers don't know it exists.
| saurik wrote:
| At the end of the day, a postal address is printed to an
| envelope or package as a single block of text and then read
| back and parsed somehow by the people delivering the package
| (usually by a machine most of the way, but even these days more
| by humans as the package gets closer to the destination). This
| means that, in a very real sense, the "correct" way to enter an
| address is into a single giant multi-line text box with the
| implication that the user must provide whatever is required to
| be printed onto the mailing label such that the package will
| successfully be delivered.
|
| Really, then, the reasons why we bother trying to break out an
| address into multiple parts is not really related to the need
| for an address at all: it is because we 1) might not trust the
| user to provide for us everything required to make the address
| valid (assuming the country or even state, or giving us only a
| street address with no city or postal code... both mistakes
| that are likely extremely common without a multi-field form),
| or 2) need to know some subset of the address ourselves and do
| not trust ourselves to parse back the fuzzy address the same
| way as the postal service might, either for taxes or to help
| establish shipping rates.
|
| FWIW, I'd venture to say that #2 is sufficiently common -- as
| if you are even needing a street address for shipping you are
| going to need to be careful about sales taxes and VAT,
| increasingly often even if you aren't located in the state or
| even country to which the shipment will be made -- that it
| almost becomes nonsensical to support accepting an address for
| a location where you aren't already sure of the format
| convention ahead of time (as that just leads you to only later
| realizing you failed to collect a tax, will be charged a
| fortune to ship there, or even that it simply isn't possible to
| deliver anything to that country)... and like, if you don't
| intend to ship anything, you actually do not need the full
| address anyway (credit cards, as an obvious example, don't need
| or use the full address).
| KPGv2 wrote:
| It seems ridiculous to apply form validation to a name, given the
| complexity of charsets involved. I don't even validate email
| addresses. I remember
| [this](https://www.netmeister.org/blog/email.html) wonderful
| explainer of why your email validation regex is wrong.
| imrejonk wrote:
| A system not supporting non-latin characters in personal names is
| pitiful, but a system telling the user that they have an invalid
| name is outright insulting.
| notanote wrote:
| That's the best one of the lot. "Dein Name ist ungültig", "Your
| name is invalid", written with the informal word for "your".
| rossdavidh wrote:
| They're trying to say that you and the server are very close
| friends, you see? No, no, I get this is not correct, just a
| joke...
| ginko wrote:
| Under GDPR you have the legal right for your name to be stored
| and processed with the correct spelling in the EU.
|
| https://gdprhub.eu/index.php?title=Court_of_Appeal_of_Brusse...
| xigoi wrote:
| This seems to only apply to banks.
| pornel wrote:
| I wouldn't be surprised if that created kafkaesque problems
| with other institutions that require name to match the bank
| account _exactly_, and break/reject non-ASCII at the same
| time.
| robin_reala wrote:
| I know an Åsa who became variously Åsa, Aasa and Asa after
| moving to a non-Scandinavian country. That took a while to
| untangle, and caused some of the problems you describe.
| postepowanieadm wrote:
| No, anywhere where your name is used.
| robin_reala wrote:
| It's a general right to have incorrect personal data relating
| to you rectified by the data processor.
| Etheryte wrote:
| This does not only apply to banks. The specific court case
| was brought against a bank, but the law as is applies to any
| and everyone who processes your personal data.
| stop_nazi wrote:
| grzegorz brzeczyszczykiewicz
| dvh wrote:
| Looks ok in my language: Gregor Bzenciscikievic
| postepowanieadm wrote:
| You miss "e"!
| dvh wrote:
| I don't think I did. I watched the video and this is the
| phonetic transcription. I hear b zh e n ch ...
| Hackbraten wrote:
| Situations like these regularly make me feel ashamed about being
| a software developer.
| jccalhoun wrote:
| My first name is hyphenated. I still find forms that reject it.
| My favorite was one that said "invalid first name."
| Pesthuf wrote:
| I totally get that companies are probably more successful using
| simple validation rules, that work for the vast majority of names
| rather than just accepting everything just so that some person
| with no name or someone whose name cannot possibly be expressed
| or at least transliterated to Unicode can use their services.
|
| But that person's name has no business failing validation. They
| fucked up.
| surfingdino wrote:
| I lost count of the projects where this was an issue. US and
| Western European-born devs are oblivious to this problem and it
| ends up catching them over and over again.
| ACS_Solver wrote:
| Yeah, it's amazing. My language has a Latin-based alphabet but
| can't be represented with ISO 8859-1 (aka the Latin-1 charset)
| so I used to take it for granted that most software will not
| support inputs in the language... 25 years ago. But Windows XP
| shipped with a good selection of input methods and used UTF-16,
| dramatically improving things, so it's amazing to still see new
| software created where this is somehow a problem.
|
| Except that now there's no good excuse. Things like the name in
| the linked article would just work out of the box if it weren't
| for developers actually taking the time to break them by
| implementing unnecessary and incorrect validation.
|
| I can think of very few situations where validation of names
| is actually warranted. One that comes to mind is when you need
| people's ICAO 9303 compliant names, such as on passports or
| airline systems. If you need to make sure you're getting the
| name the person has in their passport's MRZ, then yes,
| rejecting non-ASCII characters is correct, but most systems
| don't need to do that.
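
A minimal sketch of the narrow case described above, assuming the only requirement is that a name fits the machine-readable zone alphabet (uppercase A-Z, with "<" as filler). The function name `fits_mrz_alphabet` is hypothetical, and the sketch deliberately does no transliteration.

```python
import re

# MRZ name fields use only uppercase A-Z plus "<" as a filler/separator.
_MRZ_NAME = re.compile(r"[A-Z<]+")

def fits_mrz_alphabet(name: str) -> bool:
    """Return True if `name` uses only characters allowed in an MRZ name field."""
    return _MRZ_NAME.fullmatch(name) is not None

print(fits_mrz_alphabet("STEPIEN"))   # MRZ-style spelling
print(fits_mrz_alphabet("Stępień"))   # the spelling the person actually uses
```

A check like this only makes sense on a field that explicitly asks for the passport spelling; applied to a general "name" field it reproduces the problem the article complains about.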
| xyst wrote:
| Software has been gaslighting generations of people around the
| world.
|
| Side note: not a bad way to skirt surveillance though.
|
| A name like "stępień" will without a doubt have many ambiguous
| spellings across different intelligence gathering systems
| (RUMINT, OSINT, ...). Americans will probably spell it as
| "Stefen" or "Steven" or "Stephen", especially once communicated
| over phone.
| ljouhet wrote:
| Yes, all these forms should handle existing names...
|
| but the author's own website doesn't (url: xn--stpie-k0a81a.com,
| bottom of the page: "(c) 2024 ę ń. All rights reserved.")
| Etheryte wrote:
| I think the bottom of the page is you missing the joke. It's
| showing only the name letters that get rejected everywhere
| else. Similarly for the URL, the URL renders his name correctly
| when you browse to it in a modern browser. What you've copied
| is the canonical fallback for unicode.
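
To illustrate the point about the URL: the "xn--" form is the ASCII (Punycode) spelling of an internationalized hostname. A small sketch using Python's built-in "idna" codec, which is one way (not the only one) to convert between the two spellings:

```python
# The ASCII form quoted above is the Punycode ("xn--") spelling of the hostname.
ascii_host = "xn--stpie-k0a81a.com"

# Python's built-in "idna" codec converts between the two spellings.
unicode_host = ascii_host.encode("ascii").decode("idna")
print(unicode_host)

# Encoding the Unicode form yields the ASCII form again.
round_trip = unicode_host.encode("idna").decode("ascii")
print(round_trip == ascii_host)
```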
| powersnail wrote:
| As someone who really thinks the name field should just be one field
| with any printable unicode characters, I do wonder what the hell
| would I need to do if I take customer names in this form, and
| then my system has to interact with some other service that
| requires first/last name split, and/or [a-zA-Z] validation, like
| a bank or postal service.
|
| Automatic transliteration seems to be very dangerous (wrong name
| on bank accounts, for instance), and not always feasible (some
| unicode characters have more than one way of being
| transliterated).
|
| Should we apologize to the user, and just ask the user twice,
| once correctly, and once for the bad computer systems? This seems
| to be the only approach that both respects their spelling, and at
| the same time not creating potential conflict with other systems.
| Muromec wrote:
| Okay, I have a non-ASCII (non Latin even) name, so I can tell.
| You just ask explicitly how my name is spelled in a bank system
| or my government id. Please don't try transliteration, unless
| you know exact rules the _other_ system suggests to
| transliterate my name from the one cultural context into
| another and then still make it a suggestion and make it clear
| for which purpose it will be used (and then only use it for
| that purpose).
|
| And please please please, don't try to be smart and detect the
| cultural context from the character set before automatically
| translating it to another character set. It will go wrong and
| you will not notice for a long time, but people will make mean
| passive aggressive screenshots of your product too.
|
| My bank for example knows my legal name in Cyrillic, but will
| not print it on a card, so they make best-effort attempt to
| transliterate it to ASCII, but make it editable field and will
| ask me to confirm this is how I want it to be on a card.
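
The "editable best-effort suggestion" flow described above could be sketched roughly as follows. Everything here is an assumption for illustration: `suggest_ascii`, `CardName` and `confirm_card_name` are hypothetical names, and stripping combining marks is only one crude way to produce a first guess. It does nothing for Cyrillic or other non-Latin scripts, which is exactly why the user has the final say.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

def suggest_ascii(name: str) -> str:
    """Best-effort suggestion: decompose, then drop combining marks.

    Only a starting point for the user to edit. Letters without a
    decomposition (and whole non-Latin scripts) pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

@dataclass
class CardName:
    original: str   # what the user typed, stored untouched
    on_card: str    # what the user confirmed for the card

def confirm_card_name(original: str, user_edit: Optional[str] = None) -> CardName:
    # The user's own spelling always wins over the suggestion.
    return CardName(original=original, on_card=user_edit or suggest_ascii(original))

print(confirm_card_name("Stępień"))
print(confirm_card_name("Stępień", user_edit="Stempien"))
```

Storing both values separately means the card spelling is never used to overwrite, or later "verify", the original.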
| matthewbauer wrote:
| You can just show the user the transliteration & have them
| confirm it makes sense. Always store the original version since
| you can't reverse the process. But you can compare the
| transliterated version to make sure it matches.
|
| Debit cards are a pretty common example of this. I believe you can
| only have ASCII in the cardholder name field.
| Muromec wrote:
| >But you can compare the transliterated version to make sure
| it matches
|
| No you can't.
|
| Add: Okay, you need to know why. I'm right here a living
| breathing person with a government id that has the same name
| scribed in two scripts side by side.
|
| There is an algorithm (blessed by the same government that
| issued said id) which defines how to transliterate names from
| one to another, published on the parliament web site and
| implemented in all the places that are involved in the id
| issuing business.
|
| The algorithm will however not produce the outcome you will
| see on my id, because me, living breathing person who has a
| name asked nicely to spell it the way I like. The next time I
| visit the id issuing place, I could forget to ask nicely and
| then I will have two valid ids (no, the old one will not be
| marked as void!) with three names that don't exactly match.
| It's all perfectly fine, because name as a legal concept is
| defined in the character set you probably can't read anyway.
|
| Please, don't try to be smart with names.
| wruza wrote:
| I'll say it again: this is the consequence of Unicode trying to
| be a mix of html and docx, instead of a charset. It went too far
| for an average Joe DevGuy to understand how to deal with it, so
| he just selects a subset he can handle and bans everything else.
| HN does that too - special symbols simply get removed.
|
| Unicode screwed itself up completely. We wanted a common charset
| for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And
| we got it, for a while. Shortly after it focused on becoming a
| complex file format with colorful icons and invisible symbols,
| which is not manageable without cutting out all that bs by force.
| Muromec wrote:
| >so he just selects a subset he can handle and bans everything
| else.
|
| Yes? And the problem is?
| throwaway290 wrote:
| The next guy with a different subset? :)
| Muromec wrote:
| The subset is mostly defined by the jurisdiction you
| operate in, which usually defines a process to map names
| from one subset to another and is also in the business of
| keeping the log of said operation. The problem is not
| operating in a subset, but defining it wrong and not being
| aware there are multiple of those.
|
| If different parts of your system operate in different
| jurisdictions (or interface with other systems that do),
| you have to pick multiple subsets and ask user to provide
| input for each of them.
|
| You just can't put anything other than ASCII into either
| payment card or PNR and the rules of minimal length will
| differ for the two and you can't put ASCII into the
| government database which explicitly rejects all of ASCII
| letters.
| n2d4 wrote:
| > and invisible symbols
|
| Invisible symbols were in Unicode before Unicode was even a
| thing (ASCII already has a few). I also don't think emojis are
| the reason why devs add checks like in the OP, it's much more
| likely that they just don't want to deal with character
| encoding hell.
|
| As much as devs like to hate on emojis, they're widely adopted
| in the real world. Emojis are the closest thing we have to a
| universal language. Having them in the character encoding
| standard ensures that they are _really_ universal, and
| supported by every platform; a loss for everyone who's trying
| to count the number of glyphs in a string, but a win for
| everyone else.
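
As a small illustration of why "counting the glyphs in a string" is awkward: one visible emoji can be several code points, and the count differs again per encoding. The sketch below uses only built-in string methods.

```python
s = "\U0001F44D\U0001F3FD"  # thumbs up + skin-tone modifier: one visible emoji

code_points = len(s)                              # Python counts code points
utf8_bytes = len(s.encode("utf-8"))               # bytes on the wire
utf16_units = len(s.encode("utf-16-le")) // 2     # 16-bit units, as in UTF-16 APIs

print(code_points, utf8_bytes, utf16_units)  # prints: 2 8 4
```

None of the three numbers is 1, which is what a user would say the length is.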
| meew0 wrote:
| The "invisible symbols" are necessary to correctly represent
| human language. For instance, one of the most infamous Unicode
| control characters -- the right-to-left override -- is required
| to correctly encode mixed Latin and Hebrew text [1], which are
| both scripts that you mentioned. Besides, ASCII has control
| characters as well.
|
| The "colorful icons" are not part of Unicode. Emoji are just
| characters like any other. There is a convention that
| applications should display them as little coloured images, but
| this convention has evolved on its own.
|
| If you say that Unicode is too expansive, you would have to
| make a decision to exclude certain types of human communication
| from being encodable. In my opinion, including everything
| without discrimination is much preferable here.
|
| [1]: https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_...
| n2d4 wrote:
| Granted, technically speaking emojis are not part of the
| "Unicode Standard", but they are standardized by the Unicode
| Consortium and constitute "Unicode Technical Standard #51":
| https://www.unicode.org/reports/tr51/
| Y_Y wrote:
| I'm happy to discriminate against those damn ancient
| Sumerians and anyone still using goddamn Linear B.
| bawolff wrote:
| > one of the most infamous Unicode control characters -- the
| right-to-left override
|
| You are linking to an RLM not an RLO. Those are different
| characters. RLO is generally not needed and more special
| purpose. RLM causes much less problems than RLO.
|
| Really though, i feel like the newer "first strong isolate"
| character is much better designed and easier to understand
| than most of the other rtl characters.
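
The three characters being distinguished here can be looked up in the Unicode character database. A short sketch with Python's `unicodedata` module (the `controls` dictionary is just an illustrative name):

```python
import unicodedata

# RLM, RLO and FSI are three different invisible formatting characters.
controls = {}
for ch in ("\u200f", "\u202e", "\u2068"):
    controls[unicodedata.name(ch)] = unicodedata.bidirectional(ch)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

The mark behaves like an invisible strong right-to-left letter (bidi class "R"), while the override and the isolate have their own dedicated bidi classes.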
| virexene wrote:
| in what way is unicode similar to html, docx, or a file format?
| the only features I can think of that are even remotely similar
| to what you're describing are emoji modifiers.
|
| and no, this webpage is not result of "carefully cutting out
| the complicated stuff from Unicode". i'm pretty sure it's just
| the result of not supporting Unicode in any meaningful way.
| asddubs wrote:
| >We wanted a common charset for things like latin, extlatin,
| cjk, cyrillic, hebrew, etc. And we got it, for a while.
|
| we didn't even get that because slightly different looking
| characters from japanese and chinese (and other languages) got
| merged to be the same character in unicode due to having the
| same origin, meaning you have to use a font based on the
| language context for it to display correctly.
| tadfisher wrote:
| They are the same character, though. They do not use the same
| _glyph_ in different language contexts, but Unicode is a
| character encoding, not a font standard.
| throwaway290 wrote:
| I bet the complex file format thing probably started at CJK.
| They wanted to compose Hangul and later someone had a bright
| idea to do the same to change the look of emojis.
|
| Don't worry, AI is the new hotness. All they need is unpack
| prompts into arbitrary images and finally unicode is truly
| unicode, all our problems will be solved forever
| kristopolous wrote:
| There's no argument here.
|
| We could say it's only for script and alphabets, ok. It
| includes many undeciphered writing systems from antiquity with
| only a small handful of extant samples.
|
| Should we keep that, very likely to never be used character
| set, but exclude the extremely popular emojis?
|
| Exclude both? Why? Aren't computers capable enough?
|
| I used to be on the anti emoji bandwagon but really, it's all
| indefensible. Unicode is characters of communication at an
| extremely inclusive level.
|
| I'm sure some day it will also have primitive shapes and you
| can construct your own alphabet using them + directional
| modifiers akin to a generalizable Hangul in effect becoming
| some kind of wacky version of svg that people will abuse it in
| an ASCII art renaissance.
|
| So be it. Sounds great.
| riwsky wrote:
| Like how phonetic alphabets save space compared to ideograms
| by just "write the word how it sounds", the little SVG-icode
| would just "write the letter how it's drawn"
| simonh wrote:
| No, no, no, no, no... So then we'd get 'the same' character
| with potentially infinite different encodings. Lovely.
|
| Unicode is a coding system, not a glyph system or font.
| mason_mpls wrote:
| This frustration seems unnecessary, unicode isn't more
| complicated than time and we have far more than enough
| processing power to handle its most absurd manifestations.
|
| We just need good libraries, which is a lot less work than
| inventing yet another system.
| arka2147483647 wrote:
| The limiting factor is not compute power, but the time and
| understanding of a random dev somewhere.
|
| Time also is not well understood by most programmers. Most
| just seem to convert it to epoch and pretend that it is
| continuous.
| bawolff wrote:
| There are no emoji in this guy's name.
|
| Unicode has made some mistakes, but having all the symbols
| necessary for this guy's name is not one of them.
| jrochkind1 wrote:
| Unicode has metadata on each character that would allow
| software to easily strip out or normalize emojis and
| "decorative" characters.
|
| It might have edge case problems -- but the characters in the
| OP's name would not be included.
|
| Also, stripping out emojis may not actually be required or the
| right solution. If security is the concern, Unicode _also_ has
| recommended processes and algorithms for dealing with that.
|
| https://www.unicode.org/reports/tr39/
|
| We need better support for the functions developers actually
| need on unicode in more platforms and languages.
|
| Global human language is complicated as a domain. Legacy issues
| in actually existing data adds to the complexity. Unicode does
| a pretty good job at it. It's actually pretty amazing how well
| it does. Including a lot more than just the character set, and
| encoding, but algorithms for various kinds of normalizing,
| sorting, indexing, under various localizations, etc.
|
| It needs better support in the environments more developers are
| working in, with raised-to-the-top standard solutions for
| identified common use cases and problems, that can be
| implemented simply by calling a performance-optimized library
| function.
|
| (And, if we really want to argue about emojis, they seem to be
| _extremely_ popular, and literally have affected global
| culture, because people want to use them? Blaming emojis
| seems like blaming the user! Unicode's support for them
| actually supports interoperability and vendor-neutral standards
| for a thing that is wildly popular? but I actually _don't_
| think any of the problems or complexity we are talking about,
| including the OP's complaint, can or should be laid at the feet
| of emojis)
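
The per-character metadata mentioned above is exposed in most languages. A rough sketch with Python's `unicodedata`, assuming the general category is a good enough signal for illustration (it is not a complete emoji detector; the helper name `categories` is hypothetical):

```python
import unicodedata

def categories(text):
    """Return the Unicode general category of every code point in `text`."""
    return [unicodedata.category(ch) for ch in text]

print(categories("stępień"))        # every letter reports a letter category
print(categories("\U0001F600"))     # a smiley reports a symbol category
```

A filter built on this would keep every character of the name in the article, since "ę" and "ń" are ordinary lowercase letters as far as Unicode is concerned.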
| josephcsible wrote:
| What would be wrong with "enter your name as it appears in the
| machine-readable zone of your passport" (or "would appear" for
| people who have never gotten one)? Isn't that the one standard
| format for names that actually is universal?
| ahoka wrote:
| I would like to use my name as my parents gave it to me,
| thanks. Is that too much to ask for?
| richardwhiuk wrote:
| How much flexibility are we giving parents in what they name
| children?
|
| If a parent invented a totally new glyph, would supporting
| that be a requirement?
| ks2048 wrote:
| There's the problem that "appears" is a visible phenomenon and
| unicode strings can contain non-visible characters and multiple
| ways to represent the same visible information. Normalization
| is supposed to help here, but some sites may fail to do this or
| do it incorrectly, etc.
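
A concrete case of "multiple ways to represent the same visible information", sketched with Python's `unicodedata.normalize`: the letter "ń" can be one code point or two, and the two forms only compare equal after normalization.

```python
import unicodedata

composed = "\u0144"       # "ń" as a single code point
decomposed = "n\u0301"    # "n" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Comparing or de-duplicating names without normalizing both sides to the same form is one way a site can "do it incorrectly".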
| gavinsyancey wrote:
| WTF-8 is actually a real encoding, used for encoding invalid
| UTF-16 unpaired surrogates for UTF-8 systems:
| https://simonsapin.github.io/wtf-8/
| bjackman wrote:
| I believe this is what Rust OsStrings are under the hood on
| Windows.
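
To make the "unpaired surrogate" problem concrete, here is a sketch in Python. Note the assumption: Python's `surrogatepass` error handler is used as a stand-in, not as a WTF-8 implementation. For a lone surrogate it produces the same three bytes WTF-8 would, but unlike WTF-8 it would also accept surrogates that form a valid pair.

```python
lone = "\ud800"  # an unpaired UTF-16 high surrogate held in a Python str

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Letting the surrogate through yields a 3-byte sequence in UTF-8 style.
wtf8_like = lone.encode("utf-8", errors="surrogatepass")
restored = wtf8_like.decode("utf-8", errors="surrogatepass")

print(strict_ok, wtf8_like.hex(), restored == lone)  # prints: False eda080 True
```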
| rurban wrote:
| Just use the unicode identifier rules, my libu8ident.
| https://github.com/rurban/libu8ident
|
| Windows folks need to convert to UTF-8 first
| bawolff wrote:
| It's really not that hard though. PCRE regex supports unicode
| letter classes. There is really no excuse for this type of issue.
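
A hedged sketch of what a letter-class check might look like. The comment above refers to PCRE's Unicode letter classes; Python's built-in `re` has no `\p{L}`, so this uses `[^\W\d_]` as an approximation (it also lets through a few numeric characters, and it rejects scripts that depend on combining marks that do not compose). The name `looks_like_name` and the choice of allowed separators are assumptions for illustration, not a recommended validation rule.

```python
import re
import unicodedata

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which for str patterns in Python's `re` approximates a Unicode letter.
_LETTERS = r"[^\W\d_]+"
_NAME = re.compile(rf"{_LETTERS}(?:[ '\-]{_LETTERS})*")

def looks_like_name(text: str) -> bool:
    # Compose first so "n" + combining accent is seen as the single letter "ń".
    return _NAME.fullmatch(unicodedata.normalize("NFC", text)) is not None

for candidate in ("Stępień", "Jean-Luc", "O'Brien", "R2D2"):
    print(candidate, looks_like_name(candidate))
```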
___________________________________________________________________
(page generated 2024-11-24 23:00 UTC)