https://devblogs.microsoft.com/oldnewthing/20241031-00/?p=110443

The Old New Thing

What has case distinction but is neither uppercase nor lowercase?

October 31st, 2024
Raymond Chen

If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase. Oooooh, spooky.

In other words, it is a character c with the properties that

* toUpper(c) ≠ toLower(c), yet
* c ≠ toUpper(c) and c ≠ toLower(c).

Congratulations, you found the mysterious third case: Title case.

There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together.
For example, the Unicode character ǳ (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).

These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. For example, the first ten letters of the Hungarian alphabet are:

+--------------------------------------------+
| a | á | b | c | cs | d | dz | dzs | e | é |
+--------------------------------------------+

These digraphs (and one trigraph) have three forms.

+----------------------+
| Form        | Result |
|-------------+--------|
| Uppercase   | Ǳ      |
|-------------+--------|
| Title case  | ǲ      |
|-------------+--------|
| Lowercase   | ǳ      |
+----------------------+

Unicode includes four digraphs in its encoding.

+------------------------------------+
| Uppercase | Title case | Lowercase |
|-----------+------------+-----------|
| Ǆ         | ǅ          | ǆ         |
|-----------+------------+-----------|
| Ǉ         | ǈ          | ǉ         |
|-----------+------------+-----------|
| Ǌ         | ǋ          | ǌ         |
|-----------+------------+-----------|
| Ǳ         | ǲ          | ǳ         |
+------------------------------------+

But wait, we have a Unicode code point for the dz digraph, but we don't have one for the cs digraph or the dzs trigraph. What's so special about dz?

These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.[1]

Just another situation where the world is more complicated than you think. You thought you understood uppercase and lowercase, but there's another case in between that you didn't know about.
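
As a rough illustration of the property described above, here is a minimal Python sketch that scans every code point for characters whose uppercase and lowercase forms differ while the character itself equals neither. It is only a sketch under stated assumptions: Python's `str.upper()` and `str.lower()` stand in for the article's toUpper and toLower (they apply Unicode's full case mappings, so results may differ from other runtimes), and the helper names `is_third_case`, `third_case_chars`, and `describe` are made up for this example.

```python
import sys
import unicodedata


def is_third_case(c):
    """True if c has distinct upper/lower forms yet equals neither of them."""
    upper, lower = c.upper(), c.lower()
    return upper != lower and c != upper and c != lower


def third_case_chars():
    """Scan every code point and collect the characters with the property."""
    return [chr(cp) for cp in range(sys.maxunicode + 1) if is_third_case(chr(cp))]


def describe(c):
    """Format one character with its code point, name, category, and case forms."""
    name = unicodedata.name(c, "<unnamed>")
    return (f"U+{ord(c):04X} {name} [{unicodedata.category(c)}] "
            f"upper={c.upper()!r} title={c.title()!r} lower={c.lower()!r}")


if __name__ == "__main__":
    # Each hit is expected to carry the general category Lt (Letter, titlecase),
    # but the scan prints the category rather than assuming it.
    for ch in third_case_chars():
        print(describe(ch))
```

The scan is brute force on purpose: it tests the two inequalities directly instead of trusting a category lookup. Checking `unicodedata.category(c) == "Lt"` is a shorter route to the title case digraphs if you already know that is what you are looking for.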
Bonus chatter: The fact that dz is treated as a single letter in Hungarian means that if you search for "mad", it should not match "madzag" (which means "string") because the "dz" in "madzag" is a single letter and not a "d" followed by a "z", no more than "lav" should match "law" just because the first part of the letter "w" looks like a "v". This is another surprising result you get if you mistakenly use a literal substring search rather than a locale-sensitive one. We'll look at locale-sensitive substring searches next time.

[1] I got this information from the Unicode Standard, Version 15.0, Chapter 7: "Europe I", Section 7.1: "Latin", subsection "Latin Extended-B: U+0180-U+024F", sub-subsection "Croatian Digraphs Matching Serbian Cyrillic Letters."

Category: Old New Thing
Author: Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

21 comments

* Tudor Iordachescu, 1 day ago

  The EU law imposes the user's informed consent for the use of cookies, that's it. Some corporations/people that first complied with that law (I admit I'm a little bit fuzzy on the historical timeline) chose to implement the most annoying version possible as a form of malicious compliance, probably hoping that public outcry would trigger a revision of the law.
  I bet that 99% of people implementing such popups nowadays are just lazy and follow the herd instead of researching what the law actually requires.

  + Bela Zsir, 1 day ago

    Some first compliers were hoping for a public outcry, since then we have all the herd following, ie. the overall impact is a million times greater, and still no public outcry. What does that say about the public? ...am I really the only one so annoyed with this?

* David Faulks, 2 days ago

  The reason these letters exist is because Unicode has a policy of 1-to-1 round trip encoding compatibility with older character sets, and Yugoslavia (keep in mind Unicode came out in 1991) used to have an 8-bit character set (YUSCII) that included these digraph letters. I'm not sure why so many commentators are focused on Hungarian.

* Michael Chermside, November 4, 2024
  Your point about how Hungarians actually use the characters is excellent -- and remains so even if it turns out that SOME Hungarian speakers disagree with you on this. In my opinion, the authors of the Unicode Standard should generally attempt to support oddities that are unique within a language but when native speakers disagree about an oddity, Unicode should err on the side of simplicity. (Of course dz may have been added to support Serbo-Croatian, not Hungarian.)

  However, your ire about cookie popups is misplaced. Computer technologists did not invent and impose them, an EU law mandated the cookie popups (and still does). I don't even live in the EU and I still have to wade through thickets of cookie agreement popups. Perhaps you could persuade your politicians to change that.

  + Bela Zsir, 2 days ago

    Thank you for the reply, sorry for the long rant. I was trying to make the point that after decades of NOT having all our characters, we don't want now to choke on too many.

    PS. You are right about the cookie thing being misplaced here, when I can, I speak out in the right place, but nobody cares, as if it were not a problem for anyone else in the world. In desperation I tried here to give just an example of what is a million times more disturbing than the made-up problem of the lack of some unneeded digraphs. (reminds me of the story made in the news how the 'Calculator Team' (sic! / sick?)
    in Redmond solved a 20 year problem in the Windows calculator, this and all similar waste of human resources ie. just to have a 'Calculator Team' make me sad)

    I am aware that it is a law, but nowhere in the law is it stated that half the page must be taken up by a cookie prompt graying out the rest of the page making it unusable till you answer a silly question. (I click always on OK, Yes, Agree, or whatever the 'dont care just go' button says to avoid the pointless further prompts)

    In my spare time, I volunteer at a centre for people with disabilities, doing what I can: electronics and programming, refurbishing the computers they receive as donations, 'inventing' alternative pointing devices. These people are blessed with a computer and the internet. Scrolling is easy for them without the help of their hands, but clicking a mouse on a randomly popping up window with a mess of buttons is each time a challenge, and it is growing. I am desperately trying to help these people with my browser extension scripts that auto-click away these bs, but they keep on coming, there are not two identically programmed out there.

    My only wish for you, Computer technologists, please help us, just make the implementation a standard. (for lawyer users you can leave all as is)

    If talking about the law, why is there no option for a legally binding statement built into a browser that I will allow/deny all cookies for the next 10 years or whatever, just don't ask me a million times the same thing. I will go to a notary to sign this life-long statement with a wax seal if needed. What law is it if I can give away a billion-dollar asset at the click of a button, then I am an adult? But this cookie thing is so important that it will be reasked 547 times just over the next week.

    I began with 'sorry for my rant', now I went on, sorry again. Annoying, isn't it? Those cookie prompts are much more annoying for my friends in that center.
* Kristof Roomp (Microsoft employee), November 3, 2024

  Ll and Ch were considered single letters in Spanish until they changed the rules in 1994. Spanish (traditional sort) treats them as single characters vs Modern sort.

* Bela Zsir, 4 days ago

  As a Hungarian, I'd like to add my two cents to the discussion.

  TLDR: I know that lately it's become a "woke" habit to look for 'oppressed victims' who have not the slightest idea that they're supposed to be victims. But thank you, we Hungarians--and our language--have no need for digraphs; in fact, having them and anybody using them would be actually harmful.

  I began programming in the 80s, we Hungarians spent about two decades with the problem that the full (8-bit) extended ASCII character table just almost had all the Hungarian characters. We had á, é, í, ó, ö, ú, ü, but we were missing ő and ű and their uppercase versions. (I guess these diacritics do not even have a name: could be double-acute?) With these four missing we were limited in doing anything computer-related with correct grammar. This was particularly frustrating because, as far as I know, all other European languages, including Eastern European languages, had all their letters.
  I remember how I complained that, despite the fact that there were Hungarians among the greats of computer science (János Neumann - Neumann architecture, János Kemény - inventor of multitasking and the BASIC language, Gábor Dénes - holography, András Gróf, who used the name Andrew Grove - co-founder of Intel, Károly Simonyi, who used the name Charles Simonyi - chief architect for Microsoft Word and Excel, and many others), they couldn't manage to lobby us into having those four more letters in the 256 ASCII characters. (please note I wrote their name using the original Hungarian letters)

  The other annoying thing for us is the order of given name / surname. We use it the other way, it is used in very few languages that way on the Planet, and still today's mainstream apps are having issues with this (actually they just don't care)

  So, Dear international computing community, you owe us Hungarians. Please don't ruin our text searches with the unpredictable results of not finding 'mad' in 'madzag'. I assure you, not a single Hungarian expects it this way, nor does anyone need it ever.

  In my opinion, it's already nonsense that these double and triple letters were made to be part of the official Hungarian alphabet. I deliberately do not call them digraphs, by that logic, plenty of other letter pairs in our language -- and also in all other languages -- could also become digraphs. There's nothing special about "dz" in Hungarian. We use it the same way as the "ch" in the word *technika*--one sound, two letters, but not in the alphabet as a digraph. The same goes for "ts," "tz," and a dozen other letter pairs. And what concerns the hyphenation rules, they are independent of these anyway, we do not hyphenate as tec-hnika, and for this to work, a digraph of 'ch' is not needed.

  I bet you'd go crazy if a search function couldn't find "is" in the word "island" just because someone decided that "sl" should be a digraph in English.
  Please spare us too from this, we want to find our mad-ness in 'madzag', it's OK so.

  PS. if you're looking for a real problem to solve, do something about the plethora of cookie-consent popups that clutter everything on the internet. They have no sense in the fight for privacy, I guess the lawmakers just knew and could pronounce the word 'cookie' (Want a yummy cookie, Charlie?) and luckily had no idea what a localStorage, sessionStorage, indexedDB, or cache storage is. These cookie-things make any browsing a pain, I click them away dozens of times a day, very annoying, internet was not supposed to look this way. At least standardize them, they are all annoyingly (ie. unautomatable) different, what about a special HTML tag for this bs?

* Jonas Barklund, 5 days ago

  Raymond, did you try to make a distinction between digraph and diagraph, or was the latter a typo for digraph?

* Alvaro Gonzalez, 5 days ago

  Funny. That same letter also used to exist in Spanish, together with ll (double L). Both were demoted in the mid 1990s so I guess they never made it into Unicode. I also think it was for the best. To look up things in a list or dictionary you had to know the language it was written in.

* Chris Warrick, October 31, 2024
  > The fact that dz is treated as a single letter in Hungarian means that if you search for "mad", it should not match "madzag" (which means "string") because the "dz" in "madzag" is a single letter and not a "d" followed by a "z"

  This sounds mad to me. Polish has a fair share of digraphs and trigraphs, but I expect partially-typed digraphs not to change the search result. It is disorienting if the result appears when `ma` is typed, disappears after typing `d`, and then comes back after typing `z`. And that applies to experiences which don't search automatically as well.

  + Daniel Chylek, November 2, 2024

    I guess it is weird if you combine multiple languages on your system, but to me it's entirely reasonable to expect that Czech Windows will not find a file containing 'ch' when you search for 'c' or 'h'. That is how it works right now.

* Jan Ringos, October 31, 2024

  In Czech, we have similar letter 'ch' but it never got assigned a single Unicode codepoint. It's probably for the best.

* Nadudvari Kazmer, October 31, 2024

  Indeed, we consider those as single sounds (cs, dz, sz, ...), but two characters, not digraphs. They remain together in hyphenation only.