hngopher.com

       [HN Gopher] A years-long Turkish alphabet bug in the Kotlin comp...
       ___________________________________________________________________
        
       A years-long Turkish alphabet bug in the Kotlin compiler
        
       Author : Bogdanp
       Score  : 49 points
       Date   : 2025-10-12 17:02 UTC (5 hours ago)
        
 (HTM) web link (sam-cooper.medium.com)
 (TXT) w3m dump (sam-cooper.medium.com)
        
       | carstenhag wrote:
       | I was scrolling and scrolling, waiting for the author to mention
       | the new methods, which of course every Android Dev had to migrate
       | to at some point. And 99% of us probably thought how annoying
       | this change is, even though it probably reduced the number of
       | bugs for Turkish users :)
       | 
       | Unrelated, but a month ago I found a weird behaviour where in a
       | kotlin scratch file, `List.isEmpty()` is always true. Questioned
       | my sanity for at least an hour there...
       | https://youtrack.jetbrains.com/issue/KTIJ-35551/
        
         | ajkjk wrote:
         | well now I wanna know what's going on there!
        
       | johnyzee wrote:
       | Ugh, I've had the exact same problem in a Java project, which
       | meant I had to go through thousands and thousands of lines of
       | code and make sure that all 'toLowerCase()' on enum names
       | included Locale.ENGLISH as parameter.
       | 
       | As the article demonstrates, the error manifests in a completely
       | inscrutable way. But once I saw the bug from a couple of users
       | with Turkish sounding names, I zeroed in on it. And cursed a few
       | times under my breath whoever messed up that character table so
       | bad.
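
A minimal Java sketch of the failure mode described in the comment above. The enum `Level` and the helper `key` are hypothetical names made up for illustration; only `String.toLowerCase(Locale)` and `Locale.forLanguageTag` are real JDK calls. It shows how lowercasing an enum name under the Turkish locale produces a dotless i (U+0131), so a lookup against ASCII keys would miss.

```java
import java.util.Locale;

public class EnumLookupDemo {
    // Hypothetical enum standing in for any set of internal identifiers.
    public enum Level { INFO, WARNING }

    // Builds a lowercase lookup key; the result depends on the locale passed in.
    public static String key(Level level, Locale locale) {
        return level.name().toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale.ROOT gives the plain ASCII key "info".
        System.out.println(key(Level.INFO, Locale.ROOT));
        // Under Turkish rules 'I' lowercases to U+0131 (dotless i),
        // so this key no longer equals "info".
        System.out.println(key(Level.INFO, turkish).equals("info")); // false
    }
}
```

Passing `Locale.ROOT` (or `Locale.ENGLISH`, as the comment mentions) keeps the key stable regardless of the user's settings.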
        
         | nradov wrote:
         | Were you not using static analysis tools? All of the popular
         | ones will warn about that issue with locales.
        
       | mikestew wrote:
       | When I saw "Turkish alphabet bug", I just knew it was some
       | version of toLower() gone horribly wrong.
       | 
       | (I'm sure there's a good reason, but I find it odd that compiler
       | message tags are invariably uppercase, but in this problem code
       | they lowercased it to go do a lookup from an enum of lowercase
       | names. Why isn't the enum uppercase, like the things you're going
       | to lookup?)
        
       | charcircuit wrote:
       | Everyone who has used Java has hit this before. Java really
       | should force people to always specify the locale and get rid of
       | the versions of the functions without locale parameters. There is
       | so much hidden broken code out there.
        
         | Uvix wrote:
         | That only helps if devs specify an invariant locale (ROOT for
         | Java) where needed. In practice, I think you'll see devs
         | blindly using the user's current locale like it silently
         | does today.
        
           | jeroenhd wrote:
           | The invariant locale can't parse the numbers I enter (my
           | locale uses comma as a decimal separator). More than a few
           | applications will reject perfectly valid numbers. Intel's
           | driver control panel was even so fucked up that I needed to
           | change my locale to make it parse its own UI layout files.
           | 
           | Defaulting to ROOT makes a lot of sense for internal
           | constants, like in the example in this article, but
           | defaulting to ROOT for everything just exposes the problems
           | that caused Sun to use the user locale by default in the
           | first place.
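
The number-parsing point above can be sketched in Java as follows. The class and method names are hypothetical; `NumberFormat.getInstance(Locale)` is the standard JDK entry point. The sketch assumes a user who types "3,14" with a decimal comma: a German-locale parser reads it as 3.14, while a ROOT-locale parser does not, because the comma is not its decimal separator.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class DecimalCommaDemo {
    // Parses user-typed text using the number conventions of the given locale.
    public static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        // The user's own locale understands the decimal comma.
        System.out.println(parse("3,14", Locale.GERMANY)); // 3.14
        // The invariant locale misreads the same input.
        System.out.println(parse("3,14", Locale.ROOT) == 3.14); // false
    }
}
```

So user-facing input and output arguably want the user's locale, while internal identifiers want ROOT, which matches the split drawn in the comment.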
        
             | Uvix wrote:
             | Agreed, there are cases where user locale is needed. So
             | many that I expect it to be devs' default if required
             | to specify, and that they _won't_ use ROOT where they
             | should.
        
       | zettabomb wrote:
       | I have always wondered why Turkey chose to Latinize in this way.
       | I understand that the issue is having two similar vowels in
       | Turkish, but not why they decided to invent the dotless I, when
       | other diacritics already existed. I I I I I I I and almost
       | certainly a dozen others would've worked, unless there was already
       | some significance to the dot in Turkish that's not obvious.
        
         | mrighele wrote:
         | The issue is not the invention of the dotless I, which already
         | existed; the issue is that they took a vowel, i/I, assigned
         | the lower case to one vowel and the upper case to a different
         | one, and invented what was left missing.
         | 
         | It's like they decided that the uppercase of "a" is "E" and the
         | uppercase of "e" is "A".
        
           | pinkmuffinere wrote:
           | This is misleading, because it assumes that i/I naturally
           | represent one vowel, which is just not the case. i/I
           | represents one vowel in _English_, when written with a latin
           | script. In fact even this isn't correct, i/I represents one
           | phoneme, not one vowel. <see troad's comment for correction>
           | 
           | There is no reason to assume that the English representation
           | is in general "correct", "standard", or even "first". The
           | modern script for Turkish was adopted around the 1920's, so
           | you could argue perhaps that most typewriters presented a
           | standard that should have been followed. However, there was
           | variation even between different typewriters, and I strongly
           | suspect that typewriters weren't common in Turkey when the
           | change was made.
        
             | ginko wrote:
             | >This is misleading, because it assumes that i/I naturally
             | represent one vowel, which is just not the case.
             | 
             | It does in literally any language using a latin alphabet
             | other than Turkish.
        
               | pinkmuffinere wrote:
               | This may be correct, I'd have to do a 'real' search,
               | which I'm too lazy to do, lol sorry. However there are
               | definitely other (non-latin) scripts that have either i
               | or I, but for which i/I is not a correct pair. For
               | example, greek has i/I too.
        
               | okanat wrote:
               | All other Turkic languages also copied this for their
               | Latin script: https://en.wikipedia.org/wiki/Dotless_I
        
             | troad wrote:
             | > In fact even this isn't correct, i/I represents one
             | phoneme, not one vowel.
             | 
             | Not quite. In English, 'i' and 'I' are two allographs of
             | one grapheme, corresponding to many phonemes, based on
             | context. (Using linguistic definitions here, not compsci
             | ones.) The 'i's in 'kit' and 'kite' stand for different
             | phonemes, for example.
             | 
             | > There is no reason to assume that the English
             | representation is in general "correct", "standard", or even
             | "first".
             | 
             | Correct, but the I/i allography is not exclusive to
             | English. Every Latin script functions that way, other than
             | Turkish and Turkish-derived scripts.
             | 
             | No one is saying Turkish cannot break from that convention
             | - they can feel free to do anything they like - but the
             | resulting issues are fairly predictable, and their adverse
             | effects fall mainly on Turkish speakers in practice, not on
             | the rest of us.
        
               | pinkmuffinere wrote:
               | > Not quite. In English, 'i' and 'I' are two allographs
               | of one grapheme, corresponding to many phonemes, based on
               | context. (Using linguistic definitions here, not compsci
               | ones.) The 'i's in 'kit' and 'kite' stand for different
               | phonemes, for example.
               | 
               | You're right, apologies my linguistics is rusty and I was
               | overconfident.
               | 
               | > Correct, but the I/i allography is not exclusive to
               | English. Every Latin script functions that way, other
               | than Turkish and Turkish-derived scripts.
               | 
               | I think my main argument is that the importance of
               | standardizing to i/I was much less obvious in the 1920's.
               | The benefits are obvious to us now, but I think we would
               | be hard pressed to predict this outcome a-priori.
        
               | Muromec wrote:
               | > but the resulting issues are fairly predictable, and
               | their adverse effects fall mainly on Turkish speakers in
               | practice, not on the rest of us.
               | 
               | I don't think it's fair to call it predictable. When this
               | convention was chosen, the problem of "what is the
               | uppercase letter to I" was always bound to the context of
               | language. Now it suddenly isn't. Shikata ga nai. It
               | wasn't even an explicit assumption that can be reflected
               | upon, it was an implicit one, that just happened.
        
           | steezeburger wrote:
           | I don't think that's the right way to think about it. It's
           | not like they were Latinizing Turkish with ASCII in mind.
           | They wanted a one-to-one mapping between letters and sounds.
           | The dot versus no dot marks where in your mouth or throat the
           | vowel is formed. They didn't have this concept that capital I
           | automatically pairs with lowercase i. The dot was always part
           | of the letter itself. The reform wasn't trying to fit
           | existing Western conventions, it was trying to map the
           | Turkish sounds to symbols.
        
           | okanat wrote:
           | Not really. Turkish has a feature that is called "vowel
           | harmony". You match suffixes you add to a word based on a
           | category system: low pitch vs high pitch vowels where a, ı, o,
           | u are low pitch and e, i, ö, ü are high pitch.
           | 
           | Ö and ü were already borrowed from the German alphabet. The
           | umlaut-added variants 'ö' and 'ü' have a similar effect on 'o'
           | and 'u' respectively: they bring a back vowel to front. See:
           | https://en.wikipedia.org/wiki/Vowel . Similarly removing the
           | dots brings them back.
           | 
           | Turkish already had the i sound and its back variant, which is
           | a schwa-like sound:
           | https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It
           | has the same relation in IPA as 'ö' has to 'o' and 'ü' has to
           | 'u'. Since the makers of the Turkish variant of the Latin
           | alphabet had the rare chance of making a regular
           | pronunciation system with the state of the language and since
           | removing the dots had the effect of making a front vowel a
           | back vowel, they simply copied this feature from ö and ü to
           | i:
           | 
           | Just remove the dots to make it a back vowel! Now we have ı.
           | 
           | When it comes to capitalization, ö becomes Ö, ü becomes Ü. So
           | it is just logical to make the capital of i İ and the lowercase
           | of I ı.
        
             | ithkuil wrote:
             | Yes it's hard to come up with a different capital than I
             | unless you somehow can see into the future and foresee the
             | advent of computers, which the Turkish alphabet reform
             | predates.
             | 
             | Of course the latin capital I is dotless because originally
             | the lowercase latin "i" was also dotless. The dot was
             | added later to make text more legible.
        
           | ozgung wrote:
           | Nope, we decided to do it the correct and logical way for our
           | alphabet. Some glyphs are either dotted or dotless. So, we
           | have Iı, İi, Oo, Öö, Uu, Üü, Cc, Çç, Ss and Şş. You see the
           | Ii pair is actually the odd one in the series.
           | 
           | Also, we don't have serifs in our I. It's just a straight
           | line. So, it's not even related to your Ii pair in English.
           | You can't dictate how we write our straight lines, can you.
           | 
           | The root cause of the problem is in the implementation and
           | standardization of the computer systems. Computers were
           | originally designed with only the English alphabet in mind, and
           | patched to support other languages over time, poorly.
           | Computers should obey the language rules, not the other way
           | around.
        
             | zettabomb wrote:
             | >Also, we don't have serifs in our I.
             | 
             | That depends on font.
             | 
             | >So, it's not even related to your Ii pair in English.
             | 
             | Modern Turkish uses the Latin script, of course it's
             | related.
             | 
             | >You can't dictate how we write our straight lines, can
             | you.
             | 
             | No, I can't, I just want to understand why the Turks
             | decided to change this letter, and this letter only, from
             | the rest of the standard Latin script/diacritics.
        
         | ayhanfuat wrote:
         | Except for the a/e pair, front and back vowels have dotted and
         | dotless versions in Turkish: i and ı, ö and o, ü and u.
        
           | zettabomb wrote:
           | Makes sense enough, but why not use ï and ı to be consistent?
        
             | okanat wrote:
             | Turkish i/I sounds pretty similar to most of the European
             | languages. Italian, French and German pronounce it pretty
             | similar. Also removing umlauts from the other two vowels ö
             | and ü to write o and u has the same effect as removing the
             | dot from i. It is just consistent.
        
               | zettabomb wrote:
               | No, what I mean is, o and u get an umlaut (two dots) to
               | become ö and ü, but i doesn't get an umlaut, it's just a
               | single dot from ı to i. Why not make it ı and ï? That
               | would be more consistent, in my opinion.
        
             | ayhanfuat wrote:
             | This was shortly after the Turkish War of Independence.
             | Illiteracy was quite high (estimated at over 85%) and the
             | country was still being rebuilt. My guess is they did their
             | best to represent all the sounds while creating a one to
             | one mapping between sounds and letters but also not
             | deviating too much from familiar forms. There were probably
             | conflicting goals so inconsistencies were bound to happen.
        
           | o11c wrote:
           | In that case they should've used ï for consistency.
        
         | nurettin wrote:
         | There were actually three! i (as in th[i]s), i (as in ch[ee]se)
         | and i which sounds nothing like the first two, it sounds
         | something like the e in bag[e]l. I guess it sounded so
         | different that it warranted such a drastic symbolic change.
        
           | ithkuil wrote:
           | Turkish exhibits a vowel harmony system and uses diacritics
           | on other vowels too and the choice to put "i" together with
           | other front vowels like "ü" and "ö" and put "ı" together with
           | back vowels like "u" and "o" is actually pretty elegant.
           | 
           | The latinization reform of the Turkish language predates
           | computers and it was hard to foresee the woes that future
           | generations would have had with that choice
        
         | jeroenhd wrote:
         | Computers and localisation weren't relevant back in the early
         | 20th century. The dotless ı existed before the dotted i (in Greek
         | script as iota). Some European scholars putting an extra dot on
         | the letter to make it stand out a bit more are as much to blame
         | as the Turks for making the distinction between the different
         | i-vowels clear.
         | 
         | Really, this bug is nothing but programmers failing to take
         | into account that not everybody writes in English.
        
       | okanat wrote:
       | As a Turkish speaker who was using a Turkish-locale setup in my
       | teenage years these kinds of bugs frustrated me infinitely. Half
       | of the Java or Python apps I installed never ran. My PHP
       | webservers always had problems with random software. Ultimately,
       | I had to change my system's language to English. However, the US has
       | godawful standards for everything: dates, measurement units,
       | paper sizes.
       | 
       | When I shared computers with my parents I had to switch languages
       | back-and-forth all the time. This helped me learn English rather
       | quickly but, I find it a huge accessibility and software design
       | issue.
       | 
       | If your program depends on letter cases, that is a badly designed
       | program, period. If a language ships toUpper or a toLower
       | function without a mandatory language field, it is badly designed
       | too. The only slightly-better option is making toUpper and
       | toLower ASCII-only and throwing error for any other character
       | set.
       | 
       | While half of the language design of C is questionable and
       | outright dangerous, making its functions locale-sensitive by all
       | popular OSes was an avoidable mistake. Yet everybody did that.
       | Just the existence of this behavior is a reason I would like to
       | get rid of anything GNU-based in the systems I develop today.
       | 
       | I don't care if Unicode releases a conversion map.
       | Natural-language behavior should always require natural language
       | metadata too. Even modern languages like Rust did a crappy job of
       | enforcing it:
       | https://doc.rust-lang.org/std/primitive.char.html#method.to_... .
       | Yes it is significantly safer but converting 'ß' to 'SS' in German
       | definitely has gotchas too.
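
A rough Java sketch of the "ASCII-only, throw on anything else" option suggested above. The class `AsciiCase` and its method `toLowerAscii` are hypothetical, not part of any library. The `main` method also shows the German gotcha mentioned at the end of the comment: the JDK's `toUpperCase` expands 'ß' (U+00DF) to "SS", so the string grows by one character.

```java
import java.util.Locale;

public class AsciiCase {
    // Lowercases A-Z only and rejects anything outside 7-bit ASCII,
    // so the result can never depend on a locale.
    public static String toLowerAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException(
                    "non-ASCII character at index " + i);
            }
            out.append(c >= 'A' && c <= 'Z' ? (char) (c + ('a' - 'A')) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLowerAscii("TITLE")); // title
        // Locale-independent Unicode casing still changes the length here.
        String upper = "stra\u00dfe".toUpperCase(Locale.ROOT);
        System.out.println(upper); // STRASSE
    }
}
```

Failing loudly on non-ASCII input is a design choice that suits internal identifiers; it would be the wrong tool for text that users actually read.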
        
         | arccy wrote:
         | use Australian English: English but with same settings for
         | everything else, including keyboard layout
        
           | Sesse__ wrote:
           | Many Linux distributions provide en_DK specifically for this
           | purpose. English as it is used in Denmark. :-)
        
             | fph wrote:
             | Denmark doesn't have Euros as currency, unfortunately.
        
           | okanat wrote:
           | I live in Germany now, so I generally set it to Irish
           | nowadays. Since I like ISO-style enter key, I use UK keyboard
           | layout (also easier to switch to Turkish than ANSI-layout).
           | However many OSes now have a English (Europe) locale too
        
         | 1718627440 wrote:
         | > However, US has godawful standards for everything: dates,
         | measurement units, paper sizes.
         | 
         | Isn't the choice of language and date and unit formats normally
         | independent.
        
           | neandrake wrote:
           | There are OS-level settings for date and unit formats but not
           | all software obeys that, instead falling back to using the
           | default date/unit formats for the selected locale.
        
           | okanat wrote:
           | > > However, US has godawful standards for everything: dates,
           | measurement units, paper sizes.
           | 
           | > Isn't the choice of language and date and unit formats
           | normally independent.
           | 
           | You would hope so but, no. Quite a bit of software ties the
           | language setting to the locale setting. If you are lucky, they
           | will provide an "English (UK)" option (which still uses miles
           | or FFS WTF is a stone!).
           | 
           | On Windows you can kinda select the units easily. On Linux
           | let me introduce you to the journey to LC_ environment
           | variables:
           | https://www.baeldung.com/linux/locale-environment-variables .
           | This doesn't mean the websites or the apps will obey them.
           | Quite a few of them don't and just use LANGUAGE, LANG or
           | LC_CTYPE as their setting.
           | 
           | My company switched to Notion this year (I still miss
           | Confluence). It was hell until last month since they only had
           | "English (US)" and used M/D/Y everywhere with no option to
           | change!
        
         | collinfunk wrote:
         | > While half of the language design of C is questionable and
         | outright dangerous, making its functions locale-sensitive by
         | all popular OSes was an avoidable mistake. Yet everybody did
         | that. Just the existence of this behavior is a reason I would
         | like to get rid of anything GNU-based in the systems I develop
         | today.
         | 
         | POSIX requires that many functions account for the current
         | locale. I'm not sure why you are blaming GNU for this.
        
       | darkhorn wrote:
       | Java; write once, run anywhere, except on Turkish Windows.
        
       | sjrd wrote:
       | I am one of the maintainers of the Scala compiler, and this is
       | one of the things that immediately jump out at me when I review code
       | that contains any casing operation. Always explicitly specify the
       | locale. However, unlike TFA and other comments, I don't suggest
       | `Locale.US`. That's a little too US-centric. The canonical locale
       | is in fact `Locale.ROOT`. Granted, in practice it's equivalent,
       | but I find it a little bit more sensible.
       | 
       | Also, this is the last remaining major system-dependent default
       | in Java. They made strict floating point the default in 17; UTF-8
       | the default encoding in 18 (JEP 400); only the locale
       | remains. I hope they make ROOT the default in an upcoming
       | version.
       | 
       | FWIW, in the Scala.js implementation, we've been using UTF-8 and
       | ROOT as the defaults forever.
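
To illustrate the system-dependent default discussed above, here is a small Java sketch. The class `DefaultLocaleDemo` and the helper `lowerWithDefault` are hypothetical; the helper temporarily swaps the JVM default locale to simulate running on a Turkish system. It shows that the no-argument `toLowerCase()` follows the default locale, and that `Locale.ROOT` and `Locale.US` agree for this input.

```java
import java.util.Locale;

public class DefaultLocaleDemo {
    // Runs the no-argument toLowerCase() under a temporary default locale,
    // then restores the previous default.
    public static String lowerWithDefault(String s, Locale temporaryDefault) {
        Locale saved = Locale.getDefault();
        try {
            Locale.setDefault(temporaryDefault);
            return s.toLowerCase(); // implicitly uses Locale.getDefault()
        } finally {
            Locale.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Same call, different answer depending on the machine's settings.
        System.out.println(lowerWithDefault("INLINE", Locale.US).equals("inline"));  // true
        System.out.println(lowerWithDefault("INLINE", turkish).equals("inline"));    // false
        // An explicit locale removes the dependency; ROOT and US agree here.
        System.out.println("INLINE".toLowerCase(Locale.ROOT)
            .equals("INLINE".toLowerCase(Locale.US)));                                // true
    }
}
```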
        
       | naniwaduni wrote:
       | A stark reminder that _all operations on strings are wrong_.
        
       | esafak wrote:
       | Kotlin keywords should be assumed to be English.
        
       ___________________________________________________________________
       (page generated 2025-10-12 23:00 UTC)