[HN Gopher] A years-long Turkish alphabet bug in the Kotlin comp...
___________________________________________________________________
A years-long Turkish alphabet bug in the Kotlin compiler
Author : Bogdanp
Score : 49 points
Date : 2025-10-12 17:02 UTC (5 hours ago)
(HTM) web link (sam-cooper.medium.com)
(TXT) w3m dump (sam-cooper.medium.com)
| carstenhag wrote:
| I was scrolling and scrolling, waiting for the author to mention
| the new methods, which of course every Android Dev had to migrate
| to at some point. And 99% of us probably thought how annoying
| this change is, even though it probably reduced the number of
| bugs for Turkish users :)
|
| Unrelated, but a month ago I found a weird behaviour where in a
| kotlin scratch file, `List.isEmpty()` is always true. Questioned
| my sanity for at least an hour there...
| https://youtrack.jetbrains.com/issue/KTIJ-35551/
| ajkjk wrote:
| well now I wanna know what's going on there!
| johnyzee wrote:
| Ugh, I've had the exact same problem in a Java project, which
| meant I had to go through thousands and thousands of lines of
| code and make sure that all 'toLowerCase()' on enum names
| included Locale.ENGLISH as parameter.
|
| As the article demonstrates, the error manifests in a completely
| inscrutable way. But once I saw the bug from a couple of users
| with Turkish sounding names, I zeroed in on it. And cursed a few
| times under my breath whoever messed up that character table so
| bad.
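|
| A minimal sketch of the failure mode described above, assuming a
| hypothetical `Severity` enum with lowercase names (the class and
| enum names are made up for illustration, not taken from the
| article or from any real project):

```java
import java.util.Locale;

class EnumLookupDemo {
    // Hypothetical enum with lowercase constant names, standing in for the
    // kind of lookup table discussed in this thread.
    enum Severity { info, warning, error }

    // Lowercases the tag with an explicit locale; nothing depends on the JVM default.
    static String lower(String tag, Locale locale) {
        return tag.toLowerCase(locale);
    }

    // True if the lowercased tag equals the name of some Severity constant.
    static boolean matches(String tag, Locale locale) {
        String key = lower(tag, locale);
        for (Severity s : Severity.values()) {
            if (s.name().equals(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Locale.ROOT, "INFO" lowercases to "info" and the lookup succeeds.
        System.out.println(matches("INFO", Locale.ROOT));
        // Under a Turkish locale, 'I' lowercases to dotless U+0131, so the lookup fails.
        System.out.println(matches("INFO", turkish));
    }
}
```

| Tags without an 'I' (such as "ERROR") match under both locales,
| which is part of why the bug is so hard to spot.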
| nradov wrote:
| Were you not using static analysis tools? All of the popular
| ones will warn about that issue with locales.
| mikestew wrote:
| When I saw "Turkish alphabet bug", I just knew it was some
| version of toLower() gone horribly wrong.
|
| (I'm sure there's a good reason, but I find it odd that compiler
| message tags are invariably uppercase, but in this problem code
| they lowercased it to go do a lookup from an enum of lowercase
| names. Why isn't the enum uppercase, like the things you're going
| to look up?)
| charcircuit wrote:
| Everyone who has used Java has hit this before. Java really
| should force people to always specify the locale and get rid of
| the versions of the functions without locale parameters. There is
| so much hidden broken code out there.
| Uvix wrote:
| That only helps if devs specify an invariant locale (ROOT for
| Java) where needed. In practice, I think you'll see devs
| blindly using the user's current locale like it silently
| does today.
| jeroenhd wrote:
| The invariant locale can't parse the numbers I enter (my
| locale uses comma as a decimal separator). More than a few
| applications will reject perfectly valid numbers. Intel's
| driver control panel was even so fucked up that I needed to
| change my locale to make it parse its own UI layout files.
|
| Defaulting to ROOT makes a lot of sense for internal
| constants, like in the example in this article, but
| defaulting to ROOT for everything just exposes the problems
| that caused Sun to use the user locale by default in the
| first place.
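|
| A rough sketch of the trade-off described above, using the
| standard `java.text.NumberFormat` API. The class and method
| names are hypothetical, and `Locale.GERMANY` is only an assumed
| stand-in for "a locale that uses a decimal comma":

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class DecimalCommaDemo {
    // Parses user-entered text using the number conventions of the given locale.
    // The checked ParseException is wrapped so callers can use this in expressions.
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        String userInput = "3,14"; // typed by a user whose locale uses a decimal comma
        // The user's locale reads the comma as the decimal separator.
        System.out.println(parse(userInput, Locale.GERMANY));
        // The invariant locale does not treat the comma as a decimal separator,
        // so the same text does not come back as 3.14.
        System.out.println(parse(userInput, Locale.ROOT));
    }
}
```

| User-facing input wants the user's locale; internal constants
| want ROOT. Neither choice is a safe default for both.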
| Uvix wrote:
| Agreed, there are cases where user locale is needed. So
| many, in fact, that I expect it to be devs' default if required
| to specify, and that they _won't_ use ROOT where they
| should.
| zettabomb wrote:
| I have always wondered why Turkey chose to Latinize in this way.
| I understand that the issue is having two similar vowels in
| Turkish, but not why they decided to invent the dotless I, when
| other diacritics already existed. I I I I I I I and almost
| certainly a dozen other would've worked, unless there was already
| some significance to the dot in Turkish that's not obvious.
| mrighele wrote:
| The issue is not the invention of the dotless I, which already
| existed; the issue is that they took one vowel, i/I, assigned
| the lower case to one vowel and the upper case to a different
| one, and invented what was left missing.
|
| It's like they decided that the uppercase of "a" is "E" and the
| uppercase of "e" is "A".
| pinkmuffinere wrote:
| This is misleading, because it assumes that i/I naturally
| represent one vowel, which is just not the case. i/I
| represents one vowel in _English_, when written with a latin
| script. In fact even this isn't correct, i/I represents one
| phoneme, not one vowel. <see troad's comment for correction>
|
| There is no reason to assume that the English representation
| is in general "correct", "standard", or even "first". The
| modern script for Turkish was adopted around the 1920's, so
| you could argue perhaps that most typewriters presented a
| standard that should have been followed. However, there was
| variation even between different typewriters, and I strongly
| suspect that typewriters weren't common in Turkey when the
| change was made.
| ginko wrote:
| >This is misleading, because it assumes that i/I naturally
| represent one vowel, which is just not the case.
|
| It does in literally any language using a latin alphabet
| other than Turkish.
| pinkmuffinere wrote:
| This may be correct, I'd have to do a 'real' search,
| which I'm too lazy to do, lol sorry. However there are
| definitely other (non-latin) scripts that have either i
| or I, but for which i/I is not a correct pair. For
| example, greek has i/I too.
| okanat wrote:
| All other Turkic languages also copied this for their
| Latin script: https://en.wikipedia.org/wiki/Dotless_I
| troad wrote:
| > In fact even this isn't correct, i/I represents one
| phoneme, not one vowel.
|
| Not quite. In English, 'i' and 'I' are two allographs of
| one grapheme, corresponding to many phonemes, based on
| context. (Using linguistic definitions here, not compsci
| ones.) The 'i's in 'kit' and 'kite' stand for different
| phonemes, for example.
|
| > There is no reason to assume that the English
| representation is in general "correct", "standard", or even
| "first".
|
| Correct, but the I/i allography is not exclusive to
| English. Every Latin script functions that way, other than
| Turkish and Turkish-derived scripts.
|
| No one is saying Turkish cannot break from that convention
| - they can feel free to do anything they like - but the
| resulting issues are fairly predictable, and their adverse
| effects fall mainly on Turkish speakers in practice, not on
| the rest of us.
| pinkmuffinere wrote:
| > Not quite. In English, 'i' and 'I' are two allographs
| of one grapheme, corresponding to many phonemes, based on
| context. (Using linguistic definitions here, not compsci
| ones.) The 'i's in 'kit' and 'kite' stand for different
| phonemes, for example.
|
| You're right, apologies my linguistics is rusty and I was
| overconfident.
|
| > Correct, but the I/i allography is not exclusive to
| English. Every Latin script functions that way, other
| than Turkish and Turkish-derived scripts.
|
| I think my main argument is that the importance of
| standardizing to i/I was much less obvious in the 1920's.
| The benefits are obvious to us now, but I think we would
| be hard pressed to predict this outcome a-priori.
| Muromec wrote:
| > but the resulting issues are fairly predictable, and
| their adverse effects fall mainly on Turkish speakers in
| practice, not on the rest of us.
|
| I don't think it's fair to call it predictable. When this
| convention was chosen, the problem of "what is the
| uppercase letter to I" was always bound to the context of
| language. Now it suddenly isn't. Shikata ga nai. It
| wasn't even an explicit assumption that can be reflected
| upon, it was an implicit one, that just happened.
| steezeburger wrote:
| I don't think that's the right way to think about it. It's
| not like they were Latinizing Turkish with ASCII in mind.
| They wanted a one-to-one mapping between letters and sounds.
| The dot versus no dot marks where in your mouth or throat the
| vowel is formed. They didn't have this concept that capital I
| automatically pairs with lowercase i. The dot was always part
| of the letter itself. The reform wasn't trying to fit
| existing Western conventions, it was trying to map the
| Turkish sounds to symbols.
| okanat wrote:
| Not really. Turkish has a feature that is called "vowel
| harmony". You match suffixes you add to a word based on a
| category system: low pitch vs high pitch vowels where a,ı,o,u
| are low pitch and e,i,ö,ü are high pitch.
|
| Ö and ü were already borrowed from the German alphabet. The umlaut-
| added variants of 'o' and 'u' have a similar effect on 'o'
| and 'u' respectively: they bring a back vowel to the front. See:
| https://en.wikipedia.org/wiki/Vowel . Similarly, removing the
| dots brings them back.
|
| Turkish already had the i sound and its back variant, which is a
| schwa-like sound:
| https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It
| has the same relation in IPA as 'o' has to 'ö' and 'u' has to
| 'ü'. Since the makers of the Turkish variant of the Latin
| alphabet had the rare chance of making a regular
| pronunciation system with the state of the language, and since
| removing the dots had the effect of making a front vowel a
| back vowel, they simply copied this feature from ö and ü to
| i:
|
| Just remove the dots to make it a back vowel! Now we have ı.
|
| When it comes to capitalization, ö becomes Ö, ü becomes Ü. So it
| is just logical to make the capital of i İ and the lowercase
| of I ı.
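|
| The two case pairs described above can be checked against Java's
| Turkish-locale case mapping. This is only an illustrative sketch;
| the class and helper names are hypothetical, and the letters are
| written as Unicode escapes so the source stays ASCII-only:

```java
import java.util.Locale;

class TurkishCasePairs {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Case conversion using Turkish rules, independent of the JVM default locale.
    static String upper(String s) {
        return s.toUpperCase(TURKISH);
    }

    static String lower(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Formats the first code point of s as U+XXXX for display.
    static String codePoint(String s) {
        return String.format(Locale.ROOT, "U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Dotted pair: i (U+0069) <-> dotted capital I (U+0130)
        System.out.println(codePoint("i") + " -> " + codePoint(upper("i")));
        // Dotless pair: dotless i (U+0131) <-> I (U+0049)
        System.out.println(codePoint("\u0131") + " -> " + codePoint(upper("\u0131")));
        // Lowercasing the ASCII capital I yields the dotless letter, not 'i'.
        System.out.println(codePoint("I") + " -> " + codePoint(lower("I")));
    }
}
```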
| ithkuil wrote:
| Yes it's hard to come up with a different capital than I
| unless you somehow can see into the future and foresee the
| advent of computers, which the Turkish alphabet reform
| predates.
|
| Of course the Latin capital I is dotless because originally
| the lowercase Latin "i" was also dotless. The dot was
| added later to make text more legible.
| ozgung wrote:
| Nope, we decided to do it the correct and logical way for our
| alphabet. Some glyphs are either dotted or dotless. So, we
| have Iı, İi, Oo, Öö, Uu, Üü, Cc, Çç, Ss and Şş. You see the
| Ii pair is actually the odd one in the series.
|
| Also, we don't have serifs in our I. It's just a straight
| line. So, it's not even related to your Ii pair in English.
| You can't dictate how we write our straight lines, can you.
|
| The root cause of the problem is in the implementation and
| standardization of computer systems. Computers were
| originally designed with only the English alphabet in mind, and
| were patched to support other languages over time, poorly.
| Computers should obey the language rules, not the other way
| around.
| zettabomb wrote:
| >Also, we don't have serifs in our I.
|
| That depends on font.
|
| >So, it's not even related to your Ii pair in English.
|
| Modern Turkish uses the Latin script, of course it's
| related.
|
| >You can't dictate how we write our straight lines, can
| you.
|
| No, I can't, I just want to understand why the Turks
| decided to change this letter, and this letter only, from
| the rest of the standard Latin script/diacritics.
| ayhanfuat wrote:
| Except for the a/e pair, front and back vowels have dotted and
| dotless versions in Turkish: i and ı, ö and o, ü and u.
| zettabomb wrote:
| Makes sense enough, but why not use i and i to be consistent?
| okanat wrote:
| Turkish i/I sounds pretty similar to that of most European
| languages. Italian, French and German pronounce it pretty
| similarly. Also, removing umlauts from the other two vowels ö
| and ü to write o and u has the same effect as removing the
| dot from i. It is just consistent.
| zettabomb wrote:
| No, what I mean is, o and u get an umlaut (two dots) to
| become ö and ü, but i doesn't get an umlaut, it's just a
| single dot from ı to i. Why not make it ı and ï? That
| would be more consistent, in my opinion.
| ayhanfuat wrote:
| This was shortly after the Turkish War of Independence.
| Illiteracy was quite high (estimated at over 85%) and the
| country was still being rebuilt. My guess is they did their
| best to represent all the sounds while creating a one to
| one mapping between sounds and letters but also not
| deviating too much from familiar forms. There were probably
| conflicting goals so inconsistencies were bound to happen.
| o11c wrote:
| In that case they should've used i for consistency.
| nurettin wrote:
| There were actually three! i (as in th[i]s), i (as in ch[ee]se)
| and i which sounds nothing like the first two, it sounds
| something like the e in bag[e]l. I guess it sounded so
| different that it warranted such a drastic symbolic change.
| ithkuil wrote:
| Turkish exhibits a vowel harmony system and uses diacritics
| on other vowels too and the choice to put "i" together with
| other front vowels like "ü" and "ö" and put "ı" together with
| back vowels like "u" and "o" is actually pretty elegant.
|
| The latinization reform of the Turkish language predates
| computers and it was hard to foresee the woes that future
| generations would have had with that choice
| jeroenhd wrote:
| Computers and localisation weren't relevant back in the early
| 20th century. The dotless form existed before the dotted i (in Greek
| script as iota). Some European scholars putting an extra dot on
| the letter to make it stand out a bit more are as much to blame
| as the Turks for making the distinction between the different
| i-vowels clear.
|
| Really, this bug is nothing but programmers failing to take
| into account that not everybody writes in English.
| okanat wrote:
| As a Turkish speaker who was using a Turkish-locale setup in my
| teenage years these kinds of bugs frustrated me infinitely. Half
| of the Java or Python apps I installed never ran. My PHP
| webservers always had problems with random software. Ultimately,
| I had to change my system's language to English. However, US has
| godawful standards for everything: dates, measurement units,
| paper sizes.
|
| When I shared computers with my parents I had to switch languages
| back-and-forth all the time. This helped me learn English rather
| quickly, but I find it a huge accessibility and software design
| issue.
|
| If your program depends on letter cases, that is a badly designed
| program, period. If a language ships a toUpper or a toLower
| function without a mandatory language field, it is badly designed
| too. The only slightly better option is making toUpper and
| toLower ASCII-only and throwing an error for any other character
| set.
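|
| A minimal sketch of what such an ASCII-only conversion could look
| like. `AsciiCase.toLowerAscii` is a hypothetical helper written
| for illustration, not an existing library function:

```java
class AsciiCase {
    // Lowercases A-Z only and rejects anything outside the ASCII range,
    // so the result never depends on a locale.
    static String toLowerAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
            if (c >= 'A' && c <= 'Z') {
                out.append((char) (c + ('a' - 'A')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLowerAscii("INFO"));
        try {
            // U+0130 (dotted capital I) is outside ASCII, so this call throws.
            toLowerAscii("\u0130NFO");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

| This only suits identifiers and other machine-facing strings; it
| deliberately refuses to guess for natural-language text.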
|
| While half of the language design of C is questionable and
| outright dangerous, making its functions locale-sensitive by all
| popular OSes was an avoidable mistake. Yet everybody did that.
| Just the existence of this behavior is a reason I would like to
| get rid of anything GNU-based in the systems I develop today.
|
| I don't care if Unicode releases a conversion map. Natural-
| language behavior should always require natural-language metadata
| too. Even modern languages like Rust did a crappy job of
| enforcing it:
| https://doc.rust-lang.org/std/primitive.char.html#method.to_... .
| Yes, it is significantly safer, but converting 'ß' to 'SS' in
| German definitely has gotchas too.
| arccy wrote:
| use Australian English: English but with same settings for
| everything else, including keyboard layout
| Sesse__ wrote:
| Many Linux distributions provide en_DK specifically for this
| purpose. English as it is used in Denmark. :-)
| fph wrote:
| Denmark doesn't have Euros as currency, unfortunately.
| okanat wrote:
| I live in Germany now, so I generally set it to Irish
| nowadays. Since I like ISO-style enter key, I use UK keyboard
| layout (also easier to switch to Turkish than ANSI-layout).
| However, many OSes now have an English (Europe) locale too.
| 1718627440 wrote:
| > However, US has godawful standards for everything: dates,
| measurement units, paper sizes.
|
| Isn't the choice of language and date and unit formats normally
| independent.
| neandrake wrote:
| There are OS-level settings for date and unit formats but not
| all software obeys that, instead falling back to using the
| default date/unit formats for the selected locale.
| okanat wrote:
| > > However, US has godawful standards for everything: dates,
| measurement units, paper sizes.
|
| > Isn't the choice of language and date and unit formats
| normally independent.
|
| You would hope so, but no. Quite a bit of software ties the
| language setting to the locale setting. If you are lucky, they
| will provide an "English (UK)" option (which still uses miles
| or FFS WTF is a stone!).
|
| On Windows you can kinda select the units easily. On Linux,
| let me introduce you to the journey into the LC_ environment
| variables:
| https://www.baeldung.com/linux/locale-environment-variables .
| This doesn't mean the websites or the apps will
| obey them. Quite a few of them don't and just use LANGUAGE,
| LANG or LC_CTYPE as their setting.
|
| My company switched to Notion this year (I still miss
| Confluence). It was hell until last month since they only had
| "English (US)" and used M/D/Y everywhere with no option to
| change!
| collinfunk wrote:
| > While half of the language design of C is questionable and
| outright dangerous, making its functions locale-sensitive by
| all popular OSes was an avoidable mistake. Yet everybody did
| that. Just the existence of this behavior is a reason I would
| like to get rid of anything GNU-based in the systems I develop
| today.
|
| POSIX requires that many functions account for the current
| locale. I'm not sure why you are blaming GNU for this.
| darkhorn wrote:
| Java; write once, run anywhere, except on Turkish Windows.
| sjrd wrote:
| I am one of the maintainers of the Scala compiler, and this is
| one of the things that immediately jumps out at me when I review
| code that contains any casing operation. Always explicitly specify
| the locale. However, unlike TFA and other comments, I don't suggest
| `Locale.US`. That's a little too US-centric. The canonical locale
| is in fact `Locale.ROOT`. Granted, in practice it's equivalent,
| but I find it a little bit more sensible.
|
| Also, this is the last remaining major system-dependent default
| in Java. They made strict floating point the default in 17 and
| UTF-8 the default encoding in 18; only the locale
| remains. I hope they make ROOT the default in an upcoming
| version.
|
| FWIW, in the Scala.js implementation, we've been using UTF-8 and
| ROOT as the defaults forever.
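|
| One way to see the system-dependent default at work is to switch
| the JVM default locale temporarily. This is a sketch only:
| `withDefaultLocale` is a hypothetical helper, and
| `Locale.setDefault` changes global state, so it is best kept to
| tests:

```java
import java.util.Locale;
import java.util.function.Supplier;

class DefaultLocaleProbe {
    // Runs the action with the JVM default locale switched, then restores the old one.
    static <T> T withDefaultLocale(Locale locale, Supplier<T> action) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(locale);
        try {
            return action.get();
        } finally {
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // No locale argument: the result follows the (Turkish) default locale.
        String implicit = withDefaultLocale(turkish, () -> "TITLE".toLowerCase());
        // Explicit Locale.ROOT: the result is the same on every machine.
        String explicit = withDefaultLocale(turkish, () -> "TITLE".toLowerCase(Locale.ROOT));
        System.out.println(implicit.equals(explicit)); // the two results differ
    }
}
```

| For ASCII input like this, `Locale.US` and `Locale.ROOT` give the
| same result, which matches the "in practice it's equivalent"
| remark above.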
| naniwaduni wrote:
| A stark reminder that _all operations on strings are wrong_.
| esafak wrote:
| Kotlin keywords should be assumed to be English.
___________________________________________________________________
(page generated 2025-10-12 23:00 UTC)