[HN Gopher] In MySQL, use "utf8mb4" instead of "utf8" (2016)
       ___________________________________________________________________
        
       In MySQL, use "utf8mb4" instead of "utf8" (2016)
        
       Author : pcr910303
       Score  : 452 points
       Date   : 2022-01-12 15:21 UTC (7 hours ago)
        
 (HTM) web link (adamhooper.medium.com)
 (TXT) w3m dump (adamhooper.medium.com)
        
       | donatj wrote:
       | Is Medium secretly mining crypto in the background? I have no
       | idea what it is but their site makes my CPU spike and my fans
       | turn on.
       | 
       | https://jdon.at/JTHj2G
       | 
       | Update - here's a video, it's making hundreds of graphql requests
       | a minute...
       | 
       | https://jdon.at/z05ImC
        
       | throw0101a wrote:
       | Note:
       | 
       | > _The utf8mb3 character set is deprecated and you should expect
       | it to be removed in a future MySQL release. Please use utf8mb4
       | instead. utf8 is currently an alias for utf8mb3, but it is now
       | deprecated as such, and utf8 is expected subsequently to become a
       | reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is
       | also displayed in place of utf8 in columns of Information Schema
       | tables, and in the output of SQL SHOW statements._
       | 
       | * https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...
        
         | ghusbands wrote:
         | Maybe they should rename/alias it to brokenutf8, to make it
         | more likely that people will notice a problem.
        
         | ksec wrote:
         | I really wish they do this in a new, major version release. But
         | they dont seems to be keen on releasing 9.0.
        
       | capitainenemo wrote:
       | This is also a security vulnerability because MySQL/MariaDB
       | silently truncate strings at the first invalid character. This
       | can result in data manipulation attacks where a higher level
       | layer validates the complete input as UTF-8 for insertion into
       | the DB, but the database only stores half the string.
       | 
       | https://bugzilla.mozilla.org/show_bug.cgi?id=1253201
        
         | wanderr wrote:
         | Any data in your database that can come from external input
         | should be treated as untrusted and validated before it's used.
         | Otherwise validation bugs or bypasses will result in bad data
         | and exploits that persist beyond the fix. Edit: I'm not arguing
         | against the need to use utf8mb4
        
           | kstrauser wrote:
           | While that's true, I trust PostgreSQL to store exactly what
           | I've asked it to store. At some point, you have to trust
           | _something_ to do its job, or else everything built on top of
           | it is a castle of sand.
           | 
           | Imagine a bug like this in ext4. No one would reasonably
           | contend that the layers on top of it should be validating
           | that the files you write out are the ones you'll read back
           | in. We write unit tests for all kinds of stuff, but we're not
           | _that_ thorough.
        
           | colpabar wrote:
           | Sure, but the issue here is that the default mysql "utf8"
           | encoding is not actually utf8. You can write as many
           | validation layers as you want, but if they are assuming that
           | utf8 actually means utf8, they won't help, and mysql will
           | potentially screw it up when it gets stored.
        
           | capitainenemo wrote:
           | Agreed. The confusing part here as I see it is where
           | validation layer A (correctly) asserts the data is valid
           | UTF-8 and safe then assumes the database persists what it
           | passes to it, since no error is reported.
           | 
           | Then, subsystem B trusts reading the database field (since it
           | passed validation layer A).
           | 
           | Obviously more validation layers can be added, but at this
           | point validation layer C called by subsystem B needs to
           | _know_ what the initial input from layer A is in order to
           | differentiate it from the db value which was manipulated - a
           | rather tricky thing to do sometimes. (I guess you could add a
           | hash to the db to check the db is storing your strings, but
           | really.. come on)
           | 
           | Upgrading to utf8mb4 is probably safer than hoping enough
           | validation layers thrown at it solves the problem.
        
             | wanderr wrote:
             | Yes. I've been burned by this exact issue in the past, not
             | realizing that utf8 was not really utf8 in mysql-land. That
             | is a major, major WTF IMO.
             | 
             | But I've also seen people relying on validation at the time
             | of insertion so many times that I wanted to warn against
             | that, too. Not in argument against the need for utf8mb4.
        
         | evanelias wrote:
         | Silent truncation only occurs if strict sql_mode is disabled.
         | Strict sql_mode has been the default since MySQL 5.7, released
         | over 6 years ago. All prior versions have hit EOL support, so
         | there is literally no supported version of MySQL which has this
         | behavior out-of-the-box.
         | 
         | The topic of MySQL utf8 vs utf8mb4 was discussed in depth just
         | _six days ago_ on HN:
         | https://news.ycombinator.com/item?id=29793916
        
           | Ginden wrote:
           | > Silent truncation only occurs if strict sql_mode is
           | disabled. Strict sql_mode has been the default since MySQL
           | 5.7, released over 6 years ago.
           | 
           | What happens if you upgrade from older version? Are defaults
           | from previous version used, or is it default only on fresh
           | installs?
        
             | evanelias wrote:
             | See bottom of https://news.ycombinator.com/item?id=29910494
             | and my reply https://news.ycombinator.com/item?id=29910863
        
             | johannes1234321 wrote:
             | If you don't have an explicit value set in your my.cnf
             | config and the client sends no explicit value upon
             | connecting you get the current built in default of the
             | version you are using. If you set something somewhere it
             | will be respected.
        
           | kukx wrote:
           | The strict sql_mode is often manually disabled. I bet it is
           | off on most shared hostings.
        
             | evanelias wrote:
             | Sure, but that's not MySQL's fault. MySQL/Oracle has no
             | control over what ill-conceived setting overrides are used
             | by DBaaS providers or shared hosting providers.
             | 
             | There are similarly dumb misconfigurations that can
             | theoretically be made in most hosted open source software.
        
               | capitainenemo wrote:
               | Well, the things it would have been nice for MySQL to
               | have done would have been not implementing it this way in
               | the first place (yes, that's probably not Oracle's
               | fault), and moving utf8 to utf8mb4 a bit more quickly.
               | (good to know that's finally happening).
               | 
               | Also, I gotta say, now I have to look into this strict
               | mode and what else it breaks, when in this particular
               | case a "don't silently truncate strings" flag would be
               | preferable and probably lower impact, since "strict" vs
               | "security flaw mitigation" is not quite the same thing
               | (does such a more granular flag exist?). ... unless
               | that's all the strict mode does.
               | 
               | Hm, and since you brought this up elsewhere. Is this
               | "strict by default" in MariaDB - which is what most
               | (linux) people are (probably) thinking of.
        
               | evanelias wrote:
               | > moving utf8 to utf8mb4 a bit more quickly
               | 
               | I definitely agree!
               | 
               | > now I have to look into this strict mode and what else
               | it breaks
               | 
               | It's a mode to ensure nothing happens silently: illegal
               | data type conversions, attempting to insert invalid
               | values for the data type (too large, too long, invalid
               | characters, etc), attempting to do an INSERT which
               | doesn't specify some columns that lack defaults, etc.
               | With strict mode enabled, all of these things throw hard
               | errors, like in most other relational databases.
               | Generally, it covers the entire class of things that are
               | typically the most common complaints about MySQL on HN.
               | 
               | Interesting wording re: "breaks" -- for practical
               | purposes this is a real concern for existing
               | applications, which may be doing bad things which
               | previously didn't throw errors with strict mode off. Do
               | you want your DB to suddenly fail loudly for these things
               | (strict mode), or do you want your existing application
               | to continue working in these cases? Can't have it both
               | ways.
               | 
               | > Is this "strict by default" in MariaDB - which is what
               | most (linux) people are (probably) thinking of.
               | 
               | MariaDB made it default in 10.2, ~5 years ago.
               | 
               | I don't agree that most Linux people conflate MySQL and
               | MariaDB though. At large US tech companies that use
               | MySQL, they're almost all literally using Oracle MySQL,
               | or a patch-set like Percona Server which directly tracks
               | MySQL, or an internal patch-set that does the same.
               | 
               | In recent years, MySQL and MariaDB have diverged a bit on
               | feature set and focus; each has some interesting features
               | that the other lacks. They are not drop-in replacements
               | for each other.
        
               | capitainenemo wrote:
               | So... while I'm broadly in agreement with you, and a huge
               | fan of strict typing, I will note people use languages
               | like Javascript and usually are ok with things like
               | string to integer coercion.
               | 
               | It seems to be that there's a world of difference between
               | "silently truncate string" and "type coercion" even if I
               | readily agree both are sources of security
               | vulnerabilities.
               | 
               | Thanks for the replies though, even though I clearly
               | annoyed at least 2 people due to the downvotes.
        
               | ars wrote:
               | > and moving utf8 to utf8mb4 a bit more quickly. (good to
               | know that's finally happening).
               | 
               | That happened in 2010 - I'm not sure 12 years counts as
               | "finally happening", or just how much quicker you would
               | like than 12 years ago?
               | 
               | Other than a time machine, I'm not sure what else you
               | would like them to do.
        
               | capitainenemo wrote:
               | I mean the announced plan to make the "utf8" default be
               | utf8mb4 instead of utf8mb3.
               | 
               | See https://news.ycombinator.com/item?id=29910023
        
               | butlerm wrote:
               | They should have made "utf8" have UTF-8 semantics by
               | default as soon as the defect was discovered, not twenty
               | plus years later.
        
               | ars wrote:
               | They couldn't. The storage size is different internally.
               | The article covers this in the section in migrating from
               | utf8 to utf8mb4.
               | 
               | Do you actually use MySQL or are you just complaining
               | from a theoretical POV?
        
               | butlerm wrote:
               | Pretty sure the name "utf8" isn't metaphysically wired to
               | a defective physical representation. All they have to do
               | is wait until the next minor release, add a compatibility
               | name, make the generic name refer to a non-defective
               | representation, and update the documentation and release
               | notes accordingly.
               | 
               | If they were really conservative, they could wait until
               | the next major release to repoint the name like that.
               | Haven't there been four major releases since 2003?
        
               | evanelias wrote:
               | This would be problematic for existing applications. Even
               | if you leave existing binary data as-is (in existing
               | tables utf8 becomes utf8mb3), changing the meaning of
               | "utf8" in only new tables is still very problematic.
               | Think about logical dumps, e.g. CREATE TABLE statements
               | which use the "utf8" alias. For example, any self-host
               | application which runs a .sql file in its install
               | process. Or spinning up a dev environment for a new hire.
               | Basically, anything that contains a CREATE TABLE with the
               | utf8 alias now has a landmine:
               | 
               | In older versions of MySQL, InnoDB indexes were limited
               | to 767 bytes. Meanwhile, VARCHAR(N) can store N
               | _characters_. So with utf8mb3, that 's a max of 3N bytes
               | worst-case; but with utf8mb4 ("proper" utf8), it's now
               | 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4
               | could not be indexed (in its entirety) in these older
               | versions; the CREATE TABLE would fail if you try.
               | 
               | When you have a massive installed base of users, and
               | those users have lots of dumps / install scripts
               | containing "VARCHAR(255) CHARACTER SET utf8", you simply
               | cannot change the meaning of "utf8" in a minor release.
               | 
               | I fully agree that MySQL could have made this transition
               | faster/earlier than they did, but I disagree that it
               | could be done trivially or terribly quickly.
        
               | butlerm wrote:
               | Implementing silent string truncation in the first place
               | (by default(!)) is an excellent indication of how fast
               | and loose and out of touch the original MySQL developers
               | were with the database world. It is the worst database
               | "feature" I have ever seen.
        
               | evanelias wrote:
               | It made more sense at the time. Earlier days of the web,
               | dynamic scripting languages used everywhere, etc. No
               | emoji, barely any use of 4-byte characters at all. Large
               | corps weren't using open source DBs, or sometimes open
               | source anything. Simpler times!
               | 
               | And for sake of comparison -- if I understand correctly,
               | prior to just _2 months ago_ , sqlite didn't even have
               | the option for strict type enforcement [1]. Do you judge
               | sqlite as harshly, or think its creator is out of touch
               | with the database world?
               | 
               | [1] https://www.sqlite.org/datatype3.html
        
               | butlerm wrote:
               | Relational databases have been used in production since
               | the late 1970s. You would be hard pressed to find any
               | that silently corrupted user data like that, ever, until
               | MySQL came along.
               | 
               | Sqlite is more of an embedded database, but if it
               | silently truncated character data under some conditions
               | that would make it equally unusable. Better to have no
               | type enforcement than defective type enforcement.
        
               | evanelias wrote:
               | I think you misunderstood my point. Yes, relational
               | databases have been around a long time. But in the early
               | days of MySQL, most of the companies using MySQL differed
               | quite a bit from the companies typically using DB2,
               | Sybase, Oracle, etc. I've been using MySQL professionally
               | since 2003 and am speaking from first-hand experience
               | here: the web development world was largely separate from
               | the traditional enterprise world at that time. It was
               | considered a _different use case_ , it's not like today
               | where the biggest companies in the world are tech
               | companies built on open source software.
               | 
               | Regarding "silent", MySQL at least emits a warning when
               | this happens. Still not great / obviously wrong in a
               | modern context, but just for clarity's sake it wasn't
               | _completely_ silent.
               | 
               | As for "defective type enforcement", that's a subjective
               | opinion. In C, atoi("123hello456") returns 123 -- once
               | you hit an invalid character, everything from that point
               | is truncated. Is C atoi defective? Perhaps, but it's an
               | opinion.
               | 
               | With strict mode disabled, four-byte characters worked
               | the same way in MySQL utf8mb3: four-byte characters are
               | _invalid_ in this data type, and everything gets
               | truncated at the first invalid character. Would simply
               | stripping the invalid characters, but keeping valid chars
               | after them, necessarily be more  "correct"? No, it's an
               | implementation decision, especially considering that
               | emoji _did not exist at the time_ and barely any non-BMP
               | (4-byte) utf8 characters were used anywhere.
        
               | butlerm wrote:
               | It is certainly true that most users of relational
               | databases before MySQL used them for applications where
               | throwing away data is a potential catastrophe, and most
               | uses of MySQL were for toy applications where if that
               | happened no big deal. It was about a decade before MySQL
               | even supported transactions.
               | 
               | atoi is deficient by modern standards, but at least it
               | clearly documents what the behavior is and always has. It
               | doesn't truncate anything. For a database, truncating
               | data is unforgivable. That is what errors are for, so
               | your database doesn't act as a black hole where your data
               | may silently disappear forever.
               | 
               | And yes, if you can't return an error for some reason,
               | you should preserve as much data as possible. Throwing
               | away hundreds or thousands of characters after an initial
               | anomaly is insane. But a database should never do that
               | anyway, it should either store what you give it, or
               | return an error. Same with a filesystem or any other
               | trustworthy means of data storage. Sorry we were too lazy
               | to tell you we threw your data away isn't acceptable, nor
               | is relying on anyone to check for warnings that should be
               | hard errors.
        
           | purple_ferret wrote:
           | Lots of people turn it off because Warnings are easier to
           | ignore than Errors are
        
           | capitainenemo wrote:
           | That's good to know, however I was encountering this MySQL
           | truncation just last year in COTS software, so clearly
           | there's a long tail. As a related example, log4j 1.x has been
           | deprecated for years, but 100% of atlassian uses it with
           | their own maintained version.
           | 
           | In fact, when the log4j thing came up, we audited all our
           | existing COTS/internal and virtually all of it was still on
           | the "EOL" log4j 1.x.
           | 
           | Since you're familiar with this. What happens with an
           | existing upgrade - is strict mode enabled by a typical distro
           | package upgrade or does it only impact new installs?
        
             | evanelias wrote:
             | For sure, there's a huge long tail. However, if a company
             | is using 6+ year old EOL software that has flaws, that's
             | outside of the vendor's control.
             | 
             | > What happens with an existing upgrade
             | 
             | sql_mode is a setting, so it depends on whether the default
             | is being overridden: e.g. does the my.cnf config file
             | specify a value explicitly? Does your application's
             | connection logic set a session-level override for each
             | connection? DBaaS vendors add an additional wrinkle since
             | they tend to have custom UIs and APIs for global settings.
             | 
             | > typical distro package upgrade
             | 
             | Many Linux distros made it so that attempting to install
             | "mysql" actually installs MariaDB instead, so when going
             | this route, it's unfortunately always very hard to predict
             | what you'll get or what will happen!
        
               | KronisLV wrote:
               | > Many Linux distros made it so that attempting to
               | install "mysql" actually installs MariaDB instead, so
               | when going this route, it's unfortunately always very
               | hard to predict what you'll get or what will happen!
               | 
               | MariaDB is indeed pretty pleasant to use and for most use
               | cases can be a drop in replacement for MySQL, should you
               | not want to use MySQL, but also wouldn't want/need/be
               | able to use PostgreSQL or something else:
               | https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
               | 
               | I might be in the minority here, but i actually rather
               | enjoy MySQL/MariaDB, since MySQL Workbench is still the
               | best DBA software that i've used, especially because of
               | the ER modelling and schema sync functionality, and their
               | approach to schemas/users within a DB feels more
               | reasonable than that of Oracle/PostgreSQL.
               | 
               | That said, both PL/SQL and PL/pgSQL feel more powerful
               | than whatever MySQL/MariaDB has. Also, in regards to
               | MariaDB in particular, their versions have diverged from
               | MySQL somewhat, so you might still want to look into the
               | differences, if migrating over:
               | https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
               | 
               | As for predictability - nowadays i just run specific
               | versions in containers (e.g. 10.7.1 or maybe 10.7 which
               | corresponds to the patest PATCH release), to sidestep the
               | distro differences altogether. It also lets me easily
               | launch new instances, give resource limits to them, have
               | multiple versions in parallel and mount the data
               | directories wherever i need them, for easier backups etc.
        
             | waynesonfire wrote:
             | I'm going to use this as an opportunity to vent.
             | 
             | Wouldn't it be a nice world where you can just declare
             | something EOL and it disappears and somehow magically it's
             | not a problem anymore for anyone end-to-end? Thing's just
             | don't work that way. Like, suddenly, a product that was
             | working perfectly fine for who knows how long suddenly
             | needs to modified because some ass-hats thinks that EOL
             | isn't suitable for production.
             | 
             | There are a number of reasons to upgrade a dependency or a
             | product; an EOL label shouldn't be a part of the calculus.
             | EOL may imply certain conditions that may influence the
             | decision to upgrade but for the sake of EOL you can eat a
             | bag of dicks. Think harder why you're proposing an upgrade
             | is needed and how it'll add value to the business. EOL
             | software is perfectly fine especially because it's been
             | battle tested.
             | 
             | We also use log4j 1.x all over the place and I feel zero
             | need to change that.
        
               | dimitrios1 wrote:
               | It is perfectly reasonable to continue to use EOL
               | software, but you have to assume maintenance costs in
               | that case. It would be irresponsible to continue to use
               | EOL software as is. EOL is a perfectly natural concept.
               | Nothing in this world lasts forever, including our
               | machines, electronics, cars, batteries, lives, and yes,
               | software.
        
               | rkagerer wrote:
               | If software lasted a lifetime, life would be so much
               | easier.
        
               | Dylan16807 wrote:
               | > EOL may imply certain conditions
               | 
               | And "EOL" is a convenient shorthand for those conditions.
               | 
               | Most use cases care about never getting patches for
               | security problems ever again.
        
         | dehrmann wrote:
         | It's like MySQL and PHP were meant for each other.
        
           | anon776 wrote:
           | How dare you. Also accurate.
        
             | chris_wot wrote:
             | PHP has improved a lot.
        
               | JodieBenitez wrote:
               | yeah... it also has a strict mode... that is off by
               | default.
        
         | tyre wrote:
         | Another fun one that came up at my first job: If you run up
         | against the size limit of a text column in MySQL, it silently
         | truncates.
         | 
         | This came up when a developer used YAML for some configuration
         | file. There was all sorts of confounding behaviour because YAML
         | is still valid when truncated.
         | 
         | Every database has limits, but at least alert the user! I'd
         | rather a query fail than alter data.
         | 
         | Edit:
         | 
         | This is something that might be useful in CS classes, to
         | reflect what being a real-world engineer can often entail. Set
         | an assignment with some silly nonsense like this and students
         | have to debug it.
        
           | leksak wrote:
           | If the point of CS education is to produce programmers and
           | software engineers as opposed to computer scientists, yes.
        
             | taftster wrote:
             | What percentage of CS graduates go onto become computer
             | scientists? Seems like vast majority of CS degree holders
             | aren't practicing "pure" computer science, as you would
             | seemingly hope.
        
           | marginalia_nu wrote:
           | Why were you putting YAML in a relational database?
        
             | tyre wrote:
             | iirc it was to allow customization of an e-commerce item.
             | Basically a column as a document (in the sense of a
             | document database) with custom attributes that could
             | be...large.
             | 
             | It wasn't my implementation. This was another team
        
               | int_19h wrote:
               | It's really weird that they've chosen YAML for it, since
               | its sole (questionable) benefit over the alternatives is
               | that it's geared more towards being edited directly by
               | humans... which is generally not something that you need
               | to do with a database field. JSON is more usual for
               | scenarios like this, not the least because most databases
               | actually provide tools to parse and query it.
        
               | marginalia_nu wrote:
               | It's pretty weird to put any configuration-languages in a
               | relational database in general. Large variable-length
               | objects is not what relational databases are good at.
               | You're mostly getting the drawbacks, and very few of the
               | benefits in that use case. Even a filesystem is probably
               | more suitable.
        
               | capitainenemo wrote:
               | Drupal seems keep a large amount of PHP serialisation of
               | configuration in its database (a JSONish thing), probably
               | because it's the standard data store for the site.
               | Noticed this in fixing a dump one time. Got to use vim's
               | function syntax for regex to ensure the character length
               | count of the string matched the value of the new string.
               | 
               | I think virtually everything generically configurable in
               | the UI that isn't part of a small core with dedicated
               | fields (or in the sitewide PHP config include) is handled
               | that way.
        
             | mrweasel wrote:
             | Maybe also do a checksum or something, seriously it could
             | just be CRC32. I get that it's not something you
             | instinctively think to do. For configurations that are most
             | likely parsed directly by something else though, may check
             | that nothing broke in transit.
        
           | cornel_io wrote:
           | There's a setting for this, but like many settings MySQL has
           | (along with InnoDB, the storage engine) historically chosen
           | exceptionally bad defaults. It's possible to tune the thing
           | to be screaming fast and pretty robust, but it takes a lot of
           | expertise and TBH these days most people would be much better
           | off going either NoSQL or Postgres instead.
        
         | capitainenemo wrote:
         | I do find this behaviour rather incomprehensible actually. I
         | understand that if the "utf8" is BMP it can't have invalid
         | characters in it. What puzzles me is that the approach they
         | went for seems to be almost the worst possible one from a
         | security and data integrity perspective.
         | 
         | Assuming a SQL error can't be thrown for some legacy reason
         | (and you'd think a flag to opt into that would be an option),
         | why not replace the invalid parse with U+FFFD like most things
         | do? Or at a bare minimum, only strip the invalid character
         | which at least makes attacks a bit harder and results in a
         | little less data loss.
        
           | [deleted]
        
           | capitainenemo wrote:
           | Just wanted to reply to myself that I just learned from
           | evanelias' comment a "strict mode" flag is indeed an option.
           | I suppose it's time to audit all my mysqls to see which ones
           | have this enabled, and what other mysql-isms this flag might
           | break.
           | 
           | I know it's definitely not enabled on one updated and
           | maintained BSD - this could be because they didn't want to
           | break existing systems during upgrade.
           | 
           | I suspect this problem is still extremely common. But good to
           | know there's some form of defense.
        
           | pshc wrote:
           | Quietly corrupting data and shipping with the wrong default
           | settings is what MySQL is all about! Keeps the consultants
           | busy.
        
             | taf2 wrote:
             | It's also easy to imagine how it happens... const char
             | *buffer = malloc(...; memcpy;... etc... pretty fast pretty
             | straight forward
        
       | PrimeDirective wrote:
       | The author mentioned MariaDB in the intro and the went on about
       | MySQL, not sure which he actually used.
       | 
       | I think the latest recommendation for MySQL was
       | utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.
       | 
       | While I agree about the Postgres recommendation, it's a non-
       | starter in the PHP world, unless doing something from scratch.
       | Even then, the support for MySQL/MariaDB is much better.
        
         | timw4mail wrote:
         | What's that mean, non-starter in the PHP world? Postgres is
         | well supported by frameworks and libraries in PHP, not to
         | mention PHP itself.
        
       | iliketrains wrote:
       | Fun story: When I was working at one of the FAANG companies, I
       | have placed the (Night with stars, it does not seem to render
       | here [0]) character in my preferred name. Some time later, I was
       | notified by HR that I must change my preferred name and I can
       | only use UTF-8 characters with maximum length of 3 bytes. I was
       | quite confused why such specific demand, I am pretty sure it was
       | exactly this issue. I replaced it with tta***tto***
       | 
       | [0] https://emojipedia.org/night-with-stars/
        
         | jiggawatts wrote:
         | Are you called Bobby Tables by any chance?
        
           | urbandw311er wrote:
           | Fantastic response
        
       | brightball wrote:
       | I just went through the process of converting a 2TB MySQL
       | database over the summer. Quite a few gotchas along the way that
       | make the process a huge pain.
        
       | mwattsun wrote:
       | Has anyone ever successfully converted a large legacy
       | Apache/PHP/MySQL site to UTF-8? If so you deserve an award.
       | utf8mb4 is just one issue and not the worst.
        
         | fleddr wrote:
         | At least once, but it's been a while. From the top of my head:
         | 
         | - set encoding at HTML head - also at any HTML form element -
         | Set database collation correctly - I believe it also needs to
         | be done per table - The PHP -> MySQL connection itself may need
         | to have the encoding set - All string manipulation to be done
         | with mb_ - Even ensure your code files themselves are in UTF8
        
         | sam_lowry_ wrote:
         | Yes, but I have chosen utf8_bin as encoding and lived happily
         | with this choice for a dozen years.
        
         | ars wrote:
         | Been there, done that, with a phpbb installation. I manually
         | converted all the non-UTF data in the database to UTF8, and
         | updated the PHP code and database settings.
         | 
         | It worked fine, no major issues, just some work.
        
         | onli wrote:
         | https://github.com/s9y/Serendipity/blob/05f58f90d743fe9ade24...
         | is just the detection function I wrote for a PHP blog engine to
         | detect whether it would even be possible to use utf8mb4 on the
         | system. We completely ran into this issue and didn't know how
         | to handle it for literally years. Lots of blogs to migrate, and
         | who knows whether it worked for all of them...
        
           | mwattsun wrote:
           | The craziest issue I had was I couldn't predict what char
           | encoding the text in my database was in. Most users entered
           | Windows-1252, some text blobs were UTF-16, others were
           | European character sets, and some were UTF-8. Some were
           | Japanese SHIFT_JIS. Don't ask me how any of this happened. I
           | retrospect, I should have dumped all the tables from MySQL
           | and used the excellent PyPy Chardet [1] library to see what I
           | was dealing with, do the conversions and then re-import the
           | data. But then someone could copy UTF-16 from a Windows
           | document and paste it in, so you have to convert going in to
           | the database.
           | 
           | You have set Apache to UTF-8, PHP to UTF-8, MySQL to UTF-8,
           | and the MySQL driver you are using to UTF-8. It's not clear
           | how these setting interact. Are there silent conversions
           | happening or do you always have to detect the encoding on
           | data coming from the server? HTML pages have a character
           | encoding specifier, but the BOM at the start of the file
           | takes precedence (I think.) I got it to work by always
           | detecting encoding for any text coming from the database and
           | using iconv, but this turned out to be really slow and
           | unreliable. It was truly the biggest mess by an order of
           | magnitude than any other programming problem I faced in my
           | career.
           | 
           | Would not attempt again.
           | 
           | [1] https://github.com/chardet/chardet
        
         | charrondev wrote:
         | Technical lead for vanillaforums. It's a huge PHP MySQL app.
         | Probably 10b+ rows in production DBs. We did the the migration
         | in 2017 if I remember correctly.
        
           | mwattsun wrote:
           | Not to take anything away but by legacy I meant late last
           | century sites. VanillaForums was released in July 2006
           | according to Wikipedia, which makes it Web 2.0. Wikipedia
           | also says "UTF-8 has been the most common encoding for the
           | World Wide Web since 2008" I'm glad you posted though because
           | now I know that sites are ok starting in that time frame, not
           | before and then never touch the old ones again.
        
         | usbqk wrote:
         | Sure, you just prepend mb_ to all php string calls >:)
        
         | ospzfmbbzr wrote:
         | Yes. I had the exact same issue with '<' in HTML content stored
         | in the database -- cms pages and also some JS that was stored
         | in the DB and injected into the HEAD of all pages.
        
       | [deleted]
        
       | davidjfelix wrote:
       | Does this even fully fix the problem? It looks like utf8mb4 is
       | limited to 4 byte sequences, but as far as I understand, utf-8 is
       | variable width. Can utf8mb4 even encode the scottish flag
       | https://www.iemoji.com/view/emoji/2476/flags/scotland ?
        
         | zhte415 wrote:
         | Variable width is unlikely to be a problem. At 3:5 the Scottish
         | flag does not have an unusual aspect ratio. This is unlike the
         | flag of Qatar, with a ratio of 10:28, or Nepal with both a 3:4
         | (approximate, not exact) aspect ratio plus an irregular shape.
        
         | jeroenhd wrote:
         | Is this a problem? Flags in unicode are defined by several
         | special characters. The flag of Scotland isn't really a single
         | character, it's "<waving flag><tag g><tag b><tag s><tag c><tag
         | t><cancel>".
         | 
         | All of these characters are multi byte combinations. The hex
         | for the flag is not a single, super wide character, it's
         | 0xF09F8FB4 0xF3A081A7 0xF3A081A2 0xF3A081B3 0xF3A081A3
         | 0xF3A081B4 0xF3A081BF. You might get some weird results if you
         | take substrings from that, but it won't be a problem for the
         | backing database store; each separate "binary character" is a
         | four byte sequence (as denoted by the 0xF at the front of the
         | number).
        
         | jerf wrote:
         | Flags are not single code points. UTF-8 refers to how code
         | points are stored. If you look in your link at "Hex Code
         | Point(s)", it is that first one that would be a problem with
         | "utf8" in MySQL, because of the 1 in the 0x10000 position. The
         | other six code points required would fit in fine.
         | 
         | Unicode is developing more and more things that require code
         | points. I'm not sure what the longest legal non-redundant
         | series of code points that can validly represent a glyph
         | somewhere is, but it's getting up there with all the emoji skin
         | modifiers and such.
        
           | jeroenhd wrote:
           | Emoji modifiers, both for gender and skintone, do not produce
           | that many extra code points. Combinations are made in the
           | same way " + e can combine into e: a code point, followed by
           | a combination code point, followed by a modifier. During text
           | rendering, these code points are converted back into a single
           | glyph.
           | 
           | All UTF-8 codepoints in use today can be encoded with four
           | bytes. Theoretically the Unicode system can be used to create
           | 6 byte code points if that ever becomes necessary, but it
           | won't be for a while. Crossing the 4 byte boundary would also
           | introduce compatibility issues with UTF-16, so I'm sure the
           | Unicode Consortium will do their best to prevent this from
           | happening as long as they can.
        
         | fredoralive wrote:
         | Legal UTF8 is limited to 4 bytes, as Unicode only uses ranges
         | that fit the limits of UTF16.
         | 
         | AFAIK The flags are a weird multi-code point encoding of the
         | ISO country codes, and each individual code point is less than
         | 4 bytes.
        
         | warpspin wrote:
         | The biggest codepoint in Unicode fits into 4 bytes of UTF-8.
         | UTF-8 would allow up to 6 bytes, but those codepoints are not
         | in use currently. If they ever become in use, yes, you'd
         | probably need a new character set again. But then a lot more
         | things will break, as higher codepoints would be incompatible
         | with UTF-16 also.
        
           | ghusbands wrote:
           | UTF-8 only allows 4 bytes, since 2003:
           | https://datatracker.ietf.org/doc/html/rfc3629
        
         | loeg wrote:
         | UTF-8 is variable width. The biggest valid codepoint is
         | U+10FFFF, which has a 4-byte encoding in UTF-8. Other
         | codepoints have 1-, 2-, or 3-byte encodings.
        
         | [deleted]
        
       | urbandw311er wrote:
       | I enjoyed the article right up to the point where it concluded at
       | the very end that the solution is to "switch to postgresql"
       | without actually providing any rationale for that sudden and
       | subjective ending.
        
       | talos2110 wrote:
       | So in mysql utf8 does not mean utf8. Reminds me of iso8601 in
       | php, which does not mean iso8601.
       | 
       | https://www.php.net/manual/en/class.datetimeinterface.php
        
       | rubyist5eva wrote:
       | god I hate mysql so much - long live postgres, one of the
       | greatest pieces of software ever created
        
       | hn_throwaway_99 wrote:
       | Wow, never knew this, this is really bad, especially in this day
       | and age where emojis are so prevalent.
       | 
       | MySQL should deprecate utf8 and give a warning if you try to use
       | it.
        
         | loeg wrote:
         | It is deprecated:
         | 
         | > You should also be aware that the utf8mb3 character set is
         | deprecated and you should expect it to be removed in a future
         | MySQL release. Please use utf8mb4 instead.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
        
           | hn_throwaway_99 wrote:
           | How the heck did someone downvote you? Basically your link is
           | exactly what I was looking for and I'm glad they're making
           | this deprecation.
        
         | daneel_w wrote:
         | Bad idea. If they deprecate "utf8" they will deprecate a
         | standard. It was designed and specified at a time when the
         | Unicode code space had a 21 bit limit. There are some technical
         | (storage space-related) considerations with the suggestion to
         | "just use utf8mb4 everywhere instead" because of how InnoDB's
         | indices work.
        
           | loeg wrote:
           | MySQL "utf8" is still "utf8mb3", which is not a standard
           | anywhere except MySQL. It cannot store the full range of
           | 21-bit code points; I don't know why you keep repeating this.
           | The maximum codepoint in 3-byte UTF-8 is 0xFFFF, which is 16
           | bits.
        
           | hn_throwaway_99 wrote:
           | > If they deprecate "utf8" they will deprecate a standard.
           | 
           | No, they won't. They will be deprecating something they
           | created that _never_ conformed to the UTF-8 standard. That
           | fact that unicode didn 't have codes beyond 21 bits at that
           | point is pretty irrelevant.
           | 
           | It was an invalid implementation from day 1, at least with
           | the name of "utf8".
        
       | daneel_w wrote:
       | This is an old article and it was incorrect when written, and
       | it's incorrect today. The author may possibly have learned since
       | writing it that UTF-8, by design, encodes _up to 21 bits_ of
       | character code point in at most 4 bytes of data. It works and
       | behaves as it should.
        
         | echelon wrote:
         | Well, that sounds great, but I've had to migrate numerous
         | fields to utf8mb4 because we kept getting paged on emoji.
        
           | chris_wot wrote:
           | The OP is wrong about the article being incorrect.
        
           | daneel_w wrote:
           | Same. Inbound SMSes with emojis. Outbound SMSes with emojis.
           | Names and labels with emojis etc. But the implementation of
           | "utf8" still isn't wrong, nor a bug - just like the 20 year
           | old specs of UTF-8, they just failed to take the future into
           | account.
        
         | loeg wrote:
         | UTF-8 was not restricted to U+10FFFF until November 2003. Prior
         | to that, implementers had to assume it could encode up to 6
         | bytes of data.
         | 
         | The "utf8" mode in MySQL can only encode _3 bytes_ of data (up
         | to 16 bit codepoints), which is less than 4 (less than 21 bit
         | codepoints).
         | 
         | "utf8mb4" correctly encodes all UTF-8 codepoints by current
         | standards.
        
           | daneel_w wrote:
           | > _" utf8mb4" correctly encodes all UTF-8 codepoints by
           | current standards._
           | 
           | Yes, all good and known, but not really what the
           | misunderstanding is about.
           | 
           | "utf8" follows the definitions of UTF-8 as they stood at the
           | time, adhering to Unicode's former code space limit.
        
             | butlerm wrote:
             | MySQL's adoption of a utf8mb3 limitation for "utf8" was
             | over seven years obsolete when it was adopted. It would
             | have been obsolete in 1996.
        
       | LinuxBender wrote:
       | Would the DBA's on this thread consider committing code to the
       | MySQL [1] and Postgres [2] tuner scripts that give new DBA's all
       | your learned advise and battle hardening experience? This thread
       | appears to be such an example.
       | 
       | [1] - https://github.com/major/MySQLTuner-perl
       | 
       | [2] - https://github.com/jfcoz/postgresqltuner
        
       | pm90 wrote:
       | Another interesting "issue" is that Case sensitive character sets
       | may lead to unexpected results with aggregations (eg
       | https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/)
       | 
       | MySQL has many such issues which have turned me off from
       | recommending it. It's a shame because it's a solid technology
       | that mostly works and has a really long operational history.
        
       | ipaddr wrote:
       | No one talks about the downsides of changing.
       | 
       | Moving from 255 max characters to 191 or 192 max means a lot of
       | your data needs to be moved into a text fields which means things
       | like this field can't be a primary key.
        
         | Pxtl wrote:
         | I'm used to mssql where nvarchar (the UTF-16 datatype) supports
         | basically unlimited length... you can't make long PKs in mysql?
        
           | ipaddr wrote:
           | You are limited by smaller types ints, dates and chars.
        
       | clon wrote:
       | The plain utf8 covers the "basic multilingual plane"
       | (x0000-xFFFF), so it will get you very far, actually.
       | 
       | In our app, we finally went for utf8mb4 to allow people to enter
       | emoji. As a side "bonus" you will also soon see some clever
       | people entering their names to stand out, such as "John". Note
       | that this is not "John". It is actually a series of mathematical
       | symbols:                 U+1D679   MATHEMATICAL MONOSPACE CAPITAL
       | J       U+1D698   MATHEMATICAL MONOSPACE SMALL O       U+1D691
       | MATHEMATICAL MONOSPACE SMALL H       U+1D697   MATHEMATICAL
       | MONOSPACE SMALL N
        
         | RL_Quine wrote:
        
         | capitainenemo wrote:
         | One fun thing you can do in MariaDB is define a collation based
         | off of the Unicode confusables list. The result of this is that
         | select * from usernames where user = 'John' would still work :)
         | 
         | It also avoids registering both 'John' and 'John'. (changed the
         | o)
         | 
         | Hedgewars user registration does this.
        
           | sam_lowry_ wrote:
           | That's a great feature. I was running a community site once
           | where impersonation became a popular game. So we had dozens
           | of users registering accounts with Cyrillic i,o,a,e etc.
        
             | stragies wrote:
             | For looking up relevant further material, this is called a
             | 'homograph' or 'homoglyph attack'.
             | 
             | It was also popular for site-spoofing by sending links with
             | these cyrillic character to lead people to a fake
             | bank/search_engine page, until that became widely used, and
             | then owners of the tech-stack (browsers, registrars, dns-
             | operators, etc) shut down this attack vector mostly.
        
       | fredoralive wrote:
       | According to the docs, MySQL is a bit more explicit and calls the
       | old "utf8" "utf8mb3" nowadays, and notes that it may be removed
       | at some point. Although "utf8" is still aliased to the footgun at
       | the moment.
       | 
       | https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
        
       | webkike wrote:
       | Genuine question: why would anyone prefer MySQL over Postgres?
        
         | ars wrote:
         | MySQL is much easier to administer and upgrade.
         | 
         | For example I have two versions of Postges running on my
         | machine right now because Debian does not automatically upgrade
         | between them - you have to export your data, and reimport it
         | into the new version.
         | 
         | https://www.pontikis.net/blog/update-postgres-major-version-...
         | 
         | MySQL just works. There are issues with it, but they are not as
         | major as people make them out to be. Sometimes people like
         | "technical purity" over just get the job done.
        
           | _moof wrote:
           | _> MySQL just works_
           | 
           | Right up until one of your users tries to use an emoji,
           | apparently.
        
           | outworlder wrote:
           | Keep in mind that that's a post from 2016. PG has made many
           | improvements since then. Upgrading is a much easier job right
           | now.
           | 
           | Also, it's strange to conflate PG with Debian issues.
        
         | lewiscollard wrote:
         | In the WordPress world, it is the only database.[0] If you do
         | WP, you do MySQL. You may sniff at WP (I certainly do), but it
         | is used for 43% of all websites, which is not a trivial number
         | of deployments, and not a small market.
         | 
         | I suspect it has a large market share in the "started as a
         | trivial PHP program, grew to large application, too committed
         | to MySQL to migrate to something else" market, too. There's
         | that one website with the blue F logo that went this path, but
         | there's probably thousands more.
         | 
         | That's not the same thing as anyone actually preferring it, but
         | that's two possible reasons why there's an awful lot of MySQL
         | deployments out there.
         | 
         | [0] https://codex.wordpress.org/Using_Alternative_Databases
        
         | kstrauser wrote:
         | Job security. I can't think of any technical reasons.
        
           | ars wrote:
           | Other way around - Postges will give you more job security,
           | since it's rarer and takes more time and work to administer.
           | 
           | People use MySQL because it's easy, and they want to spend
           | their effort in other places.
        
             | kstrauser wrote:
             | That's like saying Python is rarer than JavaScript. Even if
             | it's true, it's not actually rare at all.
             | 
             | In the last 6 years, I've spent approximately 2 hours
             | administering PostgreSQL, and I'd be the one who had to do
             | it.
        
             | jiggawatts wrote:
             | It's "easy" until it silently shreds data and gets your org
             | hacked through truncation attacks.
             | 
             | Don't confuse lazy with easy.
        
         | [deleted]
        
       | dark-star wrote:
       | In the past, Unicode was assumed to be 64k of codepoints, so a
       | 3-byte UTF-8 sequence was considered "long enough", especially
       | since there were surrogate pairs for the rare cases where you
       | have to encode higher code points.
       | 
       | Only "recently" have longer UTF-8 sequences (aka. emojis) become
       | widespread enough that this became a problem.
       | 
       | Yes, it could have been avoided if they had allowed arbitrary-
       | length UTF-8 sequences from the beginning, but I can see that
       | they probably just wanted to optimize a bit.
       | 
       | What I don't understand is why they had to create a different
       | encoding (the utf8mb4) instead of just extending the existing
       | utf8 encoding, since 4-byte UTF-8 is obviously backward-
       | compatible with 3-byte UTF-8... (unless they always used 3 bytes
       | for every character, which would be stupid as UTF-8 has been
       | explicitly designed for variable-length encodings)
       | 
       | Bonus: Many filesystems also do not allow 4+ byte UTF 8 code
       | points in filenames. Test your company's file server to see if it
       | allows you to save a file as "(some random emoji).doc". A few
       | very expensive storage systems also have problems with that (and
       | they have the same workaround: convert your filesystem to a
       | different encoding, instead of simply extending the existing
       | encoding to allow 4+ bytes)
        
         | int_19h wrote:
         | > why they had to create a different encoding (the utf8mb4)
         | instead of just extending the existing utf8 encoding, since
         | 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8
         | 
         | Because CHAR and VARCHAR columns have max length specified, and
         | this translates to the corresponding amount being reserved in
         | storage. For a variable-length encoding, this is normally
         | computed assuming the largest possible value for every
         | codepoint.
        
         | bearjaws wrote:
         | Its way more than Emojis, people copy pasting from Excel will
         | inadvertently add UTF8mb4 characters to your database. This is
         | handled extremely poorly by MySQL and results in ugly
         | characters being displayed.
        
         | xigoi wrote:
         | It's not backward-compatible if someone relies on the errors.
         | I'm guessing it's a case of spacebar heating.
        
           | Anthony-G wrote:
           | For those unfamiliar with the spacebar heating reference:
           | https://xkcd.com/1172/
        
         | raphlinus wrote:
         | While non-BMP Unicode has only become popular somewhat recently
         | due to emoji, it's been part of the relevant official specs for
         | a quarter century.
         | 
         | I know you were kinda joking by using scare quotes for
         | "recently", but let me fill that in for people who might be
         | less familiar with the history. Unicode broke out of the 16-bit
         | limit as of version 2.0, released in July 1996, so yeah, a
         | quarter century. The first RFC for UTF-8 (RFC 2044, Oct 1996)
         | explicitly supports UCS-4. As far as I know, there has never
         | been a time where UTF-8 restricted to the BMP (ie 3 bytes) has
         | been in widespread use. In other words, I am questioning your
         | assertion that 3-byte UTF-8 has ever been considered "long
         | enough," except perhaps in sloppy implementations made without
         | much care for following specs.
        
           | chungy wrote:
           | > except perhaps in sloppy implementations made without much
           | care for following specs.
           | 
           | Describes MySQL perfectly ;)
        
         | chungy wrote:
         | > Many filesystems also do not allow 4+ byte UTF 8 code points
         | in filenames.
         | 
         | I can't even think of a single example. Most filesystems just
         | offer 255 8-bit units to filenames, where 4-byte UTF-8
         | sequences are totally a non-issue. ZFS supports utf8only=on,
         | which enforces that filenames do conform to UTF-8, in which
         | case... 4-byte sequences are still not a problem.
         | 
         | What's the filesystem that doesn't allow it?
        
           | brandmeyer wrote:
           | Apple filesystems are infamous for breaking this convention.
           | IIUC, they will support codepoints which expand to 4 bytes in
           | UTF-8, but their implicit normalization rules can trip up
           | programs which expect behavior closer to POSIX norms.
        
         | ghusbands wrote:
         | Almost everything you claim here is wrong. UTF-8 has never been
         | a 3 byte encoding, the spec initially specified [1] an up-to-
         | six-byte encoding after Unicode had already gone past 16 bits
         | [2] and then was reduced to 4 bytes by RFC3629 [3] in 2003.
         | MySQL is the only piece of software that I know of where they
         | invented their own length. (Though a lot of software often
         | failed to validate it at all.)
         | 
         | There's nothing recent about it - it was 22 years ago! There's
         | no significant optimisation advantage in the different length,
         | and you can see from the patch that reduced the max-length [4]
         | that it wasn't about optimisation. I don't think you can name a
         | single file system that restricted UTF-8 to three bytes.
         | 
         | [1] https://datatracker.ietf.org/doc/html/rfc2044 [2]
         | https://unicode.org/faq/utf_bom.html [3]
         | https://datatracker.ietf.org/doc/html/rfc3629 [4]
         | https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...
        
           | dark-star wrote:
           | It is you who does not know what he's talking about:
           | 
           | > UTF-8 has never been a 3 byte encoding
           | 
           | I never claimed that
           | 
           | > There's nothing recent about it
           | 
           | The non-BMP characters are "recent" because 10 years ago the
           | non-BMP was not allocated except for some small areas. Also I
           | said it "became popular recently", due to emoji. Before that,
           | non-BMP codepoints were rarely used
           | 
           | > I don't think you can name a single file system that
           | restricted UTF-8 to three bytes.
           | 
           | WAFL[1] (unless you "format" it as utf8mb4, which was only
           | implemented a few years ago...)
           | 
           | [1]
           | https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-
           | fa...
        
       | billpg wrote:
       | Dear databases, please don't get hung up about string lengths
       | when dealing with UTF8.
       | 
       | If I ask for a UTF8 string with a max-length of 100, please don't
       | apply the worse case scenario and allocate space for 100 emojis.
       | Please give me a box of 100 bytes and allow me to write any UTF-8
       | string that can fit into 100 bytes in there.
       | 
       | 100 ASCII characters. 20 emojis. Any mixture of the two.
       | 
       | If I ask for UTF-8, it'll be because I'd like to make advantage
       | of UTF-8 and I accept the costs. If that means I can't quickly
       | jump to the character at index 84 in a string, no problem, I've
       | accepted the trade-off.
        
         | grogers wrote:
         | Yes! This and other charset oddities almost make me want to
         | store all strings as varbinary and handle any conversions
         | to/from a string type to utf8 bytes client side. It makes ad-
         | hoc query access from a shell much more annoying though.
        
           | billpg wrote:
           | I wonder if you can do LIKE with a VARBINARY value.
           | 
           | "SELECT * FROM Users WHERE Name LIKE @namePattern"
           | @namePattern = UTF8.GetBytes("%billpg%")
        
         | int_19h wrote:
         | You're literally asking them to break the SQL standard. The
         | meaning of N in CHAR(N) is the number of characters.
        
         | petergeoghegan wrote:
         | Why would the database "allocate space" unless it was truly
         | necessary, in any case? While char(n) works that way, most
         | applications don't actually use it, no matter the underlying DB
         | system.
        
         | mikl wrote:
         | PostgreSQL recommends you use the text type everywhere, and use
         | CHECK constraints if you really really really need to limit a
         | field's length on the database level. (VAR)CHAR is considered a
         | code smell. So many fields are artificially limited to 255
         | bytes because that used to make a performance difference on
         | MySQL 15 years ago.
        
         | dchest wrote:
         | More like "100 ASCII characters. Unknown number of emojis."
         | 
         | You'll get codepoints truncated if you use bytes.
        
           | billpg wrote:
           | I would hope a database engine would throw an error instead
           | of truncating bytes.
        
       | karsinkk wrote:
       | This might be a tad unrelated to the original post, but the
       | following article is one of my favorite primers on Character
       | sets/Unicode :
       | 
       | https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
        
         | superasn wrote:
         | Excellent article, even though I already knew most of it, given
         | it was written in 2003 must have been one of the most
         | insightful and enlightening article written.
         | 
         | Wonder if you have any more links like this? Please do share!
        
       | cblconfederate wrote:
       | will something terrible happen if i just convert my tables to
       | utf8mb4? Will i have to upgrade each column?
        
       | smarx007 wrote:
       | Of course, nobody "refused" to fix a "bug". Instead, a non-
       | conformant behavior was already relied upon by legacy systems out
       | in the wild and the "fix" was added in a backwards-compatible
       | way.
       | 
       | Edit: Three bytes are enough to fit nearly any of the chars in
       | use in any language, including Chinese and Japanese, so I can
       | only assume someone "smart" in the MySQL dev team decided to
       | "save space" (before emoji were a thing).
        
         | CountSessine wrote:
         | _Three bytes are enough to fit nearly any of the chars in use
         | in any language, including Chinese and Japanese_
         | 
         | With only 3 bytes you'll completely miss plane 2, the
         | "Supplementary Ideographic Plane" which includes tons of
         | Chinese-Japanese-Korean Han characters.
         | 
         | I wish people would stop saying the supplementary characters
         | are just for "emoji". Asian unification was very controversial
         | and ultimately unsuccessful. Plane 1 and Plane 2 are important,
         | especially if you're going to sell software or products in
         | China or Japan where they are mandated.
        
         | cblconfederate wrote:
         | couldnt emoji fit in 3 bytes as well? I don't think there are
         | that many ...
        
           | rakoo wrote:
           | utf8 is an encoding for unicode codepoints. Those codepoints
           | are spread on a space that is extremely vast (up to ~4
           | billion) for which that can be represented with up to 4
           | bytes. It turns out emojis are positioned in a place where
           | the first byte will never be 0, so even if there were only
           | one it would require the full 4 bytes to encode them.
        
             | the_mitsuhiko wrote:
             | > Those codepoints are spread on a space that is extremely
             | vast (up to ~4 billion)
             | 
             | You are off by a lot. The maximum code point is about 21bit
             | high (0x10FFFF). The space is only 1.1 million large.
        
               | capitainenemo wrote:
               | It could be he's thinking of the historical definition
               | which included support for 5 character encoding with a
               | maximum codepoint of U+7FFFFFFF or ~2 billion.
               | 
               | https://en.wikipedia.org/wiki/UTF-8#History
               | 
               | That was restricted I believe primarily for compatibility
               | with more limited encodings like UTF-16.
               | 
               | I guess it's possible that at some future point in human
               | history when UTF-16 has been purged from memory, the 5
               | character encoding might be allowed again. :)
        
               | ninkendo wrote:
               | Related: time to explain UTF-8! (In case anybody is
               | curious. I personally think it's extremely clever and
               | worth understanding:)
               | 
               | Characters <128 are encoded with a single byte: 0xxxxxxx
               | 
               | Characters >128 are encoded with multiple bytes.
               | 
               | A two-byte character looks like:
               | 
               | 110xxxxx 10xxxxxx (11 useful bits, representing code
               | points 128-2047)
               | 
               | A three-byte character looks like:
               | 
               | 1110xxxx 10xxxxxx 10xxxxxx (16 useful bits, representing
               | code points 2048-65535)
               | 
               | A four-byte character looks like:
               | 
               | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 useful bits,
               | representing code points 65536-2097151)
               | 
               | Now, technically this scheme could expand to 6-byte
               | characters without getting confused with things like
               | BOM/etc, however any code points larger than 2^21
               | wouldn't be representable in UTF-16, which has its own
               | set of constraints. This means the unicode consortium has
               | basically limited themselves to two million or so
               | possible code points, which is why UTF-8 doesn't need to
               | go more than 4 bytes. (I wonder if a future unicode
               | version will require a larger limit and would thus create
               | a new "utf8mb6" scheme, and drop UTF-16 altogether?)
        
           | _moof wrote:
           | Almost all emoji are four bytes in UTF-8.
        
             | cblconfederate wrote:
             | were all the other positions taken?
        
         | ak217 wrote:
         | That's a terrible excuse. MySQL should have fixed this in a
         | major release. I had to work with a production system that had
         | all kinds of issues because of this bug (engineers assumed,
         | with good reason, that UTF-8 meant UTF-8, when it did not).
         | 
         | This kind of reasoning is how we end up with vulnerabilities
         | like the recent one in Log4j. Just because a behavior made
         | sense in the past, or an unfortunate bug made it into
         | production, is no excuse to let it inflict damage in
         | perpetuity.
        
           | Arch-TK wrote:
           | I don't see how introducing a new major release would fix
           | this? people would use the old version (because of the
           | breaking change) for a while still, you might even end up in
           | a python2/3 situation.
        
         | wruza wrote:
         | Then they should have considered implementing utf-8mb3le
         | surrogate pairs. What a missed opportunity!
        
         | _moof wrote:
         | Even that doesn't make any sense, because refusing to encode
         | characters that require four bytes doesn't save any space; it
         | just makes it impossible to encode those characters. Nothing
         | about the other encoding lengths changes.
         | 
         | The only thing I can figure is that something somewhere is
         | using a 16-bit quantity for decoded codepoints. Four-byte
         | encodings are for codepoints above FFFF. (Which I guess is
         | still someone's idea of "saving space.")
         | 
         | Edit: Apparently the max encoding length used to be six bytes,
         | so there's literally no plausible explanation for this that
         | doesn't end with "thank god I stopped having to deal with MySQL
         | over a decade ago."
        
       | NelsonMinar wrote:
       | In 2022 with a new release of MySQL or MariaDB, is it still
       | possible to create a database with "utf8" encoding? Does it print
       | a warning?
        
       | maxfurman wrote:
       | Let's say for the sake of discussion that your MySQL db has a lot
       | of tables encoded as "utf8." Are there any known drawbacks or
       | gotchas to converting them en masse to "utf8mb4"? Is this a
       | lengthy operation?
        
         | kerblang wrote:
         | The author of original article links to this one in their post:
         | 
         | https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
         | 
         | And yeah, there's gotchas in there, the main one being that
         | strings that barely fit now won't fit, and also maximum lengths
         | on indexable strings.
        
         | chris_wot wrote:
         | I can't imagine there would be too many, other than it is a
         | tedious operation not easily automated. Depending on your data,
         | might take a while.
         | 
         | Can't believe that neither MySQL or MariaDB has created a
         | utility to do the conversions for you automatically.
        
         | spookthesunset wrote:
         | Schema changes to large tables are always huge pain in the ass
         | in MySQL. Every team I've worked on goes to great lengths to
         | avoid changing the schema, which leads to all kinds of exciting
         | anti-patterns like recycling old fields even if their name no
         | longer matches what is stored in them.
        
           | sshine wrote:
           | I suspect this is because for large databases, schema changes
           | lock the tables for extended periods of time, and if this
           | time exceeds a reasonable time for a service window, you need
           | good database replication. And you probably don't have good
           | MySQL database replication unless you're on AWS or similar.
        
             | spookthesunset wrote:
             | That's exactly what it is. If you have a big table you are
             | gonna have a pretty long outage while you update that
             | schema. Like in the last place I worked our largest table
             | would take more than an hour for a schema update. So we
             | never did schema updates.
        
       | Animats wrote:
       | That was set up when Microsoft and Java had standardized on
       | UTF-16. So this can represent the UTF-16 subset of Unicode, which
       | is Unicode Plane 0, the Basic Multilingual Plane (BMP). The
       | higher-numbered "astral planes" of UTF-8 were rarely used. All
       | modern languages with a significant user base are covered in
       | Plane 0. Plane 1, the Supplementary Multilingual Plane, with
       | Cretan Linear B, Egyptian hieroglyphics, and such, was seldom
       | needed. Few people had fonts for those, anyway.
       | 
       | Because of the way UTF-8 is encoded, it takes 3 bytes to
       | represent the UTF-16 set. That's because it only takes one byte
       | for ASCII characters. Hence, 3-byte MySQL data.
       | 
       | Emoji, though, were put in Unicode Plane 1. That's where mass
       | demand for the astral planes came from. More four byte UTF-8
       | characters started showing up in data.
        
         | tremon wrote:
         | A UTF-16 "subset" of Unicode doesn't exist. all UTF-* encodings
         | are just that: _encodings_. They can all represent the entire
         | Unicode character set. (Except for UTF-7, which never was an
         | official standard).
         | 
         | What you're referring to is UCS-2 [0], which is UTF-16 without
         | support for the high surrogates that make up the rest of the
         | Unicode character set. But to imply that UTF-16 is a subset of
         | Unicode is just not true.
         | 
         | [0] https://en.wikipedia.org/wiki/UCS-2
        
           | Dylan16807 wrote:
           | That they can all represent all of unicode is true, but
           | they're not "just" encodings either. The current code space
           | of unicode is _built around_ UTF-16. That 's why there are 17
           | planes.
        
         | light_hue_1 wrote:
         | There's so much wrong with this answer. It completely confuses
         | Unicode, with Unicode encodings, and with non-Unicode
         | encodings.
         | 
         | > So this can represent the UTF-16 subset of Unicode
         | 
         | There is no UTF-16 subset of Unicode. UTF-16 has surrogate
         | pairs, it can represent all of Unicode. You are talking about
         | UCS-2, a format which has no surrogate pairs.
         | 
         | > That was set up when Microsoft and Java had standardized on
         | UTF-16
         | 
         | Java standardized on UCS-2 initially (which is what you're
         | talking about). In the mid 2000s they moved to actual UTF-16.
         | 
         | > The higher-numbered "astral planes" of UTF-8
         | 
         | "astral planes" are not a UTF-8 concept. They're a Unicode
         | concept. A Unicode code point has 6 hex digits. The plane is
         | just the first 2 of those 6 digits. So code points 0000- FFFF
         | are in the BMP (Basic Multilingual Plane), etc. People nickname
         | planes above the 00 plane, BMP, astral planes.
         | 
         | > Plane 1 ... was seldom needed. Few people had fonts for
         | those, anyway.
         | 
         | So "rarely", that the majority of humanity writes in languages
         | that aren't in BMP. CJK (Chinese, Japanese, Korean and
         | sometimes Vietnamese) languages need characters outside of BMP.
         | 
         | > Emoji, though, were put in Unicode Plane 1. That's where mass
         | demand for the astral planes came from. More four byte UTF-8
         | characters started showing up in data.
         | 
         | That's totally untrue. The demand always existed by the
         | majority of humanity, you just never cared about this massive
         | problem until it impacted you.
        
           | Dylan16807 wrote:
           | > That's totally untrue. The demand always existed by the
           | majority of humanity, you just never cared about this massive
           | problem until it impacted you.
           | 
           | I think you swung so far in the other direction you also
           | landed in untrue territory. Originally the intent was to
           | encode all characters in "modern use" and 16 bits was
           | probably enough to do that for the entire world.
           | 
           | Also CJK is about one quarter of the world population.
        
         | loulouxiv wrote:
         | UTF-16 can represent the full range of Unicode codepoints by
         | using couples of surrogates
        
           | mort96 wrote:
           | You're right. Replace UTF-16 with UCS-2 and the comment
           | sounds at least slightly more correct.
        
           | Animats wrote:
           | Sort of. Applications using UTF-16 have to be aware of pairs
           | at the application level. Many are not.
        
             | int_19h wrote:
             | This isn't a consequence of using UTF-16 as such - Java,
             | .NET etc could totally have an API around UTF-16 strings
             | that handles surrogate pairs. The problem, rather, is that
             | those languages introduced a 16-bit type that they called
             | "character", even though it wasn't even a Unicode
             | codepoint. And then used that type throughout all string
             | APIs, including strings themselves (indexing etc).
             | 
             | In .NET land you're now supposed to use
             | https://docs.microsoft.com/en-
             | us/dotnet/api/system.text.rune instead. It transparently
             | handles surrogate pairs, so the app needn't be aware of
             | anything - and yet the internal encoding is still UTF-16.
        
       | ninju wrote:
       | This HN posting should be marked with (2016)
       | 
       | Is the issue still an issue?
        
         | Deukhoofd wrote:
         | Yes: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-
         | utf8...
        
         | dang wrote:
         | Year added above. Thanks!
        
         | loeg wrote:
         | Yes and yes.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
         | 
         | > The utf8 Character Set (Alias for utf8mb3)
        
       | oleg_antonyan wrote:
       | Thanks to emoji (almost)everything now supports unicode
        
       | leros wrote:
       | In addition to emojis, I've also had issues with certain
       | characters copied from Microsoft Word being 4 bytes.
        
       | sshine wrote:
       | In MySQL, if you want to make sure that your Unicode fields are
       | mangled, never use 'utf8', use 'utf8mb3'. Since 'utf8' is an
       | alias to 'utf8mb3', that alias might eventually be updated to
       | point to 'utf8mb4' which won't mangle your characters.
       | 
       | The only safe choice here is to explicitly use 'utf8mb3'.
        
       | didip wrote:
       | MySQL is very similar to PHP in this regard. The good function
       | name is actually broken so a new function with similar name is
       | needed. This creates confusion for new people when adopting the
       | technology.
        
         | egeozcan wrote:
         | Oh the good old "mysql_real_escape_string"[0] days...
         | 
         | I forgot most of my PHP knowledge but that somehow got stuck.
         | 
         | [0]: https://www.php.net/manual/en/function.mysql-real-escape-
         | str...
        
           | duskwuff wrote:
           | You can blame MySQL for that one. The function name is
           | straight out of the MySQL C API:
           | 
           | https://dev.mysql.com/doc/c-api/8.0/en/mysql-real-escape-
           | str...
        
       | fareesh wrote:
       | I decided to build all my new projects on postgres a few years
       | ago. I am often reminded of why that was a good decision.
        
       | [deleted]
        
       | pkrumins wrote:
       | Does anyone know if utf8mb4 is variable length or does each
       | character take full 4 bytes?
        
         | butlerm wrote:
         | It is variable length, by definition. If it were four bytes per
         | code point it would be named something more like UCS-4 or
         | UTF-32.
        
           | pkrumins wrote:
           | Thanks!
        
       | [deleted]
        
       | kolaente wrote:
       | Also relevant about this:
       | https://blog.koehntopp.info/2022/01/12/utf8mb4.html
        
       | patrickcteng wrote:
       | I've had to "upgrade" a couple to utf8, and varchar(255) bites me
       | all the time.
        
       | qwerasdf0987 wrote:
       | I wrote about supporting emojis in Ruby on Rails / MySQL here:
       | 
       | https://josephecombs.com/2018/05/06/how-to-support-emojis-wi...
        
       | bambax wrote:
       | > _If you need a database, don't use MySQL or MariaDB. Use
       | PostgreSQL._
       | 
       | Sure. But for many use cases, SQLite is enough.
        
       | ashvardanian wrote:
       | Wow! I was working on this issue in our DBMS product today!
       | 
       | Fun suggestion, try making a JSON string with a NULL character
       | somewhere in the middle. It will be encoded as \u0000 and is a a
       | valid UTF-8 code, but most C based systems will truncate the
       | string by estimating its length via strlen.
       | 
       | Java community and some other software vendors designed the
       | Modified UTF-8, which replaces the zero with a 2-byte code point.
       | Sleek. Aside from the fact, that you are modifying the data that
       | customer wants to stay consistent.
       | 
       | Postres explicitly bans such cases in the VARCHAR, not sure if it
       | can fit in their JSON columns. Who tried?
        
         | cerved wrote:
         | Null-bytes are not valid characters in JSON strings. Nor any
         | other control characters for that matter
        
       | lsllc wrote:
       | A "hello devops" article from Jan 3, 2022 discussing this topic
       | was posted by /u/flokoe a week ago:
       | 
       | https://www.hellodevops.blog/posts/database-character-sets-a...
       | 
       | And the ensuing discussion:
       | 
       | https://news.ycombinator.com/item?id=29793916
        
       | CountSessine wrote:
       | That's kind of funny - Oracle has a similar issue. If you want
       | industry-standard UTF-8, you have to specify "AL32UTF8" as your
       | encoding. "UTF8" is kind of crazy - it's this monstrous
       | abomination called CESU-8 (https://en.wikipedia.org/wiki/CESU-8),
       | which isnt' UTF-8 at all - it's actually this weird "UTF-16
       | complete with surrogate pairs wrapped in a UTF-8 shell" thing.
        
       | kyralis wrote:
       | > Database systems have subtle bugs and oddities, and you can
       | avoid a lot of bugs by avoiding database systems.
       | 
       | Wat? That is... not a great takeaway from this experience. "Oh,
       | this library had bug reported in it! I know, that means I should
       | just write my own, because clearly they're all buggy and my
       | software is always perfect."
        
         | jerf wrote:
         | Well, I mean, the good news is that if you write your own
         | database system, you'll have so many _other_ bugs in the way
         | that you 'll likely never reach this particular one!
         | 
         | I suggest it can be called the Sirius Database.
         | 
         | "It is very easy to be blinded to the essential uselessness of
         | [the new database you just wrote] by the sense of achievement
         | you get from getting them to work at all. In other words--and
         | this is the rock solid principle on which the whole of the
         | Corporation's Galaxy-wide success is founded--their fundamental
         | design flaws are completely hidden by their superficial design
         | flaws."
        
           | piyh wrote:
           | >their fundamental design flaws are completely hidden by
           | their superficial design flaws
           | 
           | That hit me right in my 9-5 "not invented here" nightmare.
        
         | acdha wrote:
         | It's especially specific to MySQL -- switching to Postgres is
         | great because in addition to better performance and features
         | there are all of these rakes in the grass you can forget about.
         | Unicode just works, you never lose data because someone didn't
         | scrupulously check the warnings after every query, etc.
        
           | Thaxll wrote:
           | PG is actually overall slower.
        
             | acdha wrote:
             | In my experience that's only true for very simple queries
             | which play to MySQL's strongest points, not complex ones or
             | under high transaction volume. I haven't found that to be a
             | good win since the 2000s because caches usually soak up the
             | easy queries and as your app gains complexity you hit
             | performance cliffs which are harder to deal with,
             | especially without some of the more advanced SQL features.
             | MySQL 8 seems to have improved in this regard.
        
           | tehbeard wrote:
           | Instead you run face first into the thread per connection
           | limit rake...
        
           | stjohnswarts wrote:
           | I love that expression "rakes in the grass", it brings back
           | childhood memories of quicksand, stop/drop/roll, and stepping
           | on rakes as being things I'd always have to watch out for as
           | an adult, plus all the loony toon cartoons who used it.
        
           | TedDoesntTalk wrote:
           | Doesn't Postgres have its own set of unique of problems?
        
             | acdha wrote:
             | I've encountered far fewer and nothing treacherous like the
             | "you lost data but we didn't tell you about it" stuff MySQL
             | is famous for.
        
             | mschuster91 wrote:
             | Indeed, the worst of which is you can't do in-place
             | upgrades - upgrading cluster versions is an insanely
             | complex ride.
        
               | onphonenow wrote:
               | I routinely upgrade versions - on AWS RDS this is just a
               | click or two in my case.
               | 
               | https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/US
               | ER_...
               | 
               | When you say "complex ride" what does that mean?
        
               | jrockway wrote:
               | I am not sure if that procedure covers major versions (13
               | -> 14 for example), but if it does, they are ahead of the
               | curve. On GCP, you have to backup the database and
               | restore the backup to a new database. On self-hosted
               | installs, you have to pick one of these processes:
               | https://www.postgresql.org/docs/14/upgrading.html
        
               | acdha wrote:
               | It does. It's really easy to outsource.
        
               | AlisdairO wrote:
               | To confirm, RDS does support in place major version
               | upgrade for postgres.
        
               | onphonenow wrote:
               | They do an in_place upgrade using pg_upgrade I believe.
               | They do have to regenerate system tables - they also
               | automate I think some updates to system statistics. But
               | for very large databases this really is the way to go I
               | think.
               | 
               | They do suggest a backup before you run this in case
               | there are problems, most folks have periodic backups
               | anyways, but the backup is not used.
        
               | mschuster91 wrote:
               | > When you say "complex ride" what does that mean?
               | 
               | On managed postgresql services like RDS it's easy, yes,
               | because Amazon does all of the nasty work for you and has
               | already seen all the kinks you may run into.
               | 
               | If you are on your own and using Docker, you essentially
               | need a third-party image that has the old and new
               | binaries (https://github.com/tianon/docker-postgres-
               | upgrade). Good f...ing luck if you dare run PostgreSQL
               | inside a Kubernetes environment, because adapting that
               | guide to _that_ is a can of worms large enough to make
               | anyone squirm.
               | 
               | If you are on your own and are using distribution
               | packages, it's more complicated
               | (https://blog.samuel.domains/blog/tutorials/from-stretch-
               | to-b...).
               | 
               | If you are on your own and use postgres from source,
               | well... have fun, there's a reason why I prefer using
               | distribution-maintained packages when possible.
        
               | paulryanrogers wrote:
               | Fun fact InnoDB in place updates have plenty of caveats
               | too: must be last column, cannot have non-default on-
               | update/delete clauses. And if you workaround with any
               | table-rename solutions like pt-online-schema-change then
               | you might crash your server. (Some MySQL 8 versions crash
               | on table renames.)
               | 
               | EDIT: Oops, guess you meant version updates not DDL.
               | Leaving comment for posterity.
        
             | iruoy wrote:
             | Postgres doesn't have a way to change the order of columns.
             | Of course you could make a view, but my code doesn't use a
             | view and I don't really want to make one just for when I
             | want to check the database.
        
               | mikl wrote:
               | If that's the biggest problem you have with your
               | database, be happy.
        
               | corrigible wrote:
               | Wouldn't a single application of a temp table be
               | sufficient?[0]
               | 
               | 0: https://brianchildress.co/reorder-columns-postgres/
        
         | time0ut wrote:
         | This is a pretty uncharitable reading considering the very next
         | line is a recommendation to use PostgreSQL if you need a
         | database.
        
         | dang wrote:
         | " _Please don 't pick the most provocative thing in an article
         | or post to complain about in the thread. Find something
         | interesting to respond to instead._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | dagmx wrote:
         | A) it's an obvious joke B) even if you take it literally, they
         | never suggest implementing something new
         | 
         | You're misreading this on multiple levels
        
         | cbg0 wrote:
         | I'm pretty sure you missed the joke.
        
       | wonder_er wrote:
       | a while ago, in my first engineering job, I was tracking down a
       | strange bug related to chinese character sets.
       | 
       | Turns out some of the db tables were using utf8 and latin1;
       | changing them to `utf8mb4` fixed it right up!
       | 
       | https://josh.works/troubleshooting-chinese-character-sets-in...
       | 
       | It was a fun bug to work on. I learned a lot about character
       | encoding, and enjoyed bringing a refined "process" to the table.
        
       | Dylan16807 wrote:
       | Also make sure mysqldump is using utf8mb4 or you'll get character
       | replacement in your backups. This often needs to be configured
       | separately.
        
       | TedDoesntTalk wrote:
       | > Back in 2002, MySQL gave users a speed boost if users could
       | guarantee that every row in a table had the same number of bytes.
       | To do that, users would declare text columns as "CHAR"
       | 
       | This database type has existed in many databases since at least
       | the 1980s
        
       | mrcarruthers wrote:
       | And here's the commit that changed it:
       | https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e....
       | It was originally set to max 6 bytes but for some reason someone
       | dropped it to 3.
       | 
       | They also seem to be slowly phasing it out. Internally it's
       | utf8mb3 and utf8 is just an alias. The idea is to eventually make
       | utf8 an alias to utf8mb4.
        
       | jmnicolas wrote:
       | It's because of things like that it's still hard for me to
       | respect MySQL in 2022.
       | 
       | I use Postgres since 2008. It has never bitten me even once.
        
         | ars wrote:
         | For something that was fixed 12 years ago?
         | 
         | If Postgres has never bitten you, then you aren't using it
         | much.
        
           | mikl wrote:
           | It's not fixed. MySQL still has a datatype called "utf8"
           | that, in fact, is not utf-8 but a limited subset thereof.
           | 
           | PostgreSQL has some sharp edges, sure, but a lot less obvious
           | footguns like these.
        
       | tapoxi wrote:
       | Maybe this is a cynical take - but we used https://pgloader.io/ a
       | few years ago to migrate to Postgres, and have never been
       | happier. MySQL has a lot of stupid decisions like this.
        
         | gjs278 wrote:
        
         | cakoose wrote:
         | Specifically about UTF-8 in Postgres: "PostgreSQL collation is
         | a massive footgun"
         | https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...
         | 
         | Postgres defaults to the system's collation for UTF-8, which
         | caught me by surprise. Now I'm stuck with a production DB where
         | text prefix queries, which should be a perfect use of a btree,
         | are a full table scan. Probably gonna have to take some
         | maintenance downtime to fix the situation :-\
         | 
         | And more generally about operating Postgres in production:
         | https://blog.nelhage.com/post/some-opinionated-sql-takes/
         | 
         |  _As for Postgres, I have enormous respect for it and its
         | engineering and capabilities, but, for me, it's just too damn
         | operationally scary. In my experience it's much worse than
         | MySQL for operational footguns and performance cliffs, where
         | using it slightly wrong can utterly tank your performance or
         | availability. In addition, because MySQL is, in my experience,
         | more widely deployed, it's easier to find and hire engineers
         | with experience deploying and operating it. Postgres is a fine
         | choice, especially if you already have expertise using it on
         | your team, but I've personally been burned too many times._
        
           | jtc331 wrote:
           | What else would they reasonably default to? Are you proposing
           | requiring collations to always be specified (a breaking
           | change from the SQL spec)?
        
           | seanw444 wrote:
           | I think the open-source nature is a major factor in the
           | reason it's chosen so much. That's why I roll it.
        
           | Arch-TK wrote:
           | What do you use instead?
        
         | dijit wrote:
         | I'm a big posgresql fan, but I'd maybe argue that you're making
         | tradeoffs. PG has weird behaviour sometimes too.
         | 
         | I don't think it's always smart to just change out your data
         | layer.
         | 
         | But, if you're starting a new project, I do think PG is one of
         | the better options and tends to follow the principle of least
         | surprise. (hence: big fan)
        
           | leetrout wrote:
           | As a huge Postgres fan working at a shop using MySQL I'll
           | point out the first thing I saw MySQL does that Postgres
           | doesn't:
           | 
           | Index hints.
           | 
           | If you aren't hitting an index in Postgres you have to dig in
           | to table stats and figure out what is wrong but MySQL gives
           | you more control.
           | 
           | However, I would still rather work with Postgres AND have to
           | juggle a connection pooler than deal with MySQL. Transactions
           | on DDL are _great_ and the ability to use foreign keys across
           | partitioned tables is how it should be.
        
             | samcrawford wrote:
             | > I would still rather work with Postgres AND have to
             | juggle a connection pooler
             | 
             | Is this comment relating to the overhead of idle
             | connections, which has historically necessitated the use of
             | a pooler in front of PG? If so, I believe this is resolved
             | in postgres 14
             | 
             | https://pganalyze.com/blog/postgres-14-performance-
             | monitorin...
        
             | spookthesunset wrote:
             | But at least Postgres has much better tools for learning
             | about the query plans it might use. The Postgres output of
             | "EXPLAIN" is much much better than MySQL.
             | 
             | It's a shame that almost every job I worked at uses MySQL
             | and not Postgres. But that could be because those companies
             | all got their start like a decade or more ago when Postgres
             | was not as well known.
        
             | pzduniak wrote:
             | > As a huge Postgres fan working at a shop using MySQL I'll
             | point out the first thing I saw MySQL does that Postgres
             | doesn't:
             | 
             | > Index hints.
             | 
             | Eh, it's an extension. One you shouldn't use, but it's
             | there.
        
               | leetrout wrote:
               | I didn't know this!
        
           | hn_throwaway_99 wrote:
           | > PG has weird behaviour sometimes too.
           | 
           | Perhaps, but I'd argue that the "weird" behavior of postgres
           | just tends to be clearly thought out design decisions that
           | they made for a valid reason that may cause you some pain
           | with how you use it (e.g. their process-per-connection
           | model).
           | 
           | MySQL's "weird" behavior, on the other hand, just tends to be
           | completely invalid footguns like this. Despite what some
           | people are arguing in this thread, a 3-byte version of UTF-8
           | was never in any spec anywhere and was an invalid shortcut
           | from day 1.
        
             | ipaddr wrote:
             | Oracle, Postgres, MySql, mSql, sqlite all make weird
             | behaviour in different situation. You pick your poison.
             | 
             | I hated how postgres forced you to create a system user to
             | connect I wonder if it still requires this.
        
         | leftnode wrote:
         | I love, love, love Postgres! The _one_ feature MySQL has that
         | Postgres doesn 't is the ability to add a new column in an
         | arbitrary position in the table. I like my tables to have a
         | common "layout" and I would love to have this feature in
         | Postgres.
        
           | nkozyra wrote:
           | This is annoying but it's not a huge hassle to
           | duplicate/copy/rename, assuming your data is small and not
           | sharded in some unique way
        
           | teaearlgraycold wrote:
           | This seems like something that could be done client side
           | instead. But maybe not an option for psql CLI
        
           | srcreigh wrote:
           | MySQL also lets you control the layout of your table on disk
           | via BTree primary storage and composite primary keys.
        
         | andrewmcwatters wrote:
         | People say this all the time but never elaborate on what they
         | are nor link to any sources that help educate people what they
         | are, so I just assume that when people say this, they don't
         | know what they're talking about and are just parroting.
        
         | JangoSteve wrote:
         | I also helped with a project once that used pgloader to
         | migration an old MySQL db to Postgres, and I think this article
         | may actually explain one of the issues we found. We had a
         | UTF-8-configured table in MySQL (we also discovered in the
         | project, that MySQL can have encodings set differently per-
         | table, while Postgres sets the encoding for the entire
         | database) with UTF-8 data, and when we migrated it into a
         | UTF-8-configured Postgres database, some of the UTF-8
         | characters were silently corrupted. They were UTF-8 characters
         | in both databases, so the corruption didn't raise any errors,
         | but they were different characters, so when the data was read
         | back out by the application, the text was different.
         | 
         | We only caught this, because thankfully, we had written a
         | manual checksum script, which looped through every table, read
         | out all values from each row into the application, and hashed
         | the results, then compared between when the app was connected
         | to the source MySQL database vs the destination Postgres
         | database. We ended up having to massage and fix those silently-
         | corrupted characters.
        
       | DonHopkins wrote:
       | >Of course, they never advertised this (probably because the bug
       | is so embarrassing).
       | 
       | Not as embarrassing as being owned by Oracle.
        
       | cryptos wrote:
       | Best to be used with mysqli_real_escape_string ;-)
       | https://www.php.net/manual/en/mysqli.real-escape-string.php
        
       | tomwojcik wrote:
       | I need to share something as literally today I fixed a bug in our
       | project that's somewhat related.
       | 
       | MS SQL encodes everything with UTF-16. Emojis (code points)
       | require up to 4 bytes. If it's a grapheme (emoji constructed from
       | emojis), it will be even more.
       | 
       | We are using Django. If you check length of an emoji, it will
       | give you `1` and Django assumes utf8 everywhere. If you try to
       | save it to PostgreSQL with char field `max_length=1` it will work
       | just fine, but on MS SQL it will fail as it requires 2 characters
       | (4 bytes) in the DB!
       | 
       | I tried it with MS SQL 2017, 2019 with different collations on
       | nvarchar and I'm pretty sure there's no way around it.
       | 
       | > Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and
       | NVARCHAR(n), the n defines the byte storage size, not the number
       | of characters that can be stored, it's important to determine the
       | data type size you must convert to, in order to avoid data
       | truncation.
       | 
       | https://docs.microsoft.com/en-US/sql/relational-databases/co...
        
         | gecko wrote:
         | Generically, anything by Microsoft will historically have used
         | UCS-2, and will use UTF-16 these days, so this is utterly
         | unsurprising to me _as an experienced Windows dev_. Conversely,
         | Linux (and POSIX, more generally) deciding that filename
         | encoding is a per-filename and untracked thing is a bit lit,
         | from my perspective. Point being: when it comes to handling
         | unicode and foreign characters, just, like...always read the
         | documentation. Assume nothing.
        
           | capitainenemo wrote:
           | Yep. Microsoft is the main reason for 2 of my favourite
           | unicode links. https://utf8everywhere.org/ and
           | https://simonsapin.github.io/wtf-8/
           | 
           | And, apparently it's mildly inaccurate to say it uses
           | UTF-16... it's more like UCS-2 with UTF-16 hacked in, with no
           | validation. Thus WTF-8.
        
             | ChrisSD wrote:
             | No, it's UTF-16 with no validation at the kernel level.
             | Invalid UTF-16 is also invalid UCS-2 as those code points
             | were explicitly barred from use.
             | 
             | In practice, only malware will create such broken names.
             | High level software (e.g. Microsoft's own VSCode) will not
             | handle broken UTF-16. And indeed the in-built UTF-8 code
             | page will lossily decode UTF-16 (unpaired surrogates are
             | replaced with the Unicode replacement character).
        
               | capitainenemo wrote:
               | Hm. Are you sure? Because the utf8everywhere article (and
               | various microsoft related framework discussions) seem to
               | suggest there's no validation anywhere. You can easily
               | create partial codepoints and just hitting backspace in a
               | text field can do it. That seems to imply there's no
               | UTF-16 validation even at a higher level.
               | 
               | But I will readily defer to your expertise on this. I've
               | not coded in microsoft land for like 18 years. MFC was my
               | last experience in this, where I still have this vague
               | memory of being shocked by an API returning an int32 and
               | instructing on casting to a void pointer (overloaded
               | response message). No wonder they had issues with 64 bit
               | migration at the time.
               | 
               | Edit: cite on the utf8everywhere thing. "in plain Windows
               | edit control (until Vista), it takes two backspaces to
               | delete a character which takes 4 bytes in UTF-16. On
               | Windows 7, the console displays such characters as two
               | invalid characters, regardless of the font being used."
               | 
               | Maybe they've improved since though. But surely there's a
               | lot of that baggage in the libraries.
        
               | user-the-name wrote:
               | Windows Vista came out fifteen years ago.
        
               | ChrisSD wrote:
               | I mean, Vista is ~15 years old at this point. If anything
               | that's still part of Windows makes the backspace mistake
               | then if nothing else it's impressive it's survived this
               | long without being noticed.
        
         | pimeys wrote:
         | With SQL Server 2019, you can write UTF-8 data to `VARCHAR` and
         | `CHAR` fields, if the table collation is
         | `LATIN1_GENERAL_100_CI_AS_SC_UTF8`.
         | 
         | I feel the pain of UTF-16. When I was writing the Rust crate
         | for the TDS protocol, all N-columns for strings always require
         | you to do a full copy from `Vec<u16>` to `Vec<u8>` and back.
        
           | tomwojcik wrote:
           | I used this collation on 2019 when testing and the only thing
           | that has changed was the error message (more verbose).
        
       | dehrmann wrote:
       | Memories... When I was at Amazon, my starter project was
       | modifying Redshift to support four-byte UTF-8 characters, not
       | just three-byte.
        
       ___________________________________________________________________
       (page generated 2022-01-12 23:00 UTC)