[HN Gopher] In MySQL, use "utf8mb4" instead of "utf8" (2016)
___________________________________________________________________
In MySQL, use "utf8mb4" instead of "utf8" (2016)
Author : pcr910303
Score : 452 points
Date : 2022-01-12 15:21 UTC (7 hours ago)
(HTM) web link (adamhooper.medium.com)
(TXT) w3m dump (adamhooper.medium.com)
| donatj wrote:
| Is Medium secretly mining crypto in the background? I have no
| idea what it is but their site makes my CPU spike and my fans
| turn on.
|
| https://jdon.at/JTHj2G
|
| Update - here's a video, it's making hundreds of graphql requests
| a minute...
|
| https://jdon.at/z05ImC
| throw0101a wrote:
| Note:
|
| > _The utf8mb3 character set is deprecated and you should expect
| it to be removed in a future MySQL release. Please use utf8mb4
| instead. utf8 is currently an alias for utf8mb3, but it is now
| deprecated as such, and utf8 is expected subsequently to become a
| reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is
| also displayed in place of utf8 in columns of Information Schema
| tables, and in the output of SQL SHOW statements._
|
| * https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...
| ghusbands wrote:
| Maybe they should rename/alias it to brokenutf8, to make it
| more likely that people will notice a problem.
| ksec wrote:
| I really wish they do this in a new, major version release. But
| they dont seems to be keen on releasing 9.0.
| capitainenemo wrote:
| This is also a security vulnerability because MySQL/MariaDB
| silently truncate strings at the first invalid character. This
| can result in data manipulation attacks where a higher level
| layer validates the complete input as UTF-8 for insertion into
| the DB, but the database only stores half the string.
|
| https://bugzilla.mozilla.org/show_bug.cgi?id=1253201
| wanderr wrote:
| Any data in your database that can come from external input
| should be treated as untrusted and validated before it's used.
| Otherwise validation bugs or bypasses will result in bad data
| and exploits that persist beyond the fix. Edit: I'm not arguing
| against the need to use utf8mb4
| kstrauser wrote:
| While that's true, I trust PostgreSQL to store exactly what
| I've asked it to store. At some point, you have to trust
| _something_ to do its job, or else everything built on top of
| it is a castle of sand.
|
| Imagine a bug like this in ext4. No one would reasonably
| contend that the layers on top of it should be validating
| that the files you write out are the ones you'll read back
| in. We write unit tests for all kinds of stuff, but we're not
| _that_ thorough.
| colpabar wrote:
| Sure, but the issue here is that the default mysql "utf8"
| encoding is not actually utf8. You can write as many
| validation layers as you want, but if they are assuming that
| utf8 actually means utf8, they won't help, and mysql will
| potentially screw it up when it gets stored.
| capitainenemo wrote:
| Agreed. The confusing part here as I see it is where
| validation layer A (correctly) asserts the data is valid
| UTF-8 and safe then assumes the database persists what it
| passes to it, since no error is reported.
|
| Then, subsystem B trusts reading the database field (since it
| passed validation layer A).
|
| Obviously more validation layers can be added, but at this
| point validation layer C called by subsystem B needs to
| _know_ what the initial input from layer A is in order to
| differentiate it from the db value which was manipulated - a
| rather tricky thing to do sometimes. (I guess you could add a
| hash to the db to check the db is storing your strings, but
| really.. come on)
|
| Upgrading to utf8mb4 is probably safer than hoping enough
| validation layers thrown at it solves the problem.
| wanderr wrote:
| Yes. I've been burned by this exact issue in the past, not
| realizing that utf8 was not really utf8 in mysql-land. That
| is a major, major WTF IMO.
|
| But I've also seen people relying on validation at the time
| of insertion so many times that I wanted to warn against
| that, too. Not in argument against the need for utf8mb4.
| evanelias wrote:
| Silent truncation only occurs if strict sql_mode is disabled.
| Strict sql_mode has been the default since MySQL 5.7, released
| over 6 years ago. All prior versions have hit EOL support, so
| there is literally no supported version of MySQL which has this
| behavior out-of-the-box.
|
| The topic of MySQL utf8 vs utf8mb4 was discussed in depth just
| _six days ago_ on HN:
| https://news.ycombinator.com/item?id=29793916
| Ginden wrote:
| > Silent truncation only occurs if strict sql_mode is
| disabled. Strict sql_mode has been the default since MySQL
| 5.7, released over 6 years ago.
|
| What happens if you upgrade from older version? Are defaults
| from previous version used, or is it default only on fresh
| installs?
| evanelias wrote:
| See bottom of https://news.ycombinator.com/item?id=29910494
| and my reply https://news.ycombinator.com/item?id=29910863
| johannes1234321 wrote:
| If you don't have an explicit value set in your my.cnf
| config and the client sends no explicit value upon
| connecting you get the current built in default of the
| version you are using. If you set something somewhere it
| will be respected.
| kukx wrote:
| The strict sql_mode is often manually disabled. I bet it is
| off on most shared hostings.
| evanelias wrote:
| Sure, but that's not MySQL's fault. MySQL/Oracle has no
| control over what ill-conceived setting overrides are used
| by DBaaS providers or shared hosting providers.
|
| There are similarly dumb misconfigurations that can
| theoretically be made in most hosted open source software.
| capitainenemo wrote:
| Well, the things it would have been nice for MySQL to
| have done would have been not implementing it this way in
| the first place (yes, that's probably not Oracle's
| fault), and moving utf8 to utf8mb4 a bit more quickly.
| (good to know that's finally happening).
|
| Also, I gotta say, now I have to look into this strict
| mode and what else it breaks, when in this particular
| case a "don't silently truncate strings" flag would be
| preferable and probably lower impact, since "strict" vs
| "security flaw mitigation" is not quite the same thing
| (does such a more granular flag exist?). ... unless
| that's all the strict mode does.
|
| Hm, and since you brought this up elsewhere. Is this
| "strict by default" in MariaDB - which is what most
| (linux) people are (probably) thinking of.
| evanelias wrote:
| > moving utf8 to utf8mb4 a bit more quickly
|
| I definitely agree!
|
| > now I have to look into this strict mode and what else
| it breaks
|
| It's a mode to ensure nothing happens silently: illegal
| data type conversions, attempting to insert invalid
| values for the data type (too large, too long, invalid
| characters, etc), attempting to do an INSERT which
| doesn't specify some columns that lack defaults, etc.
| With strict mode enabled, all of these things throw hard
| errors, like in most other relational databases.
| Generally, it covers the entire class of things that are
| typically the most common complaints about MySQL on HN.
|
| Interesting wording re: "breaks" -- for practical
| purposes this is a real concern for existing
| applications, which may be doing bad things which
| previously didn't throw errors with strict mode off. Do
| you want your DB to suddenly fail loudly for these things
| (strict mode), or do you want your existing application
| to continue working in these cases? Can't have it both
| ways.
|
| > Is this "strict by default" in MariaDB - which is what
| most (linux) people are (probably) thinking of.
|
| MariaDB made it default in 10.2, ~5 years ago.
|
| I don't agree that most Linux people conflate MySQL and
| MariaDB though. At large US tech companies that use
| MySQL, they're almost all literally using Oracle MySQL,
| or a patch-set like Percona Server which directly tracks
| MySQL, or an internal patch-set that does the same.
|
| In recent years, MySQL and MariaDB have diverged a bit on
| feature set and focus; each has some interesting features
| that the other lacks. They are not drop-in replacements
| for each other.
| capitainenemo wrote:
| So... while I'm broadly in agreement with you, and a huge
| fan of strict typing, I will note people use languages
| like Javascript and usually are ok with things like
| string to integer coercion.
|
| It seems to be that there's a world of difference between
| "silently truncate string" and "type coercion" even if I
| readily agree both are sources of security
| vulnerabilities.
|
| Thanks for the replies though, even though I clearly
| annoyed at least 2 people due to the downvotes.
| ars wrote:
| > and moving utf8 to utf8mb4 a bit more quickly. (good to
| know that's finally happening).
|
| That happened in 2010 - I'm not sure 12 years counts as
| "finally happening", or just how much quicker you would
| like than 12 years ago?
|
| Other than a time machine, I'm not sure what else you
| would like them to do.
| capitainenemo wrote:
| I mean the announced plan to make the "utf8" default be
| utf8mb4 instead of utf8mb3.
|
| See https://news.ycombinator.com/item?id=29910023
| butlerm wrote:
| They should have made "utf8" have UTF-8 semantics by
| default as soon as the defect was discovered, not twenty
| plus years later.
| ars wrote:
| They couldn't. The storage size is different internally.
| The article covers this in the section in migrating from
| utf8 to utf8mb4.
|
| Do you actually use MySQL or are you just complaining
| from a theoretical POV?
| butlerm wrote:
| Pretty sure the name "utf8" isn't metaphysically wired to
| a defective physical representation. All they have to do
| is wait until the next minor release, add a compatibility
| name, make the generic name refer to a non-defective
| representation, and update the documentation and release
| notes accordingly.
|
| If they were really conservative, they could wait until
| the next major release to repoint the name like that.
| Haven't there been four major releases since 2003?
| evanelias wrote:
| This would be problematic for existing applications. Even
| if you leave existing binary data as-is (in existing
| tables utf8 becomes utf8mb3), changing the meaning of
| "utf8" in only new tables is still very problematic.
| Think about logical dumps, e.g. CREATE TABLE statements
| which use the "utf8" alias. For example, any self-host
| application which runs a .sql file in its install
| process. Or spinning up a dev environment for a new hire.
| Basically, anything that contains a CREATE TABLE with the
| utf8 alias now has a landmine:
|
| In older versions of MySQL, InnoDB indexes were limited
| to 767 bytes. Meanwhile, VARCHAR(N) can store N
| _characters_. So with utf8mb3, that 's a max of 3N bytes
| worst-case; but with utf8mb4 ("proper" utf8), it's now
| 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4
| could not be indexed (in its entirety) in these older
| versions; the CREATE TABLE would fail if you try.
|
| When you have a massive installed base of users, and
| those users have lots of dumps / install scripts
| containing "VARCHAR(255) CHARACTER SET utf8", you simply
| cannot change the meaning of "utf8" in a minor release.
|
| I fully agree that MySQL could have made this transition
| faster/earlier than they did, but I disagree that it
| could be done trivially or terribly quickly.
| butlerm wrote:
| Implementing silent string truncation in the first place
| (by default(!)) is an excellent indication of how fast
| and loose and out of touch the original MySQL developers
| were with the database world. It is the worst database
| "feature" I have ever seen.
| evanelias wrote:
| It made more sense at the time. Earlier days of the web,
| dynamic scripting languages used everywhere, etc. No
| emoji, barely any use of 4-byte characters at all. Large
| corps weren't using open source DBs, or sometimes open
| source anything. Simpler times!
|
| And for sake of comparison -- if I understand correctly,
| prior to just _2 months ago_ , sqlite didn't even have
| the option for strict type enforcement [1]. Do you judge
| sqlite as harshly, or think its creator is out of touch
| with the database world?
|
| [1] https://www.sqlite.org/datatype3.html
| butlerm wrote:
| Relational databases have been used in production since
| the late 1970s. You would be hard pressed to find any
| that silently corrupted user data like that, ever, until
| MySQL came along.
|
| Sqlite is more of an embedded database, but if it
| silently truncated character data under some conditions
| that would make it equally unusable. Better to have no
| type enforcement than defective type enforcement.
| evanelias wrote:
| I think you misunderstood my point. Yes, relational
| databases have been around a long time. But in the early
| days of MySQL, most of the companies using MySQL differed
| quite a bit from the companies typically using DB2,
| Sybase, Oracle, etc. I've been using MySQL professionally
| since 2003 and am speaking from first-hand experience
| here: the web development world was largely separate from
| the traditional enterprise world at that time. It was
| considered a _different use case_ , it's not like today
| where the biggest companies in the world are tech
| companies built on open source software.
|
| Regarding "silent", MySQL at least emits a warning when
| this happens. Still not great / obviously wrong in a
| modern context, but just for clarity's sake it wasn't
| _completely_ silent.
|
| As for "defective type enforcement", that's a subjective
| opinion. In C, atoi("123hello456") returns 123 -- once
| you hit an invalid character, everything from that point
| is truncated. Is C atoi defective? Perhaps, but it's an
| opinion.
|
| With strict mode disabled, four-byte characters worked
| the same way in MySQL utf8mb3: four-byte characters are
| _invalid_ in this data type, and everything gets
| truncated at the first invalid character. Would simply
| stripping the invalid characters, but keeping valid chars
| after them, necessarily be more "correct"? No, it's an
| implementation decision, especially considering that
| emoji _did not exist at the time_ and barely any non-BMP
| (4-byte) utf8 characters were used anywhere.
| butlerm wrote:
| It is certainly true that most users of relational
| databases before MySQL used them for applications where
| throwing away data is a potential catastrophe, and most
| uses of MySQL were for toy applications where if that
| happened no big deal. It was about a decade before MySQL
| even supported transactions.
|
| atoi is deficient by modern standards, but at least it
| clearly documents what the behavior is and always has. It
| doesn't truncate anything. For a database, truncating
| data is unforgivable. That is what errors are for, so
| your database doesn't act as a black hole where your data
| may silently disappear forever.
|
| And yes, if you can't return an error for some reason,
| you should preserve as much data as possible. Throwing
| away hundreds or thousands of characters after an initial
| anomaly is insane. But a database should never do that
| anyway, it should either store what you give it, or
| return an error. Same with a filesystem or any other
| trustworthy means of data storage. Sorry we were too lazy
| to tell you we threw your data away isn't acceptable, nor
| is relying on anyone to check for warnings that should be
| hard errors.
| purple_ferret wrote:
| Lots of people turn it off because Warnings are easier to
| ignore than Errors are
| capitainenemo wrote:
| That's good to know, however I was encountering this MySQL
| truncation just last year in COTS software, so clearly
| there's a long tail. As a related example, log4j 1.x has been
| deprecated for years, but 100% of atlassian uses it with
| their own maintained version.
|
| In fact, when the log4j thing came up, we audited all our
| existing COTS/internal and virtually all of it was still on
| the "EOL" log4j 1.x.
|
| Since you're familiar with this. What happens with an
| existing upgrade - is strict mode enabled by a typical distro
| package upgrade or does it only impact new installs?
| evanelias wrote:
| For sure, there's a huge long tail. However, if a company
| is using 6+ year old EOL software that has flaws, that's
| outside of the vendor's control.
|
| > What happens with an existing upgrade
|
| sql_mode is a setting, so it depends on whether the default
| is being overridden: e.g. does the my.cnf config file
| specify a value explicitly? Does your application's
| connection logic set a session-level override for each
| connection? DBaaS vendors add an additional wrinkle since
| they tend to have custom UIs and APIs for global settings.
|
| > typical distro package upgrade
|
| Many Linux distros made it so that attempting to install
| "mysql" actually installs MariaDB instead, so when going
| this route, it's unfortunately always very hard to predict
| what you'll get or what will happen!
| KronisLV wrote:
| > Many Linux distros made it so that attempting to
| install "mysql" actually installs MariaDB instead, so
| when going this route, it's unfortunately always very
| hard to predict what you'll get or what will happen!
|
| MariaDB is indeed pretty pleasant to use and for most use
| cases can be a drop in replacement for MySQL, should you
| not want to use MySQL, but also wouldn't want/need/be
| able to use PostgreSQL or something else:
| https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
|
| I might be in the minority here, but i actually rather
| enjoy MySQL/MariaDB, since MySQL Workbench is still the
| best DBA software that i've used, especially because of
| the ER modelling and schema sync functionality, and their
| approach to schemas/users within a DB feels more
| reasonable than that of Oracle/PostgreSQL.
|
| That said, both PL/SQL and PL/pgSQL feel more powerful
| than whatever MySQL/MariaDB has. Also, in regards to
| MariaDB in particular, their versions have diverged from
| MySQL somewhat, so you might still want to look into the
| differences, if migrating over:
| https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
|
| As for predictability - nowadays i just run specific
| versions in containers (e.g. 10.7.1 or maybe 10.7 which
| corresponds to the patest PATCH release), to sidestep the
| distro differences altogether. It also lets me easily
| launch new instances, give resource limits to them, have
| multiple versions in parallel and mount the data
| directories wherever i need them, for easier backups etc.
| waynesonfire wrote:
| I'm going to use this as an opportunity to vent.
|
| Wouldn't it be a nice world where you can just declare
| something EOL and it disappears and somehow magically it's
| not a problem anymore for anyone end-to-end? Thing's just
| don't work that way. Like, suddenly, a product that was
| working perfectly fine for who knows how long suddenly
| needs to modified because some ass-hats thinks that EOL
| isn't suitable for production.
|
| There are a number of reasons to upgrade a dependency or a
| product; an EOL label shouldn't be a part of the calculus.
| EOL may imply certain conditions that may influence the
| decision to upgrade but for the sake of EOL you can eat a
| bag of dicks. Think harder why you're proposing an upgrade
| is needed and how it'll add value to the business. EOL
| software is perfectly fine especially because it's been
| battle tested.
|
| We also use log4j 1.x all over the place and I feel zero
| need to change that.
| dimitrios1 wrote:
| It is perfectly reasonable to continue to use EOL
| software, but you have to assume maintenance costs in
| that case. It would be irresponsible to continue to use
| EOL software as is. EOL is a perfectly natural concept.
| Nothing in this world lasts forever, including our
| machines, electronics, cars, batteries, lives, and yes,
| software.
| rkagerer wrote:
| If software lasted a lifetime, life would be so much
| easier.
| Dylan16807 wrote:
| > EOL may imply certain conditions
|
| And "EOL" is a convenient shorthand for those conditions.
|
| Most use cases care about never getting patches for
| security problems ever again.
| dehrmann wrote:
| It's like MySQL and PHP were meant for each other.
| anon776 wrote:
| How dare you. Also accurate.
| chris_wot wrote:
| PHP has improved a lot.
| JodieBenitez wrote:
| yeah... it also has a strict mode... that is off by
| default.
| tyre wrote:
| Another fun one that came up at my first job: If you run up
| against the size limit of a text column in MySQL, it silently
| truncates.
|
| This came up when a developer used YAML for some configuration
| file. There was all sorts of confounding behaviour because YAML
| is still valid when truncated.
|
| Every database has limits, but at least alert the user! I'd
| rather a query fail than alter data.
|
| Edit:
|
| This is something that might be useful in CS classes, to
| reflect what being a real-world engineer can often entail. Set
| an assignment with some silly nonsense like this and students
| have to debug it.
| leksak wrote:
| If the point of CS education is to produce programmers and
| software engineers as opposed to computer scientists, yes.
| taftster wrote:
| What percentage of CS graduates go onto become computer
| scientists? Seems like vast majority of CS degree holders
| aren't practicing "pure" computer science, as you would
| seemingly hope.
| marginalia_nu wrote:
| Why were you putting YAML in a relational database?
| tyre wrote:
| iirc it was to allow customization of an e-commerce item.
| Basically a column as a document (in the sense of a
| document database) with custom attributes that could
| be...large.
|
| It wasn't my implementation. This was another team
| int_19h wrote:
| It's really weird that they've chosen YAML for it, since
| its sole (questionable) benefit over the alternatives is
| that it's geared more towards being edited directly by
| humans... which is generally not something that you need
| to do with a database field. JSON is more usual for
| scenarios like this, not the least because most databases
| actually provide tools to parse and query it.
| marginalia_nu wrote:
| It's pretty weird to put any configuration-languages in a
| relational database in general. Large variable-length
| objects is not what relational databases are good at.
| You're mostly getting the drawbacks, and very few of the
| benefits in that use case. Even a filesystem is probably
| more suitable.
| capitainenemo wrote:
| Drupal seems keep a large amount of PHP serialisation of
| configuration in its database (a JSONish thing), probably
| because it's the standard data store for the site.
| Noticed this in fixing a dump one time. Got to use vim's
| function syntax for regex to ensure the character length
| count of the string matched the value of the new string.
|
| I think virtually everything generically configurable in
| the UI that isn't part of a small core with dedicated
| fields (or in the sitewide PHP config include) is handled
| that way.
| mrweasel wrote:
| Maybe also do a checksum or something, seriously it could
| just be CRC32. I get that it's not something you
| instinctively think to do. For configurations that are most
| likely parsed directly by something else though, may check
| that nothing broke in transit.
| cornel_io wrote:
| There's a setting for this, but like many settings MySQL has
| (along with InnoDB, the storage engine) historically chosen
| exceptionally bad defaults. It's possible to tune the thing
| to be screaming fast and pretty robust, but it takes a lot of
| expertise and TBH these days most people would be much better
| off going either NoSQL or Postgres instead.
| capitainenemo wrote:
| I do find this behaviour rather incomprehensible actually. I
| understand that if the "utf8" is BMP it can't have invalid
| characters in it. What puzzles me is that the approach they
| went for seems to be almost the worst possible one from a
| security and data integrity perspective.
|
| Assuming a SQL error can't be thrown for some legacy reason
| (and you'd think a flag to opt into that would be an option),
| why not replace the invalid parse with U+FFFD like most things
| do? Or at a bare minimum, only strip the invalid character
| which at least makes attacks a bit harder and results in a
| little less data loss.
| [deleted]
| capitainenemo wrote:
| Just wanted to reply to myself that I just learned from
| evanelias' comment a "strict mode" flag is indeed an option.
| I suppose it's time to audit all my mysqls to see which ones
| have this enabled, and what other mysql-isms this flag might
| break.
|
| I know it's definitely not enabled on one updated and
| maintained BSD - this could be because they didn't want to
| break existing systems during upgrade.
|
| I suspect this problem is still extremely common. But good to
| know there's some form of defense.
| pshc wrote:
| Quietly corrupting data and shipping with the wrong default
| settings is what MySQL is all about! Keeps the consultants
| busy.
| taf2 wrote:
| It's also easy to imagine how it happens... const char
| *buffer = malloc(...; memcpy;... etc... pretty fast pretty
| straight forward
| PrimeDirective wrote:
| The author mentioned MariaDB in the intro and the went on about
| MySQL, not sure which he actually used.
|
| I think the latest recommendation for MySQL was
| utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.
|
| While I agree about the Postgres recommendation, it's a non-
| starter in the PHP world, unless doing something from scratch.
| Even then, the support for MySQL/MariaDB is much better.
| timw4mail wrote:
| What's that mean, non-starter in the PHP world? Postgres is
| well supported by frameworks and libraries in PHP, not to
| mention PHP itself.
| iliketrains wrote:
| Fun story: When I was working at one of the FAANG companies, I
| have placed the (Night with stars, it does not seem to render
| here [0]) character in my preferred name. Some time later, I was
| notified by HR that I must change my preferred name and I can
| only use UTF-8 characters with maximum length of 3 bytes. I was
| quite confused why such specific demand, I am pretty sure it was
| exactly this issue. I replaced it with tta***tto***
|
| [0] https://emojipedia.org/night-with-stars/
| jiggawatts wrote:
| Are you called Bobby Tables by any chance?
| urbandw311er wrote:
| Fantastic response
| brightball wrote:
| I just went through the process of converting a 2TB MySQL
| database over the summer. Quite a few gotchas along the way that
| make the process a huge pain.
| mwattsun wrote:
| Has anyone ever successfully converted a large legacy
| Apache/PHP/MySQL site to UTF-8? If so you deserve an award.
| utf8mb4 is just one issue and not the worst.
| fleddr wrote:
| At least once, but it's been a while. From the top of my head:
|
| - set encoding at HTML head - also at any HTML form element -
| Set database collation correctly - I believe it also needs to
| be done per table - The PHP -> MySQL connection itself may need
| to have the encoding set - All string manipulation to be done
| with mb_ - Even ensure your code files themselves are in UTF8
| sam_lowry_ wrote:
| Yes, but I have chosen utf8_bin as encoding and lived happily
| with this choice for a dozen years.
| ars wrote:
| Been there, done that, with a phpbb installation. I manually
| converted all the non-UTF data in the database to UTF8, and
| updated the PHP code and database settings.
|
| It worked fine, no major issues, just some work.
| onli wrote:
| https://github.com/s9y/Serendipity/blob/05f58f90d743fe9ade24...
| is just the detection function I wrote for a PHP blog engine to
| detect whether it would even be possible to use utf8mb4 on the
| system. We completely ran into this issue and didn't know how
| to handle it for literally years. Lots of blogs to migrate, and
| who knows whether it worked for all of them...
| mwattsun wrote:
| The craziest issue I had was I couldn't predict what char
| encoding the text in my database was in. Most users entered
| Windows-1252, some text blobs were UTF-16, others were
| European character sets, and some were UTF-8. Some were
| Japanese SHIFT_JIS. Don't ask me how any of this happened. I
| retrospect, I should have dumped all the tables from MySQL
| and used the excellent PyPy Chardet [1] library to see what I
| was dealing with, do the conversions and then re-import the
| data. But then someone could copy UTF-16 from a Windows
| document and paste it in, so you have to convert going in to
| the database.
|
| You have set Apache to UTF-8, PHP to UTF-8, MySQL to UTF-8,
| and the MySQL driver you are using to UTF-8. It's not clear
| how these setting interact. Are there silent conversions
| happening or do you always have to detect the encoding on
| data coming from the server? HTML pages have a character
| encoding specifier, but the BOM at the start of the file
| takes precedence (I think.) I got it to work by always
| detecting encoding for any text coming from the database and
| using iconv, but this turned out to be really slow and
| unreliable. It was truly the biggest mess by an order of
| magnitude than any other programming problem I faced in my
| career.
|
| Would not attempt again.
|
| [1] https://github.com/chardet/chardet
| charrondev wrote:
| Technical lead for vanillaforums. It's a huge PHP MySQL app.
| Probably 10b+ rows in production DBs. We did the the migration
| in 2017 if I remember correctly.
| mwattsun wrote:
| Not to take anything away but by legacy I meant late last
| century sites. VanillaForums was released in July 2006
| according to Wikipedia, which makes it Web 2.0. Wikipedia
| also says "UTF-8 has been the most common encoding for the
| World Wide Web since 2008" I'm glad you posted though because
| now I know that sites are ok starting in that time frame, not
| before and then never touch the old ones again.
| usbqk wrote:
| Sure, you just prepend mb_ to all php string calls >:)
| ospzfmbbzr wrote:
| Yes. I had the exact same issue with '<' in HTML content stored
| in the database -- cms pages and also some JS that was stored
| in the DB and injected into the HEAD of all pages.
| [deleted]
| davidjfelix wrote:
| Does this even fully fix the problem? It looks like utf8mb4 is
| limited to 4 byte sequences, but as far as I understand, utf-8 is
| variable width. Can utf8mb4 even encode the scottish flag
| https://www.iemoji.com/view/emoji/2476/flags/scotland ?
| zhte415 wrote:
| Variable width is unlikely to be a problem. At 3:5 the Scottish
| flag does not have an unusual aspect ratio. This is unlike the
| flag of Qatar, with a ratio of 10:28, or Nepal with both a 3:4
| (approximate, not exact) aspect ratio plus an irregular shape.
| jeroenhd wrote:
| Is this a problem? Flags in unicode are defined by several
| special characters. The flag of Scotland isn't really a single
| character, it's "<waving flag><tag g><tag b><tag s><tag c><tag
| t><cancel>".
|
| All of these characters are multi byte combinations. The hex
| for the flag is not a single, super wide character, it's
| 0xF09F8FB4 0xF3A081A7 0xF3A081A2 0xF3A081B3 0xF3A081A3
| 0xF3A081B4 0xF3A081BF. You might get some weird results if you
| take substrings from that, but it won't be a problem for the
| backing database store; each separate "binary character" is a
| four byte sequence (as denoted by the 0xF at the front of the
| number).
| jerf wrote:
| Flags are not single code points. UTF-8 refers to how code
| points are stored. If you look in your link at "Hex Code
| Point(s)", it is that first one that would be a problem with
| "utf8" in MySQL, because of the 1 in the 0x10000 position. The
| other six code points required would fit in fine.
|
| Unicode is developing more and more things that require code
| points. I'm not sure what the longest legal non-redundant
| series of code points that can validly represent a glyph
| somewhere is, but it's getting up there with all the emoji skin
| modifiers and such.
| jeroenhd wrote:
| Emoji modifiers, both for gender and skintone, do not produce
| that many extra code points. Combinations are made in the
| same way " + e can combine into e: a code point, followed by
| a combination code point, followed by a modifier. During text
| rendering, these code points are converted back into a single
| glyph.
|
| All UTF-8 codepoints in use today can be encoded with four
| bytes. Theoretically the Unicode system can be used to create
| 6 byte code points if that ever becomes necessary, but it
| won't be for a while. Crossing the 4 byte boundary would also
| introduce compatibility issues with UTF-16, so I'm sure the
| Unicode Consortium will do their best to prevent this from
| happening as long as they can.
| fredoralive wrote:
| Legal UTF8 is limited to 4 bytes, as Unicode only uses ranges
| that fit the limits of UTF16.
|
| AFAIK The flags are a weird multi-code point encoding of the
| ISO country codes, and each individual code point is less than
| 4 bytes.
| warpspin wrote:
| The biggest codepoint in Unicode fits into 4 bytes of UTF-8.
| UTF-8 would allow up to 6 bytes, but those codepoints are not
| in use currently. If they ever become in use, yes, you'd
| probably need a new character set again. But then a lot more
| things will break, as higher codepoints would be incompatible
| with UTF-16 also.
| ghusbands wrote:
| UTF-8 only allows 4 bytes, since 2003:
| https://datatracker.ietf.org/doc/html/rfc3629
| loeg wrote:
| UTF-8 is variable width. The biggest valid codepoint is
| U+10FFFF, which has a 4-byte encoding in UTF-8. Other
| codepoints have 1-, 2-, or 3-byte encodings.
| [deleted]
| urbandw311er wrote:
| I enjoyed the article right up to the point where it concluded at
| the very end that the solution is to "switch to postgresql"
| without actually providing any rationale for that sudden and
| subjective ending.
| talos2110 wrote:
| So in mysql utf8 does not mean utf8. Reminds me of iso8601 in
| php, which does not mean iso8601.
|
| https://www.php.net/manual/en/class.datetimeinterface.php
| rubyist5eva wrote:
| god I hate mysql so much - long live postgres, one of the
| greatest pieces of software ever created
| hn_throwaway_99 wrote:
| Wow, never knew this, this is really bad, especially in this day
| and age where emojis are so prevalent.
|
| MySQL should deprecate utf8 and give a warning if you try to use
| it.
| loeg wrote:
| It is deprecated:
|
| > You should also be aware that the utf8mb3 character set is
| deprecated and you should expect it to be removed in a future
| MySQL release. Please use utf8mb4 instead.
|
| https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
| hn_throwaway_99 wrote:
| How the heck did someone downvote you? Basically your link is
| exactly what I was looking for and I'm glad they're making
| this deprecation.
| daneel_w wrote:
| Bad idea. If they deprecate "utf8" they will deprecate a
| standard. It was designed and specified at a time when the
| Unicode code space had a 21 bit limit. There are some technical
| (storage space-related) considerations with the suggestion to
| "just use utf8mb4 everywhere instead" because of how InnoDB's
| indices work.
| loeg wrote:
| MySQL "utf8" is still "utf8mb3", which is not a standard
| anywhere except MySQL. It cannot store the full range of
| 21-bit code points; I don't know why you keep repeating this.
| The maximum codepoint in 3-byte UTF-8 is 0xFFFF, which is 16
| bits.
| hn_throwaway_99 wrote:
| > If they deprecate "utf8" they will deprecate a standard.
|
| No, they won't. They will be deprecating something they
| created that _never_ conformed to the UTF-8 standard. That
| fact that unicode didn 't have codes beyond 21 bits at that
| point is pretty irrelevant.
|
| It was an invalid implementation from day 1, at least with
| the name of "utf8".
| daneel_w wrote:
| This is an old article and it was incorrect when written, and
| it's incorrect today. The author may possibly have learned since
| writing it that UTF-8, by design, encodes _up to 21 bits_ of
| character code point in at most 4 bytes of data. It works and
| behaves as it should.
| echelon wrote:
| Well, that sounds great, but I've had to migrate numerous
| fields to utf8mb4 because we kept getting paged on emoji.
| chris_wot wrote:
| The OP is wrong about the article being incorrect.
| daneel_w wrote:
| Same. Inbound SMSes with emojis. Outbound SMSes with emojis.
| Names and labels with emojis etc. But the implementation of
| "utf8" still isn't wrong, nor a bug - just like the 20 year
| old specs of UTF-8, they just failed to take the future into
| account.
| loeg wrote:
| UTF-8 was not restricted to U+10FFFF until November 2003. Prior
| to that, implementers had to assume it could encode up to 6
| bytes of data.
|
| The "utf8" mode in MySQL can only encode _3 bytes_ of data (up
| to 16 bit codepoints), which is less than 4 (less than 21 bit
| codepoints).
|
| "utf8mb4" correctly encodes all UTF-8 codepoints by current
| standards.
| daneel_w wrote:
| > _" utf8mb4" correctly encodes all UTF-8 codepoints by
| current standards._
|
| Yes, all good and known, but not really what the
| misunderstanding is about.
|
| "utf8" follows the definitions of UTF-8 as they stood at the
| time, adhering to Unicode's former code space limit.
| butlerm wrote:
| MySQL's adoption of a utf8mb3 limitation for "utf8" was
| over seven years obsolete when it was adopted. It would
| have been obsolete in 1996.
| LinuxBender wrote:
| Would the DBA's on this thread consider committing code to the
| MySQL [1] and Postgres [2] tuner scripts that give new DBA's all
| your learned advise and battle hardening experience? This thread
| appears to be such an example.
|
| [1] - https://github.com/major/MySQLTuner-perl
|
| [2] - https://github.com/jfcoz/postgresqltuner
| pm90 wrote:
| Another interesting "issue" is that Case sensitive character sets
| may lead to unexpected results with aggregations (eg
| https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/)
|
| MySQL has many such issues which have turned me off from
| recommending it. It's a shame because it's a solid technology
| that mostly works and has a really long operational history.
| ipaddr wrote:
| No one talks about the downsides of changing.
|
| Moving from 255 max characters to 191 or 192 max means a lot of
| your data needs to be moved into a text fields which means things
| like this field can't be a primary key.
| Pxtl wrote:
| I'm used to mssql where nvarchar (the UTF-16 datatype) supports
| basically unlimited length... you can't make long PKs in mysql?
| ipaddr wrote:
| You are limited by smaller types ints, dates and chars.
| clon wrote:
| The plain utf8 covers the "basic multilingual plane"
| (x0000-xFFFF), so it will get you very far, actually.
|
| In our app, we finally went for utf8mb4 to allow people to enter
| emoji. As a side "bonus" you will also soon see some clever
| people entering their names to stand out, such as "John". Note
| that this is not "John". It is actually a series of mathematical
| symbols: U+1D679 MATHEMATICAL MONOSPACE CAPITAL
| J U+1D698 MATHEMATICAL MONOSPACE SMALL O U+1D691
| MATHEMATICAL MONOSPACE SMALL H U+1D697 MATHEMATICAL
| MONOSPACE SMALL N
| RL_Quine wrote:
| capitainenemo wrote:
| One fun thing you can do in MariaDB is define a collation based
| off of the Unicode confusables list. The result of this is that
| select * from usernames where user = 'John' would still work :)
|
| It also avoids registering both 'John' and 'John'. (changed the
| o)
|
| Hedgewars user registration does this.
| sam_lowry_ wrote:
| That's a great feature. I was running a community site once
| where impersonation became a popular game. So we had dozens
| of users registering accounts with Cyrillic i,o,a,e etc.
| stragies wrote:
| For looking up relevant further material, this is called a
| 'homograph' or 'homoglyph attack'.
|
| It was also popular for site-spoofing by sending links with
| these cyrillic character to lead people to a fake
| bank/search_engine page, until that became widely used, and
| then owners of the tech-stack (browsers, registrars, dns-
| operators, etc) shut down this attack vector mostly.
| fredoralive wrote:
| According to the docs, MySQL is a bit more explicit and calls the
| old "utf8" "utf8mb3" nowadays, and notes that it may be removed
| at some point. Although "utf8" is still aliased to the footgun at
| the moment.
|
| https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
| webkike wrote:
| Genuine question: why would anyone prefer MySQL over Postgres?
| ars wrote:
| MySQL is much easier to administer and upgrade.
|
| For example I have two versions of Postges running on my
| machine right now because Debian does not automatically upgrade
| between them - you have to export your data, and reimport it
| into the new version.
|
| https://www.pontikis.net/blog/update-postgres-major-version-...
|
| MySQL just works. There are issues with it, but they are not as
| major as people make them out to be. Sometimes people like
| "technical purity" over just get the job done.
| _moof wrote:
| _> MySQL just works_
|
| Right up until one of your users tries to use an emoji,
| apparently.
| outworlder wrote:
| Keep in mind that that's a post from 2016. PG has made many
| improvements since then. Upgrading is a much easier job right
| now.
|
| Also, it's strange to conflate PG with Debian issues.
| lewiscollard wrote:
| In the WordPress world, it is the only database.[0] If you do
| WP, you do MySQL. You may sniff at WP (I certainly do), but it
| is used for 43% of all websites, which is not a trivial number
| of deployments, and not a small market.
|
| I suspect it has a large market share in the "started as a
| trivial PHP program, grew to large application, too committed
| to MySQL to migrate to something else" market, too. There's
| that one website with the blue F logo that went this path, but
| there's probably thousands more.
|
| That's not the same thing as anyone actually preferring it, but
| that's two possible reasons why there's an awful lot of MySQL
| deployments out there.
|
| [0] https://codex.wordpress.org/Using_Alternative_Databases
| kstrauser wrote:
| Job security. I can't think of any technical reasons.
| ars wrote:
| Other way around - Postges will give you more job security,
| since it's rarer and takes more time and work to administer.
|
| People use MySQL because it's easy, and they want to spend
| their effort in other places.
| kstrauser wrote:
| That's like saying Python is rarer than JavaScript. Even if
| it's true, it's not actually rare at all.
|
| In the last 6 years, I've spent approximately 2 hours
| administering PostgreSQL, and I'd be the one who had to do
| it.
| jiggawatts wrote:
| It's "easy" until it silently shreds data and gets your org
| hacked through truncation attacks.
|
| Don't confuse lazy with easy.
| [deleted]
| dark-star wrote:
| In the past, Unicode was assumed to be 64k of codepoints, so a
| 3-byte UTF-8 sequence was considered "long enough", especially
| since there were surrogate pairs for the rare cases where you
| have to encode higher code points.
|
| Only "recently" have longer UTF-8 sequences (aka. emojis) become
| widespread enough that this became a problem.
|
| Yes, it could have been avoided if they had allowed arbitrary-
| length UTF-8 sequences from the beginning, but I can see that
| they probably just wanted to optimize a bit.
|
| What I don't understand is why they had to create a different
| encoding (the utf8mb4) instead of just extending the existing
| utf8 encoding, since 4-byte UTF-8 is obviously backward-
| compatible with 3-byte UTF-8... (unless they always used 3 bytes
| for every character, which would be stupid as UTF-8 has been
| explicitly designed for variable-length encodings)
|
| Bonus: Many filesystems also do not allow 4+ byte UTF 8 code
| points in filenames. Test your company's file server to see if it
| allows you to save a file as "(some random emoji).doc". A few
| very expensive storage systems also have problems with that (and
| they have the same workaround: convert your filesystem to a
| different encoding, instead of simply extending the existing
| encoding to allow 4+ bytes)
| int_19h wrote:
| > why they had to create a different encoding (the utf8mb4)
| instead of just extending the existing utf8 encoding, since
| 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8
|
| Because CHAR and VARCHAR columns have max length specified, and
| this translates to the corresponding amount being reserved in
| storage. For a variable-length encoding, this is normally
| computed assuming the largest possible value for every
| codepoint.
| bearjaws wrote:
| Its way more than Emojis, people copy pasting from Excel will
| inadvertently add UTF8mb4 characters to your database. This is
| handled extremely poorly by MySQL and results in ugly
| characters being displayed.
| xigoi wrote:
| It's not backward-compatible if someone relies on the errors.
| I'm guessing it's a case of spacebar heating.
| Anthony-G wrote:
| For those unfamiliar with the spacebar heating reference:
| https://xkcd.com/1172/
| raphlinus wrote:
| While non-BMP Unicode has only become popular somewhat recently
| due to emoji, it's been part of the relevant official specs for
| a quarter century.
|
| I know you were kinda joking by using scare quotes for
| "recently", but let me fill that in for people who might be
| less familiar with the history. Unicode broke out of the 16-bit
| limit as of version 2.0, released in July 1996, so yeah, a
| quarter century. The first RFC for UTF-8 (RFC 2044, Oct 1996)
| explicitly supports UCS-4. As far as I know, there has never
| been a time where UTF-8 restricted to the BMP (ie 3 bytes) has
| been in widespread use. In other words, I am questioning your
| assertion that 3-byte UTF-8 has ever been considered "long
| enough," except perhaps in sloppy implementations made without
| much care for following specs.
| chungy wrote:
| > except perhaps in sloppy implementations made without much
| care for following specs.
|
| Describes MySQL perfectly ;)
| chungy wrote:
| > Many filesystems also do not allow 4+ byte UTF 8 code points
| in filenames.
|
| I can't even think of a single example. Most filesystems just
| offer 255 8-bit units to filenames, where 4-byte UTF-8
| sequences are totally a non-issue. ZFS supports utf8only=on,
| which enforces that filenames do conform to UTF-8, in which
| case... 4-byte sequences are still not a problem.
|
| What's the filesystem that doesn't allow it?
| brandmeyer wrote:
| Apple filesystems are infamous for breaking this convention.
| IIUC, they will support codepoints which expand to 4 bytes in
| UTF-8, but their implicit normalization rules can trip up
| programs which expect behavior closer to POSIX norms.
| ghusbands wrote:
| Almost everything you claim here is wrong. UTF-8 has never been
| a 3 byte encoding, the spec initially specified [1] an up-to-
| six-byte encoding after Unicode had already gone past 16 bits
| [2] and then was reduced to 4 bytes by RFC3629 [3] in 2003.
| MySQL is the only piece of software that I know of where they
| invented their own length. (Though a lot of software often
| failed to validate it at all.)
|
| There's nothing recent about it - it was 22 years ago! There's
| no significant optimisation advantage in the different length,
| and you can see from the patch that reduced the max-length [4]
| that it wasn't about optimisation. I don't think you can name a
| single file system that restricted UTF-8 to three bytes.
|
| [1] https://datatracker.ietf.org/doc/html/rfc2044 [2]
| https://unicode.org/faq/utf_bom.html [3]
| https://datatracker.ietf.org/doc/html/rfc3629 [4]
| https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...
| dark-star wrote:
| It is you who does not know what he's talking about:
|
| > UTF-8 has never been a 3 byte encoding
|
| I never claimed that
|
| > There's nothing recent about it
|
| The non-BMP characters are "recent" because 10 years ago the
| non-BMP was not allocated except for some small areas. Also I
| said it "became popular recently", due to emoji. Before that,
| non-BMP codepoints were rarely used
|
| > I don't think you can name a single file system that
| restricted UTF-8 to three bytes.
|
| WAFL[1] (unless you "format" it as utf8mb4, which was only
| implemented a few years ago...)
|
| [1]
| https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-
| fa...
| billpg wrote:
| Dear databases, please don't get hung up about string lengths
| when dealing with UTF8.
|
| If I ask for a UTF8 string with a max-length of 100, please don't
| apply the worse case scenario and allocate space for 100 emojis.
| Please give me a box of 100 bytes and allow me to write any UTF-8
| string that can fit into 100 bytes in there.
|
| 100 ASCII characters. 20 emojis. Any mixture of the two.
|
| If I ask for UTF-8, it'll be because I'd like to make advantage
| of UTF-8 and I accept the costs. If that means I can't quickly
| jump to the character at index 84 in a string, no problem, I've
| accepted the trade-off.
| grogers wrote:
| Yes! This and other charset oddities almost make me want to
| store all strings as varbinary and handle any conversions
| to/from a string type to utf8 bytes client side. It makes ad-
| hoc query access from a shell much more annoying though.
| billpg wrote:
| I wonder if you can do LIKE with a VARBINARY value.
|
| "SELECT * FROM Users WHERE Name LIKE @namePattern"
| @namePattern = UTF8.GetBytes("%billpg%")
| int_19h wrote:
| You're literally asking them to break the SQL standard. The
| meaning of N in CHAR(N) is the number of characters.
| petergeoghegan wrote:
| Why would the database "allocate space" unless it was truly
| necessary, in any case? While char(n) works that way, most
| applications don't actually use it, no matter the underlying DB
| system.
| mikl wrote:
| PostgreSQL recommends you use the text type everywhere, and use
| CHECK constraints if you really really really need to limit a
| field's length on the database level. (VAR)CHAR is considered a
| code smell. So many fields are artificially limited to 255
| bytes because that used to make a performance difference on
| MySQL 15 years ago.
| dchest wrote:
| More like "100 ASCII characters. Unknown number of emojis."
|
| You'll get codepoints truncated if you use bytes.
| billpg wrote:
| I would hope a database engine would throw an error instead
| of truncating bytes.
| karsinkk wrote:
| This might be a tad unrelated to the original post, but the
| following article is one of my favorite primers on Character
| sets/Unicode :
|
| https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
| superasn wrote:
| Excellent article, even though I already knew most of it, given
| it was written in 2003 must have been one of the most
| insightful and enlightening article written.
|
| Wonder if you have any more links like this? Please do share!
| cblconfederate wrote:
| will something terrible happen if i just convert my tables to
| utf8mb4? Will i have to upgrade each column?
| smarx007 wrote:
| Of course, nobody "refused" to fix a "bug". Instead, a non-
| conformant behavior was already relied upon by legacy systems out
| in the wild and the "fix" was added in a backwards-compatible
| way.
|
| Edit: Three bytes are enough to fit nearly any of the chars in
| use in any language, including Chinese and Japanese, so I can
| only assume someone "smart" in the MySQL dev team decided to
| "save space" (before emoji were a thing).
| CountSessine wrote:
| _Three bytes are enough to fit nearly any of the chars in use
| in any language, including Chinese and Japanese_
|
| With only 3 bytes you'll completely miss plane 2, the
| "Supplementary Ideographic Plane" which includes tons of
| Chinese-Japanese-Korean Han characters.
|
| I wish people would stop saying the supplementary characters
| are just for "emoji". Asian unification was very controversial
| and ultimately unsuccessful. Plane 1 and Plane 2 are important,
| especially if you're going to sell software or products in
| China or Japan where they are mandated.
| cblconfederate wrote:
| couldnt emoji fit in 3 bytes as well? I don't think there are
| that many ...
| rakoo wrote:
| utf8 is an encoding for unicode codepoints. Those codepoints
| are spread on a space that is extremely vast (up to ~4
| billion) for which that can be represented with up to 4
| bytes. It turns out emojis are positioned in a place where
| the first byte will never be 0, so even if there were only
| one it would require the full 4 bytes to encode them.
| the_mitsuhiko wrote:
| > Those codepoints are spread on a space that is extremely
| vast (up to ~4 billion)
|
| You are off by a lot. The maximum code point is about 21bit
| high (0x10FFFF). The space is only 1.1 million large.
| capitainenemo wrote:
| It could be he's thinking of the historical definition
| which included support for 5 character encoding with a
| maximum codepoint of U+7FFFFFFF or ~2 billion.
|
| https://en.wikipedia.org/wiki/UTF-8#History
|
| That was restricted I believe primarily for compatibility
| with more limited encodings like UTF-16.
|
| I guess it's possible that at some future point in human
| history when UTF-16 has been purged from memory, the 5
| character encoding might be allowed again. :)
| ninkendo wrote:
| Related: time to explain UTF-8! (In case anybody is
| curious. I personally think it's extremely clever and
| worth understanding:)
|
| Characters <128 are encoded with a single byte: 0xxxxxxx
|
| Characters >128 are encoded with multiple bytes.
|
| A two-byte character looks like:
|
| 110xxxxx 10xxxxxx (11 useful bits, representing code
| points 128-2047)
|
| A three-byte character looks like:
|
| 1110xxxx 10xxxxxx 10xxxxxx (16 useful bits, representing
| code points 2048-65535)
|
| A four-byte character looks like:
|
| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 useful bits,
| representing code points 65536-2097151)
|
| Now, technically this scheme could expand to 6-byte
| characters without getting confused with things like
| BOM/etc, however any code points larger than 2^21
| wouldn't be representable in UTF-16, which has its own
| set of constraints. This means the unicode consortium has
| basically limited themselves to two million or so
| possible code points, which is why UTF-8 doesn't need to
| go more than 4 bytes. (I wonder if a future unicode
| version will require a larger limit and would thus create
| a new "utf8mb6" scheme, and drop UTF-16 altogether?)
| _moof wrote:
| Almost all emoji are four bytes in UTF-8.
| cblconfederate wrote:
| were all the other positions taken?
| ak217 wrote:
| That's a terrible excuse. MySQL should have fixed this in a
| major release. I had to work with a production system that had
| all kinds of issues because of this bug (engineers assumed,
| with good reason, that UTF-8 meant UTF-8, when it did not).
|
| This kind of reasoning is how we end up with vulnerabilities
| like the recent one in Log4j. Just because a behavior made
| sense in the past, or an unfortunate bug made it into
| production, is no excuse to let it inflict damage in
| perpetuity.
| Arch-TK wrote:
| I don't see how introducing a new major release would fix
| this? people would use the old version (because of the
| breaking change) for a while still, you might even end up in
| a python2/3 situation.
| wruza wrote:
| Then they should have considered implementing utf-8mb3le
| surrogate pairs. What a missed opportunity!
| _moof wrote:
| Even that doesn't make any sense, because refusing to encode
| characters that require four bytes doesn't save any space; it
| just makes it impossible to encode those characters. Nothing
| about the other encoding lengths changes.
|
| The only thing I can figure is that something somewhere is
| using a 16-bit quantity for decoded codepoints. Four-byte
| encodings are for codepoints above FFFF. (Which I guess is
| still someone's idea of "saving space.")
|
| Edit: Apparently the max encoding length used to be six bytes,
| so there's literally no plausible explanation for this that
| doesn't end with "thank god I stopped having to deal with MySQL
| over a decade ago."
| NelsonMinar wrote:
| In 2022 with a new release of MySQL or MariaDB, is it still
| possible to create a database with "utf8" encoding? Does it print
| a warning?
| maxfurman wrote:
| Let's say for the sake of discussion that your MySQL db has a lot
| of tables encoded as "utf8." Are there any known drawbacks or
| gotchas to converting them en masse to "utf8mb4"? Is this a
| lengthy operation?
| kerblang wrote:
| The author of original article links to this one in their post:
|
| https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
|
| And yeah, there's gotchas in there, the main one being that
| strings that barely fit now won't fit, and also maximum lengths
| on indexable strings.
| chris_wot wrote:
| I can't imagine there would be too many, other than it is a
| tedious operation not easily automated. Depending on your data,
| might take a while.
|
| Can't believe that neither MySQL or MariaDB has created a
| utility to do the conversions for you automatically.
| spookthesunset wrote:
| Schema changes to large tables are always huge pain in the ass
| in MySQL. Every team I've worked on goes to great lengths to
| avoid changing the schema, which leads to all kinds of exciting
| anti-patterns like recycling old fields even if their name no
| longer matches what is stored in them.
| sshine wrote:
| I suspect this is because for large databases, schema changes
| lock the tables for extended periods of time, and if this
| time exceeds a reasonable time for a service window, you need
| good database replication. And you probably don't have good
| MySQL database replication unless you're on AWS or similar.
| spookthesunset wrote:
| That's exactly what it is. If you have a big table you are
| gonna have a pretty long outage while you update that
| schema. Like in the last place I worked our largest table
| would take more than an hour for a schema update. So we
| never did schema updates.
| Animats wrote:
| That was set up when Microsoft and Java had standardized on
| UTF-16. So this can represent the UTF-16 subset of Unicode, which
| is Unicode Plane 0, the Basic Multilingual Plane (BMP). The
| higher-numbered "astral planes" of UTF-8 were rarely used. All
| modern languages with a significant user base are covered in
| Plane 0. Plane 1, the Supplementary Multilingual Plane, with
| Cretan Linear B, Egyptian hieroglyphics, and such, was seldom
| needed. Few people had fonts for those, anyway.
|
| Because of the way UTF-8 is encoded, it takes 3 bytes to
| represent the UTF-16 set. That's because it only takes one byte
| for ASCII characters. Hence, 3-byte MySQL data.
|
| Emoji, though, were put in Unicode Plane 1. That's where mass
| demand for the astral planes came from. More four byte UTF-8
| characters started showing up in data.
| tremon wrote:
| A UTF-16 "subset" of Unicode doesn't exist. all UTF-* encodings
| are just that: _encodings_. They can all represent the entire
| Unicode character set. (Except for UTF-7, which never was an
| official standard).
|
| What you're referring to is UCS-2 [0], which is UTF-16 without
| support for the high surrogates that make up the rest of the
| Unicode character set. But to imply that UTF-16 is a subset of
| Unicode is just not true.
|
| [0] https://en.wikipedia.org/wiki/UCS-2
| Dylan16807 wrote:
| That they can all represent all of unicode is true, but
| they're not "just" encodings either. The current code space
| of unicode is _built around_ UTF-16. That 's why there are 17
| planes.
| light_hue_1 wrote:
| There's so much wrong with this answer. It completely confuses
| Unicode, with Unicode encodings, and with non-Unicode
| encodings.
|
| > So this can represent the UTF-16 subset of Unicode
|
| There is no UTF-16 subset of Unicode. UTF-16 has surrogate
| pairs, it can represent all of Unicode. You are talking about
| UCS-2, a format which has no surrogate pairs.
|
| > That was set up when Microsoft and Java had standardized on
| UTF-16
|
| Java standardized on UCS-2 initially (which is what you're
| talking about). In the mid 2000s they moved to actual UTF-16.
|
| > The higher-numbered "astral planes" of UTF-8
|
| "astral planes" are not a UTF-8 concept. They're a Unicode
| concept. A Unicode code point has 6 hex digits. The plane is
| just the first 2 of those 6 digits. So code points 0000- FFFF
| are in the BMP (Basic Multilingual Plane), etc. People nickname
| planes above the 00 plane, BMP, astral planes.
|
| > Plane 1 ... was seldom needed. Few people had fonts for
| those, anyway.
|
| So "rarely", that the majority of humanity writes in languages
| that aren't in BMP. CJK (Chinese, Japanese, Korean and
| sometimes Vietnamese) languages need characters outside of BMP.
|
| > Emoji, though, were put in Unicode Plane 1. That's where mass
| demand for the astral planes came from. More four byte UTF-8
| characters started showing up in data.
|
| That's totally untrue. The demand always existed by the
| majority of humanity, you just never cared about this massive
| problem until it impacted you.
| Dylan16807 wrote:
| > That's totally untrue. The demand always existed by the
| majority of humanity, you just never cared about this massive
| problem until it impacted you.
|
| I think you swung so far in the other direction you also
| landed in untrue territory. Originally the intent was to
| encode all characters in "modern use" and 16 bits was
| probably enough to do that for the entire world.
|
| Also CJK is about one quarter of the world population.
| loulouxiv wrote:
| UTF-16 can represent the full range of Unicode codepoints by
| using couples of surrogates
| mort96 wrote:
| You're right. Replace UTF-16 with UCS-2 and the comment
| sounds at least slightly more correct.
| Animats wrote:
| Sort of. Applications using UTF-16 have to be aware of pairs
| at the application level. Many are not.
| int_19h wrote:
| This isn't a consequence of using UTF-16 as such - Java,
| .NET etc could totally have an API around UTF-16 strings
| that handles surrogate pairs. The problem, rather, is that
| those languages introduced a 16-bit type that they called
| "character", even though it wasn't even a Unicode
| codepoint. And then used that type throughout all string
| APIs, including strings themselves (indexing etc).
|
| In .NET land you're now supposed to use
| https://docs.microsoft.com/en-
| us/dotnet/api/system.text.rune instead. It transparently
| handles surrogate pairs, so the app needn't be aware of
| anything - and yet the internal encoding is still UTF-16.
| ninju wrote:
| This HN posting should be marked with (2016)
|
| Is the issue still an issue?
| Deukhoofd wrote:
| Yes: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-
| utf8...
| dang wrote:
| Year added above. Thanks!
| loeg wrote:
| Yes and yes.
|
| https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
|
| > The utf8 Character Set (Alias for utf8mb3)
| oleg_antonyan wrote:
| Thanks to emoji (almost)everything now supports unicode
| leros wrote:
| In addition to emojis, I've also had issues with certain
| characters copied from Microsoft Word being 4 bytes.
| sshine wrote:
| In MySQL, if you want to make sure that your Unicode fields are
| mangled, never use 'utf8', use 'utf8mb3'. Since 'utf8' is an
| alias to 'utf8mb3', that alias might eventually be updated to
| point to 'utf8mb4' which won't mangle your characters.
|
| The only safe choice here is to explicitly use 'utf8mb3'.
| didip wrote:
| MySQL is very similar to PHP in this regard. The good function
| name is actually broken so a new function with similar name is
| needed. This creates confusion for new people when adopting the
| technology.
| egeozcan wrote:
| Oh the good old "mysql_real_escape_string"[0] days...
|
| I forgot most of my PHP knowledge but that somehow got stuck.
|
| [0]: https://www.php.net/manual/en/function.mysql-real-escape-
| str...
| duskwuff wrote:
| You can blame MySQL for that one. The function name is
| straight out of the MySQL C API:
|
| https://dev.mysql.com/doc/c-api/8.0/en/mysql-real-escape-
| str...
| fareesh wrote:
| I decided to build all my new projects on postgres a few years
| ago. I am often reminded of why that was a good decision.
| [deleted]
| pkrumins wrote:
| Does anyone know if utf8mb4 is variable length or does each
| character take full 4 bytes?
| butlerm wrote:
| It is variable length, by definition. If it were four bytes per
| code point it would be named something more like UCS-4 or
| UTF-32.
| pkrumins wrote:
| Thanks!
| [deleted]
| kolaente wrote:
| Also relevant about this:
| https://blog.koehntopp.info/2022/01/12/utf8mb4.html
| patrickcteng wrote:
| I've had to "upgrade" a couple to utf8, and varchar(255) bites me
| all the time.
| qwerasdf0987 wrote:
| I wrote about supporting emojis in Ruby on Rails / MySQL here:
|
| https://josephecombs.com/2018/05/06/how-to-support-emojis-wi...
| bambax wrote:
| > _If you need a database, don't use MySQL or MariaDB. Use
| PostgreSQL._
|
| Sure. But for many use cases, SQLite is enough.
| ashvardanian wrote:
| Wow! I was working on this issue in our DBMS product today!
|
| Fun suggestion, try making a JSON string with a NULL character
| somewhere in the middle. It will be encoded as \u0000 and is a a
| valid UTF-8 code, but most C based systems will truncate the
| string by estimating its length via strlen.
|
| Java community and some other software vendors designed the
| Modified UTF-8, which replaces the zero with a 2-byte code point.
| Sleek. Aside from the fact, that you are modifying the data that
| customer wants to stay consistent.
|
| Postres explicitly bans such cases in the VARCHAR, not sure if it
| can fit in their JSON columns. Who tried?
| cerved wrote:
| Null-bytes are not valid characters in JSON strings. Nor any
| other control characters for that matter
| lsllc wrote:
| A "hello devops" article from Jan 3, 2022 discussing this topic
| was posted by /u/flokoe a week ago:
|
| https://www.hellodevops.blog/posts/database-character-sets-a...
|
| And the ensuing discussion:
|
| https://news.ycombinator.com/item?id=29793916
| CountSessine wrote:
| That's kind of funny - Oracle has a similar issue. If you want
| industry-standard UTF-8, you have to specify "AL32UTF8" as your
| encoding. "UTF8" is kind of crazy - it's this monstrous
| abomination called CESU-8 (https://en.wikipedia.org/wiki/CESU-8),
| which isnt' UTF-8 at all - it's actually this weird "UTF-16
| complete with surrogate pairs wrapped in a UTF-8 shell" thing.
| kyralis wrote:
| > Database systems have subtle bugs and oddities, and you can
| avoid a lot of bugs by avoiding database systems.
|
| Wat? That is... not a great takeaway from this experience. "Oh,
| this library had bug reported in it! I know, that means I should
| just write my own, because clearly they're all buggy and my
| software is always perfect."
| jerf wrote:
| Well, I mean, the good news is that if you write your own
| database system, you'll have so many _other_ bugs in the way
| that you 'll likely never reach this particular one!
|
| I suggest it can be called the Sirius Database.
|
| "It is very easy to be blinded to the essential uselessness of
| [the new database you just wrote] by the sense of achievement
| you get from getting them to work at all. In other words--and
| this is the rock solid principle on which the whole of the
| Corporation's Galaxy-wide success is founded--their fundamental
| design flaws are completely hidden by their superficial design
| flaws."
| piyh wrote:
| >their fundamental design flaws are completely hidden by
| their superficial design flaws
|
| That hit me right in my 9-5 "not invented here" nightmare.
| acdha wrote:
| It's especially specific to MySQL -- switching to Postgres is
| great because in addition to better performance and features
| there are all of these rakes in the grass you can forget about.
| Unicode just works, you never lose data because someone didn't
| scrupulously check the warnings after every query, etc.
| Thaxll wrote:
| PG is actually overall slower.
| acdha wrote:
| In my experience that's only true for very simple queries
| which play to MySQL's strongest points, not complex ones or
| under high transaction volume. I haven't found that to be a
| good win since the 2000s because caches usually soak up the
| easy queries and as your app gains complexity you hit
| performance cliffs which are harder to deal with,
| especially without some of the more advanced SQL features.
| MySQL 8 seems to have improved in this regard.
| tehbeard wrote:
| Instead you run face first into the thread per connection
| limit rake...
| stjohnswarts wrote:
| I love that expression "rakes in the grass", it brings back
| childhood memories of quicksand, stop/drop/roll, and stepping
| on rakes as being things I'd always have to watch out for as
| an adult, plus all the loony toon cartoons who used it.
| TedDoesntTalk wrote:
| Doesn't Postgres have its own set of unique of problems?
| acdha wrote:
| I've encountered far fewer and nothing treacherous like the
| "you lost data but we didn't tell you about it" stuff MySQL
| is famous for.
| mschuster91 wrote:
| Indeed, the worst of which is you can't do in-place
| upgrades - upgrading cluster versions is an insanely
| complex ride.
| onphonenow wrote:
| I routinely upgrade versions - on AWS RDS this is just a
| click or two in my case.
|
| https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/US
| ER_...
|
| When you say "complex ride" what does that mean?
| jrockway wrote:
| I am not sure if that procedure covers major versions (13
| -> 14 for example), but if it does, they are ahead of the
| curve. On GCP, you have to backup the database and
| restore the backup to a new database. On self-hosted
| installs, you have to pick one of these processes:
| https://www.postgresql.org/docs/14/upgrading.html
| acdha wrote:
| It does. It's really easy to outsource.
| AlisdairO wrote:
| To confirm, RDS does support in place major version
| upgrade for postgres.
| onphonenow wrote:
| They do an in_place upgrade using pg_upgrade I believe.
| They do have to regenerate system tables - they also
| automate I think some updates to system statistics. But
| for very large databases this really is the way to go I
| think.
|
| They do suggest a backup before you run this in case
| there are problems, most folks have periodic backups
| anyways, but the backup is not used.
| mschuster91 wrote:
| > When you say "complex ride" what does that mean?
|
| On managed postgresql services like RDS it's easy, yes,
| because Amazon does all of the nasty work for you and has
| already seen all the kinks you may run into.
|
| If you are on your own and using Docker, you essentially
| need a third-party image that has the old and new
| binaries (https://github.com/tianon/docker-postgres-
| upgrade). Good f...ing luck if you dare run PostgreSQL
| inside a Kubernetes environment, because adapting that
| guide to _that_ is a can of worms large enough to make
| anyone squirm.
|
| If you are on your own and are using distribution
| packages, it's more complicated
| (https://blog.samuel.domains/blog/tutorials/from-stretch-
| to-b...).
|
| If you are on your own and use postgres from source,
| well... have fun, there's a reason why I prefer using
| distribution-maintained packages when possible.
| paulryanrogers wrote:
| Fun fact InnoDB in place updates have plenty of caveats
| too: must be last column, cannot have non-default on-
| update/delete clauses. And if you workaround with any
| table-rename solutions like pt-online-schema-change then
| you might crash your server. (Some MySQL 8 versions crash
| on table renames.)
|
| EDIT: Oops, guess you meant version updates not DDL.
| Leaving comment for posterity.
| iruoy wrote:
| Postgres doesn't have a way to change the order of columns.
| Of course you could make a view, but my code doesn't use a
| view and I don't really want to make one just for when I
| want to check the database.
| mikl wrote:
| If that's the biggest problem you have with your
| database, be happy.
| corrigible wrote:
| Wouldn't a single application of a temp table be
| sufficient?[0]
|
| 0: https://brianchildress.co/reorder-columns-postgres/
| time0ut wrote:
| This is a pretty uncharitable reading considering the very next
| line is a recommendation to use PostgreSQL if you need a
| database.
| dang wrote:
| " _Please don 't pick the most provocative thing in an article
| or post to complain about in the thread. Find something
| interesting to respond to instead._"
|
| https://news.ycombinator.com/newsguidelines.html
| dagmx wrote:
| A) it's an obvious joke B) even if you take it literally, they
| never suggest implementing something new
|
| You're misreading this on multiple levels
| cbg0 wrote:
| I'm pretty sure you missed the joke.
| wonder_er wrote:
| a while ago, in my first engineering job, I was tracking down a
| strange bug related to chinese character sets.
|
| Turns out some of the db tables were using utf8 and latin1;
| changing them to `utf8mb4` fixed it right up!
|
| https://josh.works/troubleshooting-chinese-character-sets-in...
|
| It was a fun bug to work on. I learned a lot about character
| encoding, and enjoyed bringing a refined "process" to the table.
| Dylan16807 wrote:
| Also make sure mysqldump is using utf8mb4 or you'll get character
| replacement in your backups. This often needs to be configured
| separately.
| TedDoesntTalk wrote:
| > Back in 2002, MySQL gave users a speed boost if users could
| guarantee that every row in a table had the same number of bytes.
| To do that, users would declare text columns as "CHAR"
|
| This database type has existed in many databases since at least
| the 1980s
| mrcarruthers wrote:
| And here's the commit that changed it:
| https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e....
| It was originally set to max 6 bytes but for some reason someone
| dropped it to 3.
|
| They also seem to be slowly phasing it out. Internally it's
| utf8mb3 and utf8 is just an alias. The idea is to eventually make
| utf8 an alias to utf8mb4.
| jmnicolas wrote:
| It's because of things like that it's still hard for me to
| respect MySQL in 2022.
|
| I use Postgres since 2008. It has never bitten me even once.
| ars wrote:
| For something that was fixed 12 years ago?
|
| If Postgres has never bitten you, then you aren't using it
| much.
| mikl wrote:
| It's not fixed. MySQL still has a datatype called "utf8"
| that, in fact, is not utf-8 but a limited subset thereof.
|
| PostgreSQL has some sharp edges, sure, but a lot less obvious
| footguns like these.
| tapoxi wrote:
| Maybe this is a cynical take - but we used https://pgloader.io/ a
| few years ago to migrate to Postgres, and have never been
| happier. MySQL has a lot of stupid decisions like this.
| gjs278 wrote:
| cakoose wrote:
| Specifically about UTF-8 in Postgres: "PostgreSQL collation is
| a massive footgun"
| https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...
|
| Postgres defaults to the system's collation for UTF-8, which
| caught me by surprise. Now I'm stuck with a production DB where
| text prefix queries, which should be a perfect use of a btree,
| are a full table scan. Probably gonna have to take some
| maintenance downtime to fix the situation :-\
|
| And more generally about operating Postgres in production:
| https://blog.nelhage.com/post/some-opinionated-sql-takes/
|
| _As for Postgres, I have enormous respect for it and its
| engineering and capabilities, but, for me, it's just too damn
| operationally scary. In my experience it's much worse than
| MySQL for operational footguns and performance cliffs, where
| using it slightly wrong can utterly tank your performance or
| availability. In addition, because MySQL is, in my experience,
| more widely deployed, it's easier to find and hire engineers
| with experience deploying and operating it. Postgres is a fine
| choice, especially if you already have expertise using it on
| your team, but I've personally been burned too many times._
| jtc331 wrote:
| What else would they reasonably default to? Are you proposing
| requiring collations to always be specified (a breaking
| change from the SQL spec)?
| seanw444 wrote:
| I think the open-source nature is a major factor in the
| reason it's chosen so much. That's why I roll it.
| Arch-TK wrote:
| What do you use instead?
| dijit wrote:
| I'm a big posgresql fan, but I'd maybe argue that you're making
| tradeoffs. PG has weird behaviour sometimes too.
|
| I don't think it's always smart to just change out your data
| layer.
|
| But, if you're starting a new project, I do think PG is one of
| the better options and tends to follow the principle of least
| surprise. (hence: big fan)
| leetrout wrote:
| As a huge Postgres fan working at a shop using MySQL I'll
| point out the first thing I saw MySQL does that Postgres
| doesn't:
|
| Index hints.
|
| If you aren't hitting an index in Postgres you have to dig in
| to table stats and figure out what is wrong but MySQL gives
| you more control.
|
| However, I would still rather work with Postgres AND have to
| juggle a connection pooler than deal with MySQL. Transactions
| on DDL are _great_ and the ability to use foreign keys across
| partitioned tables is how it should be.
| samcrawford wrote:
| > I would still rather work with Postgres AND have to
| juggle a connection pooler
|
| Is this comment relating to the overhead of idle
| connections, which has historically necessitated the use of
| a pooler in front of PG? If so, I believe this is resolved
| in postgres 14
|
| https://pganalyze.com/blog/postgres-14-performance-
| monitorin...
| spookthesunset wrote:
| But at least Postgres has much better tools for learning
| about the query plans it might use. The Postgres output of
| "EXPLAIN" is much much better than MySQL.
|
| It's a shame that almost every job I worked at uses MySQL
| and not Postgres. But that could be because those companies
| all got their start like a decade or more ago when Postgres
| was not as well known.
| pzduniak wrote:
| > As a huge Postgres fan working at a shop using MySQL I'll
| point out the first thing I saw MySQL does that Postgres
| doesn't:
|
| > Index hints.
|
| Eh, it's an extension. One you shouldn't use, but it's
| there.
| leetrout wrote:
| I didn't know this!
| hn_throwaway_99 wrote:
| > PG has weird behaviour sometimes too.
|
| Perhaps, but I'd argue that the "weird" behavior of postgres
| just tends to be clearly thought out design decisions that
| they made for a valid reason that may cause you some pain
| with how you use it (e.g. their process-per-connection
| model).
|
| MySQL's "weird" behavior, on the other hand, just tends to be
| completely invalid footguns like this. Despite what some
| people are arguing in this thread, a 3-byte version of UTF-8
| was never in any spec anywhere and was an invalid shortcut
| from day 1.
| ipaddr wrote:
| Oracle, Postgres, MySql, mSql, sqlite all make weird
| behaviour in different situation. You pick your poison.
|
| I hated how postgres forced you to create a system user to
| connect I wonder if it still requires this.
| leftnode wrote:
| I love, love, love Postgres! The _one_ feature MySQL has that
| Postgres doesn 't is the ability to add a new column in an
| arbitrary position in the table. I like my tables to have a
| common "layout" and I would love to have this feature in
| Postgres.
| nkozyra wrote:
| This is annoying but it's not a huge hassle to
| duplicate/copy/rename, assuming your data is small and not
| sharded in some unique way
| teaearlgraycold wrote:
| This seems like something that could be done client side
| instead. But maybe not an option for psql CLI
| srcreigh wrote:
| MySQL also lets you control the layout of your table on disk
| via BTree primary storage and composite primary keys.
| andrewmcwatters wrote:
| People say this all the time but never elaborate on what they
| are nor link to any sources that help educate people what they
| are, so I just assume that when people say this, they don't
| know what they're talking about and are just parroting.
| JangoSteve wrote:
| I also helped with a project once that used pgloader to
| migration an old MySQL db to Postgres, and I think this article
| may actually explain one of the issues we found. We had a
| UTF-8-configured table in MySQL (we also discovered in the
| project, that MySQL can have encodings set differently per-
| table, while Postgres sets the encoding for the entire
| database) with UTF-8 data, and when we migrated it into a
| UTF-8-configured Postgres database, some of the UTF-8
| characters were silently corrupted. They were UTF-8 characters
| in both databases, so the corruption didn't raise any errors,
| but they were different characters, so when the data was read
| back out by the application, the text was different.
|
| We only caught this, because thankfully, we had written a
| manual checksum script, which looped through every table, read
| out all values from each row into the application, and hashed
| the results, then compared between when the app was connected
| to the source MySQL database vs the destination Postgres
| database. We ended up having to massage and fix those silently-
| corrupted characters.
| DonHopkins wrote:
| >Of course, they never advertised this (probably because the bug
| is so embarrassing).
|
| Not as embarrassing as being owned by Oracle.
| cryptos wrote:
| Best to be used with mysqli_real_escape_string ;-)
| https://www.php.net/manual/en/mysqli.real-escape-string.php
| tomwojcik wrote:
| I need to share something as literally today I fixed a bug in our
| project that's somewhat related.
|
| MS SQL encodes everything with UTF-16. Emojis (code points)
| require up to 4 bytes. If it's a grapheme (emoji constructed from
| emojis), it will be even more.
|
| We are using Django. If you check length of an emoji, it will
| give you `1` and Django assumes utf8 everywhere. If you try to
| save it to PostgreSQL with char field `max_length=1` it will work
| just fine, but on MS SQL it will fail as it requires 2 characters
| (4 bytes) in the DB!
|
| I tried it with MS SQL 2017, 2019 with different collations on
| nvarchar and I'm pretty sure there's no way around it.
|
| > Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and
| NVARCHAR(n), the n defines the byte storage size, not the number
| of characters that can be stored, it's important to determine the
| data type size you must convert to, in order to avoid data
| truncation.
|
| https://docs.microsoft.com/en-US/sql/relational-databases/co...
| gecko wrote:
| Generically, anything by Microsoft will historically have used
| UCS-2, and will use UTF-16 these days, so this is utterly
| unsurprising to me _as an experienced Windows dev_. Conversely,
| Linux (and POSIX, more generally) deciding that filename
| encoding is a per-filename and untracked thing is a bit lit,
| from my perspective. Point being: when it comes to handling
| unicode and foreign characters, just, like...always read the
| documentation. Assume nothing.
| capitainenemo wrote:
| Yep. Microsoft is the main reason for 2 of my favourite
| unicode links. https://utf8everywhere.org/ and
| https://simonsapin.github.io/wtf-8/
|
| And, apparently it's mildly inaccurate to say it uses
| UTF-16... it's more like UCS-2 with UTF-16 hacked in, with no
| validation. Thus WTF-8.
| ChrisSD wrote:
| No, it's UTF-16 with no validation at the kernel level.
| Invalid UTF-16 is also invalid UCS-2 as those code points
| were explicitly barred from use.
|
| In practice, only malware will create such broken names.
| High level software (e.g. Microsoft's own VSCode) will not
| handle broken UTF-16. And indeed the in-built UTF-8 code
| page will lossily decode UTF-16 (unpaired surrogates are
| replaced with the Unicode replacement character).
| capitainenemo wrote:
| Hm. Are you sure? Because the utf8everywhere article (and
| various microsoft related framework discussions) seem to
| suggest there's no validation anywhere. You can easily
| create partial codepoints and just hitting backspace in a
| text field can do it. That seems to imply there's no
| UTF-16 validation even at a higher level.
|
| But I will readily defer to your expertise on this. I've
| not coded in microsoft land for like 18 years. MFC was my
| last experience in this, where I still have this vague
| memory of being shocked by an API returning an int32 and
| instructing on casting to a void pointer (overloaded
| response message). No wonder they had issues with 64 bit
| migration at the time.
|
| Edit: cite on the utf8everywhere thing. "in plain Windows
| edit control (until Vista), it takes two backspaces to
| delete a character which takes 4 bytes in UTF-16. On
| Windows 7, the console displays such characters as two
| invalid characters, regardless of the font being used."
|
| Maybe they've improved since though. But surely there's a
| lot of that baggage in the libraries.
| user-the-name wrote:
| Windows Vista came out fifteen years ago.
| ChrisSD wrote:
| I mean, Vista is ~15 years old at this point. If anything
| that's still part of Windows makes the backspace mistake
| then if nothing else it's impressive it's survived this
| long without being noticed.
| pimeys wrote:
| With SQL Server 2019, you can write UTF-8 data to `VARCHAR` and
| `CHAR` fields, if the table collation is
| `LATIN1_GENERAL_100_CI_AS_SC_UTF8`.
|
| I feel the pain of UTF-16. When I was writing the Rust crate
| for the TDS protocol, all N-columns for strings always require
| you to do a full copy from `Vec<u16>` to `Vec<u8>` and back.
| tomwojcik wrote:
| I used this collation on 2019 when testing and the only thing
| that has changed was the error message (more verbose).
| dehrmann wrote:
| Memories... When I was at Amazon, my starter project was
| modifying Redshift to support four-byte UTF-8 characters, not
| just three-byte.
___________________________________________________________________
(page generated 2022-01-12 23:00 UTC)