[HN Gopher] Why do database columns have a character length of 191?
___________________________________________________________________
Why do database columns have a character length of 191?
Author : andyjih_
Score : 84 points
Date : 2021-05-17 18:18 UTC (4 hours ago)
(HTM) web link (www.grouparoo.com)
(TXT) w3m dump (www.grouparoo.com)
| ipaddr wrote:
| This number is based on the character set of the column type.
| With utf8, each character takes fewer bytes, so you get 255
| characters; with utf8mb4 you get 191, and utf8mb4 is MySQL's
| default now.
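| For illustration, a rough sketch of that arithmetic against the
| old 767-byte index key limit (table and column names made up):
| 
|     -- utf8    (3 bytes/char): 255 * 3 = 765 bytes (fits under 767)
|     -- utf8mb4 (4 bytes/char): 191 * 4 = 764 bytes (fits; 192 * 4 = 768 would not)
|     CREATE TABLE example (
|       name VARCHAR(191) CHARACTER SET utf8mb4,
|       INDEX (name)
|     ) ENGINE=InnoDB;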
| [deleted]
| mfer wrote:
| History, like this, is great to understand and learn from.
| shadowgovt wrote:
| Newbie database question:
|
| I'm surprised the database cares at all about field length for
| indexing purposes, because I'd assume for indexing purposes it
| would hash the field and be done with it; a hash is a great tool
| for doing quick equality-comparisons.
|
| Is this not sufficient because the index also needs to support
| sorting, so the original values matter to high resolution?
| ericyan wrote:
| A hash index is only useful for exact matches; a full table scan
| would be required for any other kind of query. That is why most
| database engines have settled on the B+ tree.
| 
| Regardless of the type of index, field length still matters if
| you care about performance. Smaller fields and smaller indices
| mean you can fit more things (the working set) in RAM and do
| less I/O when you have to hit disk.
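| As a rough illustration (PostgreSQL syntax, hypothetical users
| table), the difference in what each index type can serve:
| 
|     CREATE INDEX users_email_hash  ON users USING hash  (email);
|     CREATE INDEX users_email_btree ON users USING btree (email);
| 
|     SELECT * FROM users WHERE email = 'a@example.com';      -- either index works
|     SELECT * FROM users WHERE email >= 'a' AND email < 'b'; -- btree only; hash can't do ranges
|     SELECT * FROM users ORDER BY email;                     -- btree only; hash is unordered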
| throwaway81523 wrote:
| Hash indexing is sometimes used, but B-trees are more common:
| they let you select with ORDER BY, and you can keep inserting
| additional records as many times as you want by splitting nodes
| now and then, without ever having to rehash the whole index.
| masklinn wrote:
| Databases use btree indices for two reasons:
|
| 1. They provide much better locality and updatability. Because
| databases historically read their data from extremely slow IO
| (spinning rust), btrees are a great fit; they're also extremely
| predictable (search, insert, and delete all work in O(log n)).
| 
| 2. Hash indices only support exact matches, while btrees support
| prefix and range matching and are ordered, so when all the stars
| align you don't even need to touch the data pages: you can serve
| an entire query by just reading the index.
|
| Most databases allow opt-in hash indexes (and even other more
| complicated forms of indexes), but these are either much less
| flexible (hash) or much more expensive (gist/gin). The btree
| remains an excellent default.
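| To make the range/index-only point concrete, a small sketch
| (PostgreSQL syntax, hypothetical orders table); whether the
| planner actually uses an index-only scan depends on statistics
| and the visibility map:
| 
|     CREATE INDEX orders_created_id_idx ON orders (created_at, id);
| 
|     SELECT id FROM orders
|      WHERE created_at >= '2021-01-01' AND created_at < '2021-02-01'
|      ORDER BY created_at;
|     -- Every column touched is in the index, so this can be served
|     -- without reading the data pages (an index-only scan).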
| TazeTSchnitzel wrote:
| Prefixes are a neat trick. One nice consequence is that an
| index for the column pair (a, b) is also an index for (a).
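| For example (made-up table t with columns a and b):
| 
|     CREATE INDEX t_a_b_idx ON t (a, b);
| 
|     SELECT * FROM t WHERE a = 42;  -- can use t_a_b_idx (a is the leading column)
|     SELECT * FROM t WHERE b = 42;  -- generally cannot; b is not a leading prefix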
| pizza234 wrote:
| The article is missing important notions (and is at risk of being
| misleading) about index storage and access.
| 
| It is correct that InnoDB indexes used to have a per-entry limit
| of 767 bytes; this is however the index _left prefix_. It means
| that one can still index longer strings, but infix/suffix index
| access can't be performed. Prefix index access can (limited to
| the number of chars).
| 
| Now, if one is indexing very long strings, they should really
| think twice about what they're doing, as some other data
| structures may be more appropriate. Indexing is a balancing act,
| and this is not a problem of MySQL; it's an inherent problem of
| database systems. One may think they're going to add a 2048-byte
| index and speed up access, while they may actually be slowing
| down the system (e.g. because a full table scan may be faster,
| and the index slows down writing and query optimizing).
| 
| Note that nowadays the limit is not 767 bytes but 3072, at least
| for the default row format[1].
|
| 1=https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/innodb-lim...
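| For illustration, a prefix index in MySQL syntax (hypothetical
| table and column names); only the first 191 characters of the
| column go into the index:
| 
|     ALTER TABLE articles ADD INDEX idx_title (title(191));
| 
|     SELECT * FROM articles WHERE title LIKE 'How to%';    -- prefix access, index helps
|     SELECT * FROM articles WHERE title LIKE '%databases'; -- suffix access, needs a scan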
| paulryanrogers wrote:
| I always wondered how the index prefixing impacts RTL language
| strings.
| duskwuff wrote:
| It doesn't. LTR/RTL only applies at display time -- the first
| character of a string is still stored first, regardless of
| whether it's going to show up on the left or the right side
| of the screen.
| adolph wrote:
| As I understand it, the bytes are stored in the same order
| (setting aside big vs. little endian). RTL (or LTR) has to do
| with the visual representation and can be set by a nonprinting
| character.
|
| https://en.wikipedia.org/wiki/Bidirectional_text
|
| https://en.wikipedia.org/wiki/Right-to-left_mark
|
| https://betterexplained.com/articles/understanding-big-
| and-l...
| jzelinskie wrote:
| If you are building software that depends on a SQL database,
| always specify your encodings (and explicitly name your indexes)!
| I previously built an enterprise product that didn't specify its
| string encodings in its initial migration, and we spent years
| fielding support tickets and dealing with the hell of writing
| migrations that could support the default encodings across
| platforms, until we could justify copying the whole column.
| 
| Postgres largely avoids this problem by using utf8 as the default
| pretty much everywhere, but MySQL definitely has platforms that
| ship with latin1 as the default.
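| A rough sketch of what "specify your encodings and name your
| indexes" can look like in a migration (MySQL syntax, made-up
| table):
| 
|     CREATE TABLE users (
|       id    BIGINT PRIMARY KEY,
|       email VARCHAR(191) NOT NULL,
|       INDEX idx_users_email (email)
|     ) ENGINE=InnoDB
|       DEFAULT CHARSET=utf8mb4
|       COLLATE=utf8mb4_unicode_ci;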
| masklinn wrote:
| > Postgres largely avoids this problem by using utf8 as the
| default pretty much everywhere
|
| That's... not actually really relevant to the issue. And you
| can configure postgres to use a non-utf8 encoding as its
| default.
|
| Postgres avoids the issue by automatically moving large values
| outside of the row itself ("TOAST" storage). That's also why you
| can have an unbounded varchar in postgres, and there's no real
| difference between varchar, varchar(n) and text. And char(n) is
| _slower_ than the others: char(n) will always pad to (n), while
| varchar and text will only use whatever they need (+ some
| overhead).
| 
| TOAST works in terms of bytes, so your UTF8 might move to the
| next storage location earlier than if you'd used a fixed-size
| single-byte encoding. It just doesn't matter that much (it'll
| just be a bit slower) unless you actually reach the limit of
| TOAST storage itself (1GB per value).
| 
| And Postgres has its own issues with fixed-size storage, e.g.
| identifiers (names of tables, columns, indexes, constraints,
| etc.) can't be more than 63 bytes; anything longer is silently
| and implicitly truncated to that.
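| The identifier truncation is easy to observe (made-up name);
| PostgreSQL emits a NOTICE and keeps only the first 63 bytes:
| 
|     CREATE TABLE this_is_a_deliberately_very_long_table_name_well_past_the_sixty_three_byte_limit (id int);
|     -- NOTICE:  identifier "this_is_a_..." will be truncated to "this_is_a_..."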
| jzelinskie wrote:
| It's true that Postgres does this. But the problem I was
| highlighting was the developer error of failing to explicitly
| declare an encoding. It's only really related to the OP link
| insofar as, when you fail to do so on MySQL, you'll receive error
| messages that inform you of the dreaded 767-byte index limit.
| masklinn wrote:
| You can actually have a similar error in postgres, it's just
| that the index limit is at 2730 bytes, and it's unlikely you'll
| have indexed a column whose content can exceed 2.5k.
|
| It does happen though, as I've discovered (repeatedly,
| because I'm a slow learner).
| VintageCool wrote:
| MySQL finally ships with utf8mb4 by default in MySQL 8, but
| anything older still uses latin1 / latin1_swedish_ci by default
| and should be converted to utf8mb4.
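| The conversion itself is a one-liner per table (MySQL syntax,
| made-up name), though it rewrites the table, so plan for
| downtime or an online schema-change tool on anything large:
| 
|     ALTER TABLE my_table
|       CONVERT TO CHARACTER SET utf8mb4
|       COLLATE utf8mb4_unicode_ci;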
| jillesvangurp wrote:
| I ran into this in Amazon's RDS not so long ago (3 years?),
| where I wasted a good few hours figuring out why MySQL was
| mangling our emojis. A good test is trying to insert an emoji
| character, which will only work with utf8mb4, as an emoji uses
| the full 4 bytes that MySQL's broken 3-byte utf-8 implementation
| simply does not support. This is also the reason for the
| 191-character limit: 191 is 3/4 of the 255 characters you'd be
| able to store with the broken implementation, which in turn is
| roughly a third of InnoDB's maximum entry size of 764 bytes (the
| last one being \0), i.e. 4*191. They never changed the
| underlying implementation, hence the weird limit.
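| A minimal version of that test (MySQL syntax, hypothetical
| table); without utf8mb4 the insert fails or mangles the value:
| 
|     CREATE TABLE emoji_test (msg VARCHAR(191) CHARACTER SET utf8mb4);
|     INSERT INTO emoji_test (msg) VALUES ('😀');  -- a 4-byte code point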
| hinkley wrote:
| It's been years and I still get mad every time I think about
| the fact that 'utf8' is a broken encoding and they had to
| make a new one called utf8mb4.
|
| I have to trust my database. If I can't trust the authors of
| the database, then how do I trust the database?
| 7steps2much wrote:
| These kinds of stories are really interesting because they
| provide so much context for questions that I couldn't answer
| until now!
|
| Like why my MySQL couldn't create indexes on data I migrated from
| PostgreSQL. Granted, that ship has long since sailed and we
| decided to stick with PostgreSQL, but it's nice to know the
| reason nonetheless!
| superasn wrote:
| Site was not loading for me so here is an archived version to
| save you guys a click:
|
| https://web.archive.org/web/20210517182448if_/https://www.gr...
| remus wrote:
| > All modern popular relational database support (almost)
| unlimited sized strings with a text or blob-type column, so why
| not use that? The reason is indexes.
|
| Indexes are not the only reason. Typical databases (MySQL, MSSQL,
| etc.) are row oriented, so they store rows of data in blocks on
| disk. If you have a text or blob field, it is potentially
| unbounded in size, and if the field does get too big it will
| typically be moved 'off the row' and replaced with a pointer to
| the full object. If instead you have a varchar field, then its
| size is bounded and the db can keep all the data in contiguous
| blocks on the disk. Having all your data in contiguous blocks
| can improve read performance (though it's less of an issue with
| SSDs).
| jeltz wrote:
| PostgreSQL is also row oriented but can store small text and
| blob (bytea) fields in the main table. PostgreSQL decides per
| value whether it fits in the main table storage or not. This
| means that varchar and text are virtually identical.
| 
| This is done for all variably sized data types, including
| third-party GIS polygons.
| spookthesunset wrote:
| It always amazes me how well designed Postgres is. It's too bad
| MySQL has the brand recognition it does, because Postgres is
| just far and away the better choice for almost everything.
| Twisol wrote:
| Fascinating! But it replaces one question with another:
|
| > The way innodb MySQL databases worked was that you can only
| have 767 bytes for an index
|
| Was somebody a Boeing fan or is there another story behind the
| 767 figure?
|
| EDIT: aaand I figured it out. 256 * 3 is 768, then take away one
| for the null terminator. Wild.
| dmux wrote:
| Additional information/investigation I found:
| https://dba.stackexchange.com/questions/57005/why-are-innodb...
| masklinn wrote:
| > EDIT: aaand I figured it out. 256 * 3 is 768, then take away
| one for the null terminator. Wild.
|
| That's the consequence, not the cause. The cause is:
|
| > If you reduce the InnoDB page size to 8KB or 4KB by
| specifying the innodb_page_size option when creating the MySQL
| instance, the maximum length of the index key is lowered
| proportionally, based on the limit of 3072 bytes for a 16KB
| page size. That is, the maximum index key length is 1536 bytes
| when the page size is 8KB, and 768 bytes when the page size is
| 4KB.
|
| So some ancient MySQL probably used a 4KB page, which only had
| space for 768-byte index records (probably because of the page
| structure). When the page size was increased, the key length
| limit was likely kept.
|
| Incidentally, Postgres' default (BTree) index has a higher-but-
| not-infinite limit of 2730 bytes. It can be surprising when you
| index a varchar column without restricting column length and
| one day inserts start blowing up because someone tried
| inserting 5k worth of crap in the column.
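| A quick way to check which regime an instance is in, plus the
| proportional arithmetic the docs describe:
| 
|     SHOW VARIABLES LIKE 'innodb_page_size';
|     -- 16KB page -> 3072-byte maximum index key
|     --  8KB page -> 1536 bytes
|     --  4KB page ->  768 bytes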
| Twisol wrote:
| Nice extra context, thanks!
|
| You've now introduced two more magic constants, though, and
| so the question continues: where do those come from? Since
| 3072 is 3 * (16KB / 16), I'm guessing the 3 still comes from
| the original article's `Latin1` 3-byte encoding origin story.
| The 1/16th sounds like a balancing factor between the width
| of an index element and the number of elements per page.
| TonyTrapp wrote:
| This reminds me of the time when I wanted to do things "the
| right way", so instead of using phpMyAdmin I switched to MySQL
| Workbench (which I no longer use; it was slow and painful). Its
| default VARCHAR size was an oddly specific 45 characters.
|
| Edit: Seems like someone else was wondering about the same thing,
| with no clear answer:
| https://stackoverflow.com/questions/13174307/mysql-workbench...
___________________________________________________________________
(page generated 2021-05-17 23:00 UTC)