[HN Gopher] Why do database columns have a character length of 191?
       ___________________________________________________________________
        
       Why do database columns have a character length of 191?
        
       Author : andyjih_
       Score  : 84 points
       Date   : 2021-05-17 18:18 UTC (4 hours ago)
        
 (HTM) web link (www.grouparoo.com)
 (TXT) w3m dump (www.grouparoo.com)
        
       | ipaddr wrote:
        | This number comes from the character set of the column. With
        | utf8 (3 bytes per character) you can index up to 255
        | characters; with utf8mb4 (4 bytes per character), which is
        | MySQL's default now, you only get 191.
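        | 
        | A rough sketch of the arithmetic (MySQL; the table and index
        | names are made up for illustration):
        | 
        |   -- Old InnoDB index prefix limit: 767 bytes.
        |   --   utf8    (3 bytes/char): floor(767 / 3) = 255 chars
        |   --   utf8mb4 (4 bytes/char): floor(767 / 4) = 191 chars
        |   CREATE TABLE users (
        |     email VARCHAR(191) CHARACTER SET utf8mb4,
        |     UNIQUE KEY uniq_email (email)  -- 191 * 4 = 764 bytes
        |   );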
        
         | [deleted]
        
       | mfer wrote:
       | History, like this, is great to understand and learn from.
        
       | shadowgovt wrote:
       | Newbie database question:
       | 
       | I'm surprised the database cares at all about field length for
       | indexing purposes, because I'd assume for indexing purposes it
       | would hash the field and be done with it; a hash is a great tool
       | for doing quick equality-comparisons.
       | 
       | Is this not sufficient because the index also needs to support
       | sorting, so the original values matter to high resolution?
        
         | ericyan wrote:
          | A hash index is only useful for exact matches; a full
          | table scan would be required for any other kind of query.
          | That is why most database engines have settled on B+
          | trees.
          | 
          | Regardless of the type of index, field length still
          | matters if you care about performance. Smaller fields and
          | smaller indices mean you can fit more things (the working
          | set) in RAM and do less I/O when you have to hit disk.
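          | 
          | For example (PostgreSQL syntax, hypothetical table), a
          | hash index helps equality lookups only:
          | 
          |   CREATE TABLE accounts (id bigint, email text);
          |   CREATE INDEX accounts_email_hash
          |     ON accounts USING hash (email);
          | 
          |   -- Can use the hash index:
          |   SELECT * FROM accounts WHERE email = 'a@example.com';
          |   -- Cannot; needs a btree index or a full scan:
          |   SELECT * FROM accounts WHERE email < 'b';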
        
         | throwaway81523 wrote:
          | Hash indexing is sometimes used, but B-trees are more
          | common because they let you select with ORDER BY, and
          | because you can keep inserting additional records
          | indefinitely by splitting nodes now and then, without ever
          | having to rehash the whole index.
        
         | masklinn wrote:
         | Databases use btree indices for two reasons:
         | 
          | 1. They provide much better locality and updatability.
          | Because databases historically had to read data from
          | extremely slow IO (spinning rust), btrees are a great fit;
          | they're also extremely predictable (search, insert, and
          | delete all work in O(log n)).
          | 
          | 2. Hash indices only support exact matches. Btrees support
          | prefix and range matching, and they are ordered, so when
          | all the stars align you don't even need to touch the data
          | pages: you can serve an entire query by just reading the
          | index.
         | 
         | Most databases allow opt-in hash indexes (and even other more
         | complicated forms of indexes), but these are either much less
         | flexible (hash) or much more expensive (gist/gin). The btree
         | remains an excellent default.
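          | 
          | A sketch of both points (PostgreSQL syntax, hypothetical
          | table): range predicates and covered queries are where a
          | btree shines.
          | 
          |   CREATE TABLE events (
          |     created_at timestamptz,
          |     kind       text
          |   );
          |   CREATE INDEX events_created_idx ON events (created_at);
          | 
          |   -- Range matching uses the btree's ordering:
          |   SELECT count(*) FROM events
          |   WHERE created_at >= '2021-01-01'
          |     AND created_at <  '2021-02-01';
          |   -- Only indexed columns are referenced, so the planner
          |   -- may answer this with an index-only scan, never
          |   -- touching the table's data pages.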
        
           | TazeTSchnitzel wrote:
           | Prefixes are a neat trick. One nice consequence is that an
           | index for the column pair (a, b) is also an index for (a).
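            | 
            | Roughly (PostgreSQL syntax, hypothetical table): the
            | index is ordered by the first column, then the second.
            | 
            |   CREATE TABLE orders (
            |     customer_id bigint,
            |     created_at  timestamptz
            |   );
            |   CREATE INDEX orders_cust_date_idx
            |     ON orders (customer_id, created_at);
            | 
            |   -- Uses the full (a, b) index:
            |   SELECT * FROM orders
            |   WHERE customer_id = 42
            |     AND created_at > '2021-01-01';
            |   -- Also uses it, via the (a) prefix:
            |   SELECT * FROM orders WHERE customer_id = 42;
            |   -- A filter on created_at alone cannot use this
            |   -- index efficiently.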
        
       | pizza234 wrote:
       | The article is missing important notions (and at risk of being
       | misleading) about index storage and access.
       | 
        | It is correct that InnoDB indexes used to have a per-entry
        | limit of 767 bytes; this, however, is a limit on the index
        | _left prefix_. It means that one can still index longer
        | strings, but infix/suffix index access can't be performed.
        | Prefix index access can (limited to the length of the
        | indexed prefix).
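        | 
        | A sketch of a prefix index (MySQL/InnoDB, hypothetical table
        | and index names):
        | 
        |   CREATE TABLE pages (
        |     url VARCHAR(2048) CHARACTER SET utf8mb4,
        |     KEY url_prefix_idx (url(191))  -- first 191 chars only
        |   );
        | 
        |   -- Prefix access: can use the index.
        |   SELECT * FROM pages
        |   WHERE url LIKE 'https://example.com/%';
        |   -- Suffix access: cannot.
        |   SELECT * FROM pages WHERE url LIKE '%/checkout';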
       | 
        | Now, if one is indexing very long strings, they should
        | really think twice about what they're doing, as some other
        | data structure may be more appropriate. Indexing is a
        | balancing act, and this is not a problem specific to MySQL;
        | it's an inherent problem of database systems. One may think
        | they're going to add a 2048-byte index and speed up access,
        | while they may actually be slowing down the system (e.g.
        | because a full table scan may be faster, and the index slows
        | down writes and query optimization).
       | 
        | Note that nowadays the limit is not 767 bytes but 3072, at
        | least for the default row format[1].
       | 
        | [1] https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/innodb-lim...
        
         | paulryanrogers wrote:
         | I always wondered how the index prefixing impacts RTL language
         | strings.
        
           | duskwuff wrote:
           | It doesn't. LTR/RTL only applies at display time -- the first
           | character of a string is still stored first, regardless of
           | whether it's going to show up on the left or the right side
           | of the screen.
        
           | adolph wrote:
            | As I understand it, the bytes are stored in the same
            | order (endianness aside). RTL vs. LTR only affects the
            | visual representation, and can be set by a non-printing
            | character.
           | 
           | https://en.wikipedia.org/wiki/Bidirectional_text
           | 
           | https://en.wikipedia.org/wiki/Right-to-left_mark
           | 
           | https://betterexplained.com/articles/understanding-big-
           | and-l...
        
       | jzelinskie wrote:
       | If you are building software that depends on a SQL database,
       | always specify your encodings (and explicitly name your indexes)!
        | I previously built an enterprise product that didn't specify
        | its string encodings in its initial migration. We spent
        | years fielding support tickets and wrestling with migrations
        | that had to support the default encodings across platforms,
        | until we could justify copying the whole column.
       | 
       | Postgres largely avoids this problem by using utf8 as the default
       | pretty much everywhere, but MySQL definitely has platforms that
       | ship with defaults as latin1.
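        | 
        | Something like this in the initial migration (MySQL;
        | hypothetical table and index names) sidesteps the whole
        | class of problems:
        | 
        |   CREATE TABLE profiles (
        |     id         BIGINT PRIMARY KEY,
        |     first_name VARCHAR(191) NOT NULL,
        |     KEY idx_profiles_first_name (first_name)
        |   ) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;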
        
         | masklinn wrote:
         | > Postgres largely avoids this problem by using utf8 as the
         | default pretty much everywhere
         | 
         | That's... not actually really relevant to the issue. And you
         | can configure postgres to use a non-utf8 encoding as its
         | default.
         | 
          | Postgres avoids the issue by automatically moving large
          | values outside of the row itself ("TOAST" storage). That's
          | also why you can have an unbounded varchar in Postgres,
          | and there's no real difference between varchar, varchar(n)
          | and text. And char(n) is _slower_ than the others: char(n)
          | will always pad to (n), while varchar and text will only
          | use whatever they need (+ some overhead).
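          | 
          | A quick way to see the padding difference (PostgreSQL,
          | throwaway table for illustration):
          | 
          |   CREATE TABLE pad_demo (
          |     a char(10), b varchar(10), c text
          |   );
          |   INSERT INTO pad_demo VALUES ('hi', 'hi', 'hi');
          |   -- char(10) stores the value padded to 10 characters;
          |   -- varchar and text store just 'hi' plus a small header.
          |   SELECT pg_column_size(a), pg_column_size(b),
          |          pg_column_size(c)
          |   FROM pad_demo;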
         | 
          | TOAST works in terms of bytes, so your UTF8 data might be
          | moved to out-of-line storage earlier than if you'd used a
          | fixed-size single-byte encoding. It just doesn't matter
          | that much (it'll just be a bit slower) unless you actually
          | reach the limit of TOAST storage itself (1GB per value).
         | 
          | And Postgres has its own issues with fixed-size storage,
          | e.g. identifiers (names of tables, columns, indexes,
          | constraints, etc.) can't be more than 63 bytes; anything
          | longer is silently and implicitly truncated to that.
        
           | jzelinskie wrote:
            | It's true that Postgres does this. But the problem I was
            | highlighting was the developer error of failing to
            | explicitly declare an encoding. It's only really related
            | to the OP link insofar as, when you fail to do so on
            | MySQL, you'll receive error messages that inform you of
            | the dreaded 767-byte index limit.
        
             | masklinn wrote:
              | You can actually have a similar error in Postgres;
              | it's just that the index limit is 2730 bytes, and it's
              | unlikely you'll have indexed a column whose content
              | can exceed 2.5k.
             | 
             | It does happen though, as I've discovered (repeatedly,
             | because I'm a slow learner).
        
         | VintageCool wrote:
         | MySQL finally ships with utf8mb4 by default in MySQL 8, but
         | anything older still uses latin1 / latin1_swedish_ci by default
         | and should be converted to utf8mb4.
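          | 
          | The conversion itself can be as simple as this (MySQL;
          | hypothetical table/database names, and worth testing on a
          | copy first since it rewrites the table):
          | 
          |   ALTER TABLE comments
          |     CONVERT TO CHARACTER SET utf8mb4
          |     COLLATE utf8mb4_unicode_ci;
          |   -- New tables can pick up the right default too:
          |   ALTER DATABASE app_db
          |     CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;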
        
           | jillesvangurp wrote:
            | I ran into this in Amazon's RDS not so long ago (3
            | years?), where I wasted a good few hours figuring out
            | why MySQL was mangling our emojis. A good test is trying
            | to insert an emoji, which only works with utf8mb4
            | because it needs the full 4 bytes that MySQL's broken
            | 3-byte utf-8 implementation simply does not support.
            | This is also the reason for the 191 character limit: 191
            | is roughly 3/4 of the 255 characters you could index
            | with the broken 3-byte implementation, since 191 * 4 =
            | 764 bytes fits under the same 767-byte InnoDB index
            | prefix limit that 255 * 3 = 765 did. They never changed
            | the underlying limit. Hence the weird number.
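            | 
            | A quick sanity check before trusting an instance with
            | 4-byte characters (MySQL):
            | 
            |   SHOW VARIABLES LIKE 'character_set%';
            |   SHOW VARIABLES LIKE 'collation%';
            |   -- Look for utf8mb4; plain "utf8" is the 3-byte-max
            |   -- implementation.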
        
           | hinkley wrote:
           | It's been years and I still get mad every time I think about
           | the fact that 'utf8' is a broken encoding and they had to
           | make a new one called utf8mb4.
           | 
           | I have to trust my database. If I can't trust the authors of
           | the database, then how do I trust the database?
        
       | 7steps2much wrote:
        | These kinds of stories are really interesting because they
        | provide so much context for questions that I couldn't answer
        | up to now!
        | 
        | Like why my MySQL couldn't create indexes on data I migrated
        | from PostgreSQL. Granted, that ship has long since sailed
        | and we decided to stick with PostgreSQL, but it's nice to
        | know the reason nonetheless!
        
       | superasn wrote:
       | Site was not loading for me so here is an archived version to
       | save you guys a click:
       | 
       | https://web.archive.org/web/20210517182448if_/https://www.gr...
        
       | remus wrote:
       | > All modern popular relational database support (almost)
       | unlimited sized strings with a text or blob-type column, so why
       | not use that? The reason is indexes.
       | 
        | Indexes are not the only reason. Typical databases (MySQL,
        | MSSQL, etc.) are row oriented, so they store rows of data in
        | blocks on disk. A text or blob field is potentially
        | unbounded in size, and if a value does get too big it will
        | typically be moved 'off the row' and replaced with a pointer
        | to the full object. If instead you have a varchar field, its
        | size is bounded and the db can keep all the data in
        | contiguous blocks on disk. Having all your data in
        | contiguous blocks can improve read performance (though it's
        | less of an issue with SSDs).
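        | 
        | Roughly (MySQL/InnoDB, hypothetical table; exact behaviour
        | depends on the row format):
        | 
        |   CREATE TABLE articles (
        |     id    BIGINT PRIMARY KEY,
        |     title VARCHAR(191),  -- bounded; kept inline in the row
        |     body  TEXT           -- unbounded; long values may be
        |                          -- moved off-page via a pointer
        |   ) ROW_FORMAT=DYNAMIC;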
        
         | jeltz wrote:
          | PostgreSQL is also row oriented, but it can store small
          | text and blob (bytea) fields in the main table. PostgreSQL
          | decides per value whether it fits in the main table
          | storage or not. This means that varchar and text are
          | virtually identical.
         | 
         | This is done for all variably sized data types, including third
         | party GIS polygons.
        
           | spookthesunset wrote:
           | Postgres always amazes me how well designed it is. It's too
           | bad MySQL has the brand recognition it does because Postgres
           | is just far and away the better choice for almost everything.
        
       | Twisol wrote:
       | Fascinating! But it replaces one question with another:
       | 
       | > The way innodb MySQL databases worked was that you can only
       | have 767 bytes for an index
       | 
       | Was somebody a Boeing fan or is there another story behind the
       | 767 figure?
       | 
       | EDIT: aaand I figured it out. 256 * 3 is 768, then take away one
       | for the null terminator. Wild.
        
         | dmux wrote:
         | Additional information/investigation I found:
         | https://dba.stackexchange.com/questions/57005/why-are-innodb...
        
         | masklinn wrote:
         | > EDIT: aaand I figured it out. 256 * 3 is 768, then take away
         | one for the null terminator. Wild.
         | 
         | That's the consequence not the cause. The cause is
         | 
         | > If you reduce the InnoDB page size to 8KB or 4KB by
         | specifying the innodb_page_size option when creating the MySQL
         | instance, the maximum length of the index key is lowered
         | proportionally, based on the limit of 3072 bytes for a 16KB
         | page size. That is, the maximum index key length is 1536 bytes
         | when the page size is 8KB, and 768 bytes when the page size is
         | 4KB.
         | 
          | So some ancient MySQL probably used a 4KB page, which only
          | had space for 768-byte index records (probably because of
          | the page structure). When the page size was increased, the
          | key length limit was likely kept.
         | 
         | Incidentally, Postgres' default (BTree) index has a higher-but-
         | not-infinite limit of 2730 bytes. It can be surprising when you
         | index a varchar column without restricting column length and
         | one day inserts start blowing up because someone tried
         | inserting 5k worth of crap in the column.
        
           | Twisol wrote:
           | Nice extra context, thanks!
           | 
            | You've now introduced two more magic constants, though,
            | and so the question continues: where do those come from?
            | Since 3072 is 3 * (16KB / 16), I'm guessing the 3 still
            | comes from the original article's utf8 3-byte encoding
            | origin story. The 1/16th sounds like a balancing factor
            | between the width of an index element and the number of
            | elements per page.
        
       | TonyTrapp wrote:
       | This reminds me of the times when I wanted to do things "the
       | right way" so instead of using phpMyAdmin I switched to using
       | MySQL Workbench (which I no longer use now, it was slow and
       | painful). Its default VARCHAR size was an oddly specific 45
       | characters.
       | 
       | Edit: Seems like someone else was wondering about the same thing,
       | with no clear answer:
       | https://stackoverflow.com/questions/13174307/mysql-workbench...
        
       ___________________________________________________________________
       (page generated 2021-05-17 23:00 UTC)