[HN Gopher] Unexpected downsides of UUID keys in PostgreSQL
       ___________________________________________________________________
        
       Unexpected downsides of UUID keys in PostgreSQL
        
       Author : alexzeitler
       Score  : 227 points
       Date   : 2023-06-22 10:10 UTC (12 hours ago)
        
 (HTM) web link (www.cybertec-postgresql.com)
 (TXT) w3m dump (www.cybertec-postgresql.com)
        
       | robertlagrant wrote:
       | Is this a downside of UUID keys in general, or of using them as
       | indexes? Would it be possible to create a primary index on a
       | combination of inserted date and uuid and get opaque UUIDs _and_
       | good indexing?
        
         | exyi wrote:
         | Yes, it's a downside of using them in b-tree indexes. However,
         | the purpose of having an ID is usually to use it for lookup
         | (from foreign keys, query string arguments, ...). If you create
          | an index on two columns, you won't be able to look up using
          | only the second value, at least not in Postgres.
        
           | robertlagrant wrote:
           | That's interesting - I don't know enough about this! If it's
           | lookup from a foreign key, won't it just go straight to the
           | record? What's the index for there?
        
         | bandrami wrote:
         | The alternative is having two required-unique columns only one
         | of which is definitionally unique. It solves some problems and
         | introduces others.
        
           | rjh29 wrote:
           | You can use UUIDv7 or ULID for that. One value with both the
           | timestamp and random data inside.
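              | A minimal sketch of the idea (not a spec-compliant ULID or
              | UUIDv7 encoder): lead with a 48-bit millisecond timestamp,
              | then random bits, so byte-wise comparison of ids roughly
              | follows creation order.

```python
import os
import time

def time_ordered_id() -> bytes:
    """48-bit Unix-millisecond timestamp followed by 80 random bits,
    so byte-wise comparison of ids roughly follows creation order."""
    ms = int(time.time() * 1000)
    return ms.to_bytes(6, "big") + os.urandom(10)

a = time_ordered_id()
time.sleep(0.002)
b = time_ordered_id()
assert len(a) == len(b) == 16
assert a < b  # the later id sorts after the earlier one
```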
        
             | bandrami wrote:
             | Sorry, right; I meant that's the collation alternative to a
             | multi-field record like v7.
        
       | MPSimmons wrote:
       | I don't find myself dealing with databases in any depth very
       | often, so this was news to me, and I thought it was well written,
       | for what it's worth. Thanks for posting.
        
       | zero_iq wrote:
       | Article should be called "Totally Expected Downsides..."
       | 
       | If you want temporal locality, use ULIDs instead.
        
         | rjh29 wrote:
         | I was going to use UUID with the time portion at the start
         | (also known as UUIDv7) but this looks better.
        
           | gmhafiz wrote:
           | One potential downside is ULID does not have an RFC, unlike
           | UUID V7
        
             | samokhvalov wrote:
             | Hopefully, it will soon have it:
             | https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-
             | rfc...
        
               | OJFord wrote:
               | It's only mentioned there in 'motivation', a study of
               | trends?
        
           | heipei wrote:
           | I've been thinking about converting a legacy application
           | which uses UUIDv4 to use ULIDs instead going forward, but
           | then to represent these ULIDs in a format that is compliant
           | with a UUIDv4. I have not thought through the possible
           | downsides, but I think it should be a pretty straightforward
            | change. Of course old records will remain truly random
            | UUIDv4s, but at least new records will be time-ordered and
            | as such will create less stress on the B-tree index when
            | writing them.
        
       | ilyt wrote:
        | How is any of that "unexpected"? What did the author expect?
        
       | daneel_w wrote:
       | Nothing unexpected about it at all. PostgreSQL's default index in
       | this case is a B-tree, and they don't index disorderly data very
       | well which is the nature of all UUID versions.
       | 
       | On the topic of how important it is for B-trees to be "orderly":
       | https://news.ycombinator.com/item?id=34404641
       | 
       | PostgreSQL can do (among other types) 32-bit hash indices which
       | work out better for certain use cases. Personally I would avoid
       | B-trees for any UUID unless I really had to do partial or ranged
       | scans on them.
        
         | marginalia_nu wrote:
         | > On the topic of how important it is for B-trees to be
         | "orderly": https://news.ycombinator.com/item?id=34404641
         | 
         | I don't think this is the same phenomenon at all, if anything
         | that looks like an implementation problem rather than something
         | that's somehow inherent in B-trees.
         | 
         | Depending on how you construct the tree, it's however still
         | possible to end up with something that's very fragmented and
         | inefficient, but you can always construct a dense b-tree with a
         | space complexity of O(N/(B-1)). To see why you can just lay out
         | the data in a list and manually create the index layers. Each
          | layer will be bounded by N/B, N/B^2, N/B^3, ... (and the sum
          | of 1/B^n over all n >= 1 is 1/(B-1)).
         | 
          | Considering an actual tree, for N=10, B=3:
          | 
          |     1 2 3 | 4 5 6 | 7 8 9 | 10      data
          |      <3   |  <6   |  <9   | <10     index 1
          |          <6       |      <10        index 2
          |                  <10                root
         | 
         | Looking at this, I think it should be apparent that if you
         | change this ordered sequence 1-10 to ten UUIDs ordered in the
         | same way, there would be no change in the structure of the
         | tree.
         | 
         | It would be a bit bigger on disk because UUIDs are bigger than
         | integers, and you may end up with an additional data layer
         | because of that (because you want to align with the disk block
         | size).
         | 
         | One reason you might see a difference even when the
         | implementation makes the appropriate assumptions is that
         | B-trees can do very clever things with somewhat sequential data
          | when it comes to joins, where you can get nearly linear
          | runtimes for the operation. But as mentioned, this only works
         | with relatively ordered data.
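          | The layer sizes in that worked example can be sketched with a
          | hypothetical helper (assuming one separator key per B entries
          | in the layer below):

```python
def index_layer_sizes(n_items: int, b: int) -> list[int]:
    """Sizes of the index layers of a dense, bottom-up-built b-tree:
    each layer holds roughly one separator per B entries below it, so
    the total index size is bounded by N/(B-1)."""
    sizes, n = [], n_items
    while n > 1:
        n = -(-n // b)  # ceil(n / b)
        sizes.append(n)
    return sizes

print(index_layer_sizes(10, 3))  # [4, 2, 1], matching the diagram
```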
        
       | croes wrote:
        | Can't you simply add an additional timestamp column for time-
        | based sorting? No need to raise the collision probability.
        
         | plagiarist wrote:
         | If size isn't an issue it seems so. I know of one
         | implementation that uses wall clock to get a "close enough"
         | sorting https://github.com/segmentio/ksuid
        
       | damethos wrote:
       | Why not just use auto-increment integers and expose them with
       | something like this: https://www.npmjs.com/package/hashids (The
       | linked library is just an example). Yes, you will need to
       | transform the IDs before returning them to an external caller
       | which is a little bit annoying but it's probably not that a big
       | of a deal.
       | 
        | Another approach I've seen, which I personally find a bit
        | complicated, is to use auto-increment primary keys for the
        | internal system and UUIDs for the public facing interactions.
        
       | insanitybit wrote:
       | Another (potentially!) significant problem with uuids is that
       | they're inefficient to store. Compared to an 8 byte value
       | obviously they double the storage size. But it's actually worse -
       | if you have loosely ordered integers, even with gaps, you can
       | compress those down to even less, like practically 1 byte on
       | average. That makes uuids ~16x worse for disk storage, ~2x worse
       | for memory storage (cache).
       | 
       | That + Losing locality can be, for some workloads, a significant
       | loss. Where UUIDs reign supreme for performance is in terms of
       | generation - if you have a super high write load and really high
       | latency requirements it may not be viable to have a single
       | integer counter.
       | 
       | Although, in my own testing, I've found Postgres is more than
       | capable of hundreds of thousands of increments per second and if
       | you're willing to allow for gaps/ interleaving in your counter
       | (slight hit to locality) you're really unlikely to hit a
       | bottleneck.
       | 
        | Sequential UUIDs help and are desirable but they still can't
       | compete on storage :)
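        | An illustrative sketch of the compression claim (delta-plus-
        | varint encoding, the trick column stores like Parquet use in
        | spirit; the helper names are made up):

```python
def varint(n: int) -> bytes:
    """LEB128-style variable-length encoding: 7 payload bits per
    byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def delta_encode(sorted_ids) -> bytes:
    """Store each id as a varint delta from its predecessor."""
    prev, out = 0, bytearray()
    for i in sorted_ids:
        out += varint(i - prev)
        prev = i
    return bytes(out)

ids = list(range(1_000_000, 1_010_000, 3))  # ordered ints with gaps
packed = delta_encode(ids)
print(len(packed) / len(ids))  # ~1 byte per id, vs. 16 for a UUID
```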
        
         | fabian2k wrote:
         | I'm not aware of any situation where Postgres (without
         | extensions) would compress ints that way? That is more of a
         | column-oriented or time-series DB thing to me.
         | 
         | Are there any cases in Postgres where this actually plays a
         | role and sequential ints get compressed?
        
           | insanitybit wrote:
           | Sorry, let me clarify. I used Postgres to generate the
           | counters as external IDs into other data stores (so instead
           | of `give me a new uuid` I said `hey postgres give me a new
           | counter`. Postgres was only used to maintain the counters
           | themselves in a tiny database where compression isn't
           | relevant anyways.
           | 
           | The other data stores where I used this were a combination of
           | Parquet on S3 (where you get compression) and ScyllaDB (where
           | you also get compression).
        
         | insanitybit wrote:
         | Oh yeah, this also makes pagination trivial. You can basically
         | just give your clients a literal number indicating where they
         | are. So good, you basically get paging APIs for free.
        
       | osigurdson wrote:
        | On the one hand, a serial-based id like
        | https://example.com/user/items/55 seems less secure than a UUID-
        | based approach, as if all else fails the UUID is still hard to
        | guess. On the other, some systems (GitHub private repos, for
        | example) don't seem to worry about how easily a URL can be
        | guessed. Is it perhaps the case that, in the final analysis, the
        | probability of a successful attack is not materially reduced by
        | a hard-to-guess URL?
        
       | moi2388 wrote:
       | int key for internal use, guid for external purposes
        
       | seanhunter wrote:
       | I'll genuinely never understand why people benchmark in this way.
       | They are doing a thing that is slow and inefficient in a
       | fundamental way and then producing a blogpost about it being slow
        | and inefficient in a much more trivial respect. If you care
        | about performance,
        | 
        |     select count(uuid) from records;
        | 
        | makes absolutely no sense. You know the uuid is unique so you are
        | deliberately selecting a value and then throwing it away just to
        | count it.
        | 
        |     select count(1) from records;
       | 
       | Would give the same value and could be answered from the index
       | without requiring a scan of the table.
       | 
       | This is true of any unique column in the table no matter what
       | type (it doesn't have to be a uuid). The fact that uuids are a
       | bit slower than other types of keys when you scan the entire
       | table unnecessarily seems beside the point.
        
         | insanitybit wrote:
         | I can't imagine that the author is saying "don't count uuids
         | this way!" but instead is trying to explain how locality works
         | in btree indices, using a trivial example to demonstrate it.
        
         | seanhunter wrote:
         | I should add for people who are not aware, the same concept is
         | very important to bear in mind if there is a WHERE EXISTS or
          | WHERE NOT EXISTS subclause to your query. Something like
          | 
          |     select u.username as good_user
          |     from users u
          |     where not exists (
          |         select 1 from naughty_users n where n.id = u.id
          |     )
         | 
         | ...is generally going to be much much faster than what a lot of
         | people instinctively do which is to select the username or user
         | id from the inner query. Since you're just checking for
         | (non)existence it doesn't matter what the inner query returns
         | so select 1 will answer the query from the index (assuming the
         | id is indexed which if it isn't you have bigger problems).
         | 
         | If you spend a while looking at explain plans for your queries
         | you will spot common patterns like this where you can avoid a
         | table scan often.
        
       | neurostimulant wrote:
       | CouchDB uses sequential UUID by default:
       | https://docs.couchdb.org/en/stable/config/misc.html#uuids-co...
       | 
       | > Impact of UUID choices: the choice of UUID has a significant
       | impact on the layout of the B-tree, prior to compaction.
       | 
       | > For example, using a sequential UUID algorithm while uploading
       | a large batch of documents will avoid the need to rewrite many
       | intermediate B-tree nodes. A random UUID algorithm may require
       | rewriting intermediate nodes on a regular basis, resulting in
        | significantly decreased throughput and wasted disk space
       | due to the append-only B-tree design.
       | 
       | > It is generally recommended to set your own UUIDs, or use the
       | sequential algorithm unless you have a specific need and take
       | into account the likely need for compaction to re-balance the
       | B-tree and reclaim wasted space.
        
       | hn_throwaway_99 wrote:
       | The new/upcoming UUID v7 really solves pretty much all of these
       | problems:
       | 
       | 1. They're ordered by a timestamp, so they preserve DB locality
       | 
       | 2. They include a total of 74 bits of random data. This isn't
       | enough to be used as unguessable keys, but it _does_ offer some
       | good protection if you have other bugs that could otherwise lead
       | to IDOR vulnerabilities.
       | 
       | 3. They're still 16 bytes, but IMO any minor hit to storage/time
       | is completely worth it for the benefits they provide, especially
       | since the other option is usually to have an integer primary key
       | AND another "public ID" column that stores a UUID.
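        | A hand-rolled sketch of that layout (ignoring the version and
        | variant bits a real UUIDv7 encoder must set; the helper names
        | are hypothetical):

```python
import secrets
import time
import uuid

def make_uuid7_like(ms: int) -> uuid.UUID:
    """Sketch of the draft layout: top 48 bits carry a Unix
    millisecond timestamp, the low 80 bits are random here (a real
    encoder would also set the version and variant bits)."""
    return uuid.UUID(int=(ms << 80) | secrets.randbits(80))

def timestamp_ms(u: uuid.UUID) -> int:
    """Recover the creation time from the top 48 bits."""
    return u.int >> 80

now_ms = int(time.time() * 1000)
u = make_uuid7_like(now_ms)
assert timestamp_ms(u) == now_ms
assert make_uuid7_like(now_ms - 1000) < u  # earlier ids sort first
```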
        
       | [deleted]
        
       | [deleted]
        
       | VWWHFSfQ wrote:
       | My go-to pattern for many years now is to use a plain bigint
       | autoincrement column for internal database relations and then a
       | uuid for application-level identifiers and natural keys.
       | Basically never use the uuid as the actual primary key because
       | they're enormous and now the DBMS has to copy that gigantic
       | number to every side of the relation. Don't do it
        
         | sumtechguy wrote:
         | This is a good answer. You want your primary/cluster key to
         | usually be something that makes sense in your data. If you use
          | something like UUID the placement can be basically random. That
          | is usually not a desirable trait in SQL. If you use it for a
          | primary key you are basically saying you are fine with a memory
          | cmp vs a register compare for finding the right data. That
         | can be undesirable too. UUIDs do have very desirable traits
         | such as uniqueness. But that does come at a cost and usually
         | you can fix that by deciding what process decides the source of
         | truth in a different way.
        
         | zzzeek wrote:
         | Same, I'm surprised nobody is mentioning this. All the foreign
         | key columns become uuids also and the DB becomes miserable to
         | deal with.
        
       | londons_explore wrote:
       | Note that if you use something like spanner, you _never want data
       | locality_ , because it will prevent your data being spread
       | effectively between nodes.
       | 
       | So your id's of top level entities should be something random.
        
       | samokhvalov wrote:
       | We discussed some of these problems here during a Postgres.tv
       | session, and developed a patch to support ULID / UUIDv7:
       | https://www.youtube.com/watch?v=YPq_hiOE-N8
       | 
       | And then Andrey proposed a patch in the pgsql-hackers mailing
       | list: https://www.postgresql.org/message-
       | id/flat/CAAhFRxitJv%3DyoG...,
       | https://commitfest.postgresql.org/43/4388/
       | 
       | Everyone who can help (test, discuss, etc.) - please participate
       | in that discussion.
       | 
        | The standard is not finalized yet, but there is some expectation
        | that it will be; if that happens, it would be great to have this
        | in a future Postgres 17.
        
       | coding123 wrote:
       | Use flakeids
        
       | penciltwirler wrote:
       | Have a look at ULIDs
        
       | pachico wrote:
       | How can this be unexpected? Isn't this what Percona discussed for
       | MySQL already several years ago?
       | 
       | Or am I missing something?
        
         | Tehnix wrote:
          | The number of people with actual experience scaling databases
          | is _very_ small compared to the number of e.g. Backend
          | Engineers that end up implementing the solutions.
         | 
         | I think a root cause is that there's almost never a case for
         | preferring UUIDv4 over ULID/UUIDv7, but we still see almost all
         | learning material reach for the former. So people only find out
         | once they have a table that exhibits weird/low performance and
         | suddenly need to google for specific performance problems
         | they've never encountered before.
         | 
         | Similarly, efficiently using indexes is also a very common
         | thing I see people not know much about, since with SQL
         | "everything performs well until it doesn't" (i.e. you reach
         | scale) :)
        
         | indymike wrote:
         | > How can this be unexpected?
         | 
         | If you routinely work with junior developers this will come up
         | because UUID4 seems miraculous at first. There's always a bit
         | of surprise when the question gets asked about why we have so
         | many sequential keys when we could just use UUID4...
        
         | bguebert wrote:
         | They suggest to use UUID v7 at the end for more sequential
         | keys. Maybe that's the new part since v7 is relatively recent.
        
           | mattmanser wrote:
           | MSSQL has had sequential GUIDs for over a decade to solve
           | this very problem, surprised it wasn't done sooner.
        
             | evntdrvn wrote:
             | their format is pretty funky though lol
        
       | frodowtf wrote:
       | With some basic knowledge about how a db stores data, there is
       | nothing really "unexpected" about this behavior. It's weird
       | though how little devs actually know or care about it.
        
       | JoeAltmaier wrote:
       | The correlation of locality and sequential keys was a
       | coincidence. It's unfortunate that folks rely on this coincidence
       | for performance. Databases can be made with better algorithms now
       | that UUIDs are common. It would be a mistake to enshrine
       | sequential indexes as some special optimization, just to be blown
       | away later when better algorithms get used.
        
       | aarondf wrote:
       | I did a video covering choosing a primary key type for MySQL [1],
       | including UUIDs.
       | 
       | Long story short: you should probably use a bigint unless you
       | have a pretty good reason not to. Good reasons exist! If you do
       | use a UUID, make sure you use a version that is time-sorted.
       | 
       | 1: https://planetscale.com/courses/mysql-for-
       | developers/indexes...
        
       | xmdx wrote:
       | That's what created_at and updated_at are for surely
        
         | sam_lowry_ wrote:
         | From the ActiveRecord fame ;-)
        
         | yesimahuman wrote:
         | I was going to ask, when would you only rely on a sequential id
         | for temporal locality, wouldn't most people be indexing and
         | sorting on `created_at` or something equivalent?
        
           | Tehnix wrote:
            | You don't need temporal elements; the simple fact that
            | UUIDs are a bad fit for B-trees means that merely inserting
            | a bunch of data keyed on a UUIDv4 will run into these same
            | problems.
           | 
           | You'd typically see this as either lower performance of a
           | table than you expect, or higher IOPS usage of your database
           | (which gets expensive at scale).
        
       | gregwebs wrote:
       | UUID is also used to avoid leaking information about the
       | underlying system. This includes temporal information that could
       | be used to infer the size of the dataset for all users.
       | 
       | If this isn't a concern, then using a timestamp based approach as
       | recommended in this article is a good approach. That is the
       | default in MongoDB.
       | 
       | If it is a concern, one approach is to use random UUIDs to give
       | to end users but then internally to have something like auto
       | increment ids.
        
         | gmhafiz wrote:
         | Expanding on temporal information leak
         | 
         | 1. Size: If a client receives a record with id=10004578, they
         | can guess that 4578 orders have been made.
         | 
         | 2. Rate of growth: Receiving two different orders means they
         | can track the growth rate of record insertion.
         | 
         | And also
         | 
         | Iteration attack: If your API endpoints do not have
         | authorization, an attacker can try to access with GET
         | /api/users/1, GET /api/users/2, GET /api/users/3, etc. UUID
         | makes this next to impossible.
        
         | WeAddValue wrote:
         | Or use hashids.org. Use it to obfuscate your integer auto
         | increment ids anywhere a user might see it.
        
           | VWWHFSfQ wrote:
           | It's a good option purely for cosmetics but don't rely on it
           | for any kind of serious obscurity since it's trivially
           | reversed. I've used it to great effect in the past to encode
           | multiple integer values like start/end ID, sort order (0/1),
           | etc. for cursor-based pagination. But that's only because
           | there's nothing secret in those numbers. Just purely for
           | convenience.
        
             | willvarfar wrote:
             | I once heard of some fancy data science that inferred that
             | the company was the supplier's major customer from the gaps
             | in the sequential ids they saw when using that supplier's
             | API. That was useful in negotiations.
             | 
             | And so the company that did that data science realised they
             | too were susceptible to exactly the same 'attack'. So they
             | created a system to obscure the ids they were themselves
             | exposing to their customers, using some cheap cut-down
             | tea64 encryption iirc. My memory is it never went live,
             | though.
        
               | theluketaylor wrote:
               | Not that fancy (or at least not that novel). In WWII the
               | allies used sequential serial numbers on destroyed and
               | captured tanks to have a frighteningly accurate picture
               | of German industrial output.
               | 
               | https://en.wikipedia.org/wiki/German_tank_problem
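                | The classic estimator from that article, as a one-liner
                | (the serial numbers below are made-up sample data):

```python
def estimate_population(serials) -> float:
    """Minimum-variance unbiased estimator for the German tank
    problem: N_hat = m + m/k - 1, with m the largest serial seen
    and k the number of samples."""
    k, m = len(serials), max(serials)
    return m + m / k - 1

print(estimate_population([19, 40, 42, 60]))  # 74.0
```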
        
             | code_biologist wrote:
             | Salting your hash should work?
             | 
             | You could also use a 32 or 64 bit block cipher like skip32
             | if you want to prevent reversal. Or at least, it makes
             | reversing non-trivial.
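              | A toy sketch of that Feistel idea (not skip32 itself; the
              | hash-based round function here is an assumption for the
              | demo):

```python
import hashlib

def _round_fn(half: int, key: bytes, i: int) -> int:
    """Hypothetical round function: hash the key, round number, and
    16-bit half down to 16 bits."""
    digest = hashlib.sha256(key + bytes([i]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")

def feistel32(n: int, key: bytes, decrypt: bool = False) -> int:
    """Toy 4-round Feistel network over 32-bit ints, the same shape
    skip32 implements properly. A Feistel network is a bijection, so
    every id maps to a unique obfuscated value and back again."""
    left, right = n >> 16, n & 0xFFFF
    rounds = range(3, -1, -1) if decrypt else range(4)
    for i in rounds:
        left, right = right, left ^ _round_fn(right, key, i)
    return (right << 16) | left  # undo the final swap

key = b"per-deployment secret"
obfuscated = feistel32(12345, key)
assert feistel32(obfuscated, key, decrypt=True) == 12345
```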
        
               | VWWHFSfQ wrote:
               | This is a common misconception. hashids is _not_ an
               | encryption algorithm. It's just an encoding with a tiny
               | bit of obscurity layered in. hashids.org:
               | 
               | > Do not encode sensitive data. This includes sensitive
               | integers, like numeric passwords or PIN numbers. This is
               | not a true encryption algorithm. There are people that
               | dedicate their lives to cryptography and there are plenty
               | of more appropriate algorithms: bcrypt, md5, aes, sha1,
               | blowfish.
        
         | babbledabbler wrote:
         | I actually have been using a combination of a numeric integer
         | ID as the PK and a UUID as a lookup field to do routing lookups
         | etc. for this purpose in Postgres backed app I'm working on.
         | 
         | I found this approach to be more trouble than it's worth and
         | plan on switching to a UUID PK key and doing away with the
         | integer sequence.
         | 
         | Here are the complications I ran into:
         | 
         | The libraries I'm using for the ORM and API are designed to
         | work with a primary key for single record get access. For
         | example they want you to do `resource.get(ID)` where ID is the
         | primary key, however, I now have to do `resource.find({ where:
         | { uuid: 'myuuid' }})`. This is for all resources on all pages.
         | 
         | In Postgres, the integer PK sequence has a state that keeps
         | track of what number it's at. In certain circumstances it can
         | get out of sequence and this can trip up migrations.
         | 
         | We already have created_at and updated_at fields and these are
         | probably better for ordering than the sequencing.
         | 
          | Had I to do it again, I would just use UUIDv4 until it runs
          | into issues, then add either date fields or a sequence where
          | necessary. If anyone has better ideas I would be most grateful
          | as this is something I go back and forth on.
        
           | bavell wrote:
           | As someone who has been on the fence about this for awhile
           | now, thank you for your valuable anecdote!
        
           | moonchrome wrote:
           | That sounds like a very simplistic framework and I'm sure you
           | could do some metaprogramming to abstract boilerplate. Like
           | you couldn't do database multitenancy with those constraints.
           | 
           | I've used the int keys and UUID public keys on multiple
           | projects - it wasn't an issue for EF core or RoR
        
             | babbledabbler wrote:
             | I'm simplifying a bit for brevity and we can do some
             | abstraction to handle it so it's not that the framework is
             | simplistic. I'm just having trouble justifying adding this
             | complexity with two types of IDs.
        
           | gregwebs wrote:
           | I personally find the numeric id extremely valuable for
           | internal data analysis and sharing. I can refer to rows by
           | the numeric id, including a range of rows, and seeing the ids
            | gives exactly that intuitive information about its relation
           | in the set that we are hiding from end users. Numeric ids can
           | also be used for the same reason in an admin-only UI.
           | 
           | On the efficiency side, joining and querying by id is
           | generally more efficient on CPU usage for querying, but you
           | do have to pay the cost of having the additional column and
           | index.
        
             | babbledabbler wrote:
             | This was generally the reason I went with numeric ID as PK
             | originally. It makes working with and analyzing the data as
             | well as cross referencing relations easier.
             | 
             | For all my tables I have a base schema that looks something
             | like this.
             | 
              | id: integer sequence PK
              | uuid: UUIDv4
              | created_at: datetime
              | updated_at: datetime
             | 
             | The concern I have is when I have to distribute my system
             | when scaling. Those numeric IDs will have to be replaced
             | with the UUIDs so I figure I might as well do it now.
        
               | gregwebs wrote:
               | Everything breaks at scale. In my experience most tables
               | don't end up with more than a few million rows and will
               | work fine with this. If you did want to transition a
               | large table to be UUID only, the nice thing about this
               | approach is that you could do it with no down time. If
               | you are using a DB that only scales writes vertically
               | though (most DBs, including distributed DBs) then how are
               | you actually going to scale the DB layer horizontally?
               | Pretty much just CRDB (PG) or TiDB (MySQL) are the
               | options there- look at their docs for how to setup your
               | ids.
        
               | babbledabbler wrote:
               | I'm not so much concerned with figuring out scaling in
               | terms of volume as I expect to be able to handle millions
               | of rows in a single DB and that would be an
               | implementation detail and fine tuning. I'm more concerned
               | about scaling in terms of complexity and keeping the
               | system easy to reason about when more people, tech are
               | involved.
               | 
               | Lets say I have a <CAR>-[1:N]-<TRIP> in two tables in a
               | relational DB. This works fine at first even for millions
               | of rows as you say.
               | 
               | At some point in the future it makes sense to have these
               | two entities managed by different team/services/db. Let's
               | say TRIP becomes a whole feature laden thing with fares,
               | hotels, itinerary, dates. So I need to take this local
               | relation and move it to different services and different
               | DB.
               | 
               | If I had been using an integer PK/FK this would be a more
               | complicated migration than if I used UUIDs.
               | 
               | My assumption is that we would not want to have a
               | sequenced integer key used in a distributed system.
               | 
               | In other words it seems safer bet if there's a
               | possibility of needing to move to a distributed system to
               | use a UUID for the key from the beginning.
        
               | gregwebs wrote:
               | I think switching this with zero downtime to do foreign
               | key references with UUIDs will be easier than any of the
               | pain you would deal with from having to do cross-DB
               | joins.
               | 
                | What specific issues are you worried about with the
                | integer key? Usually the issue is dumping data into
                | something like a staging or development environment
                | rather than a production concern. If you attempt to dump
                | 2 datasets into one db you will have a conflict. Or if
                | you write to an environment and then dump into it, you
                | will have a conflict.
        
               | babbledabbler wrote:
               | Mainly portability of data and options for the future.
               | I'm all on one postgres instance right now and don't plan
               | on breaking it up until necessary. If at some point I
               | need to take a table and move it to another type of
               | database I want that migration process to be
                | straightforward. If I have integer keys with sequencing
                | behavior, I anticipate that porting being painful. That
               | internal key would then become external to do the lookup
               | and if it's an external key I want it to be UUID for
               | security as well. Integers as IDs are guessable so I want
               | to keep them internal.
        
           | damethos wrote:
           | > Had I to do it again, I would just use UUIDv4 until it runs
           | into issues and either date fields or a sequence where
           | necessary
           | 
           | How soon is soon? And how would you handle it if you already
            | had all PKs as UUIDs?
        
             | babbledabbler wrote:
             | Probably when there is some downtime or as the opportunity
             | presents.
             | 
             | I would use a date field to do ordering or a sequenced
             | field when needed.
        
           | 3pt14159 wrote:
           | Well, I'll give another take.
           | 
           | If you do not have two separate forms of identifier AND you
           | have a "public" API (including basically any client apps,
           | frontend JS, or anything found in query params) then you are
           | making compliance with European regulators a massive headache
           | when it comes to erasure of PII, since shared identifiers
           | must be destroyed one way or another. Trying to merely delete
           | the records is complicated by the fact that you need legal
           | holds to comply with various federal laws.
           | 
           | Between the work involved in using two identifiers, one for
           | joins and one for external lookups, versus the work involved
           | in manually coding up all sorts of erasure work arounds,
           | something I was in charge of in the past, I would strongly
           | consider just using two IDs.
           | 
           | If your ORM gets in the way, just modify the ORM. This is
           | easier than you'd think. For example, in Django just make a
           | helpers module with something like this:
            |     class OurModel(Model):
            |         def get(...):
            |             ...
            | 
            | And have some sort of programmatic way (lint, etc) of ensuring
           | that your models.py doesn't use the stock class. It's
           | simpler. It ruffles some feathers at first, but if your
           | framework is getting in the way of a real use case just
           | change the framework and don't worry about it.
        
             | babbledabbler wrote:
             | I think you raise some good questions around IDs and PII
             | and we definitely will be tackling GDPR sooner or later.
             | 
              | I don't quite follow the European regulation issues
              | raised by using a UUID in a route when it is also the PK
              | of the record.
             | 
             | I know you should not expose PII or any information that
             | can be used to identify a person, however, in our case any
             | route is behind an authed login on an SSL connection which
             | encrypts the path (we don't use query params).
             | 
             | The only place that contains data that ties a UUID to a
             | person is in the database. This would be the case whether
             | we used a PK as an integer or not.
             | 
             | Could you elaborate or share any resources around dual IDs
             | DB design for PII compliance? That would be super helpful.
             | 
             | Regarding framework hacking or workarounds, I have a
             | principle to not go against the grain of a framework. The
             | reason for this is that modifying/hacking adds complexity
             | when building on top of it or onboarding other software
             | engineers. If necessary I'll do it as a last resort.
        
               | 3pt14159 wrote:
               | > any route is behind an authed login on an SSL
               | connection which encrypts the path
               | 
               | If your application only services a single user with
               | their own resources then you have nothing to fear. But
               | few applications meet this definition. If, for example,
               | you're running an invoicing application, then at some
               | point you'll want to share some resource, say an invoice
               | or an expense or a time sheet, with another party. If
               | your API exposes the identifiers from one resource to
               | another, or even a user's id when potentially adding them
               | to a team, then these identifiers are considered PII
               | according to European regulators.
               | 
               | I understand that this is frustrating, but it comes from
               | a posture that prioritizes right-to-be-forgotten over
               | programmer ergonomics. Imagine, for example, API crawlers
               | that hit your /search endpoint with email=[some
               | predetermined list of emails] and harvest user ids to
               | match with future data.
               | 
               | In the end, the best thing you can do is keep join keys
               | internal and API keys separated. There are other
               | workarounds, but they're so much trouble that they aren't
               | really viable alternatives. Now, whether you use UUIDs
               | for both identifiers or UUID for external and integer ids
               | for join keys is up to you and your performance and
               | scaling requirements. Personally, I prefer integer keys
               | for internal unless I really expect the database to grow
               | to more than 200m rows before the company hits 1000
               | people, since int ids mean you do not need secondary
               | indexes on things like the created_at fields, but even
               | there, it's not such a big deal to have an extra index on
               | every table.
               | 
                | > I have a principle to not go against the grain of a
                | framework.
                | 
                | > hacking adds complexity
               | 
               | Here we essentially agree, but with the right integration
               | tests, upgrading and onboarding is a lot easier than
               | feared. That said, do not add to the framework unless the
               | benefit is worth it.
        
         | jandrewrogers wrote:
         | Organizations that care about leaking information encrypt their
         | identifiers as UUIDs. UUIDs conveniently have the same size as
         | a single AES block, for which there is dedicated silicon on all
         | modern CPUs. There is almost no overhead.
         | 
          | Every other issue with random UUIDs etc., which is ignored
          | here, is solved by encrypting your identifier. Random UUIDs
          | (i.e. UUIDv4) are banned in many places for good reasons.
        
         | hbrn wrote:
         | Key type is not the right layer for solving information leak
         | (and incorrect joins, mentioned above).
         | 
         | Use hashids to avoid the leak, and as a bonus your client-
         | facing keys will be short and easily copy-pastable.
        
         | jmull wrote:
         | In my experience internal auto-increment ids and external
         | random UUID works well.
        
         | CGamesPlay wrote:
         | How could temporal information be used to infer the size of the
         | dataset for all users? Specifically, more accurately than "I
         | now know that this database was created in 2022".
        
           | iwwr wrote:
           | Sequential serial numbers reveal information. See:
           | https://en.m.wikipedia.org/wiki/German_tank_problem
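The linked idea can be sketched concretely: given k leaked sequential serial numbers, the standard minimum-variance unbiased estimator for the population size is m + m/k - 1, where m is the largest serial observed (the numbers below are purely illustrative).

```python
# German tank problem: estimate total population size N from a handful
# of observed sequential serial numbers (e.g. leaked integer order IDs).
def estimate_population(serials: list[int]) -> float:
    k = len(serials)          # number of observed serials
    m = max(serials)          # largest serial seen
    return m + m / k - 1      # minimum-variance unbiased estimator

# Four leaked order IDs reveal roughly how many orders exist in total.
print(estimate_population([19, 40, 42, 60]))  # 74.0
```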
        
           | tgv wrote:
           | I use uuids in links. Using sequential ids gives people
           | access to information they shouldn't see. Unfortunately, the
           | link has to be publicly accessible. Then UUIDs become a very
           | practical key for your records.
           | 
           | That said, the number of times those records are accessed is
           | low, so there are no performance considerations.
        
       | jongjong wrote:
       | I can't think of many use cases where I would sacrifice the
       | beauty and elegance of UUIDs to optimize access times by a
       | millisecond or two. UUID is totally worth the cost.
       | 
       | UUID actually performs much better than I thought based on the
       | author's example with a COUNT query. COUNT queries aren't very
       | efficient because typically, all records are traversed; here
       | we're talking about a 50% slowdown on 10 million records... How
       | often do you need to access 10 million contiguous records? Most
       | of the time, for user-facing apps, you'll be accessing 100
       | contiguous records at most and, in fact, most queries will access
       | a single record... I suspect that the average per-query
       | performance loss for a typical app is probably less than 5%.
       | Also, you could simply index by a separate date/timestamp field
       | if you need records to be ordered by time and probably won't
       | incur any performance cost.
       | 
       | IMO, unless you're building an app for high-frequency trading,
       | auto-incrementing IDs aren't worth the pain and lack of
       | flexibility.
        
         | danpalmer wrote:
         | > I can't think of many use cases where I would sacrifice the
         | beauty and elegance of UUIDs...
         | 
         | I find auto-incrementing IDs far more elegant in many ways.
         | They are much easier to make sense of for users, they give you
         | some loose metadata (ordering, sometimes a rough time range)
         | which can be handy in debugging stuff.
         | 
         | At my previous workplace most things were auto-incrementing
         | IDs, and while we got bitten by them a few times, there was
         | significant debugging value in seeing a user ID as an integer.
         | Internally, order IDs and customer support ticket IDs were also
         | auto-incrementing, but we offset them so that even numbers were
         | orders and odd numbers were support tickets, and this helped
         | quite a bit with debugging or even just non-technical users
         | sending each other references. Not that this can't be done with
         | prefixes on UUIDs with a bit of extra work.
         | 
         | I wouldn't necessarily recommend auto-incrementing IDs or UUIDs
         | over the other, but I don't think one is more elegant at all.
         | 
         | > to optimize access times by a millisecond or two. UUID is
         | totally worth the cost.
         | 
         | 2ms * 20 queries is 40ms, which could take a typical page load
         | for a web app from 150ms to 190ms, which is quite a regression.
        
           | biztos wrote:
           | I'm a big fan of SERIAL and BIGSERIAL, for
           | aesthetic/cognitive reasons more than performance, but it can
           | be really nice to have a data-generating process generate the
           | primary ID of a piece of data, which you can then log before
           | you put it in the database and look up the same ID everywhere
           | when debugging.
           | 
           | Not to mention the distributed database situation.
           | 
           | In fact I like that enough that I sometimes use ULIDs for
           | secondary unique keys just so I have something to log, even
           | if the primary key is numeric.
           | 
           | It's nice that you can store ULIDs in Postgres natively as
           | UUIDs, and it's really nice to have a timestamp embedded in
           | the ULID if you ever need it... but it's also really tempting
           | to use it in public-facing stuff and thus leak your creation
           | timestamp.
        
             | danpalmer wrote:
             | Absolutely, these are also good things to take into
             | account. There are really situations for all of these
             | things, and this is why I don't like the approach of trying
             | to "solve" this problem. Lots of people say "UUIDs solve
             | IDs", or "UUIDs cause too many issues to use", or "ULIDs
             | solve UUID issues", and in reality this is just another
             | place where there are trade-offs, and what works for one
             | company/project/dbms/table may not work for another, or may
             | even cease to work over time as requirements or scale
             | change.
        
         | jongjong wrote:
          | Recently, I wrote a client-side wallet app for a cryptocurrency
          | use case, and it required transactions to be created and signed
         | entirely on the client-side and so the transaction UUID had to
         | be created on the client side (before signing) and I was
         | surprised at how elegant and simple the application's logic
         | turned out to be (both on the front end and back end).
         | 
         | One of the best things about UUIDs is that if your front end
         | tries to create a resource and fails due to a bad connection,
         | you can simply resend that exact same resource for creation
         | again (with the same UUID) and you don't need to worry about
         | duplicate records being inserted into the database since you
         | would get an ID collision; it results in much simpler/cleaner
         | code which is idempotent and deterministic by default.
         | 
         | With auto-incrementing IDs, if there is a connection issue or
         | other failure, it's not possible to cleanly figure out if a
         | resource was already created or not since only the database
         | knows what the ID of that resource was until that ID is sent
         | back and reaches the client across the network.
         | 
         | Just because the client did not receive a successful response
         | from the server, it does not mean that the resource was not
         | created; and if it was, you cannot know since the client has no
         | way of referencing that resource in a reliable way. It just
         | seems completely wrong that the client (which created the
         | resource) cannot reference it until the database has inserted
         | it.
         | 
         | It's a hack to pretend that resource creation starts in the
         | database, when in fact, it starts on the front end.
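The retry-safe pattern described above can be sketched in a few lines; SQLite is used here for brevity, but the `ON CONFLICT ... DO NOTHING` clause is the same in Postgres (table and column names are illustrative).

```python
import sqlite3
import uuid

# In-memory DB standing in for the real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER)")

def create_payment(payment_id: str, amount: int) -> None:
    # The client generated payment_id *before* the first attempt, so a
    # retry after a dropped connection is a no-op, not a duplicate row.
    db.execute(
        "INSERT INTO payments (id, amount) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (payment_id, amount),
    )

pid = str(uuid.uuid4())   # generated client-side, before any request
create_payment(pid, 100)  # first attempt
create_payment(pid, 100)  # retry after a timeout: silently deduplicated
print(db.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 1
```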
        
           | [deleted]
        
           | gradys wrote:
           | You can achieve this without UUID keys if you attach UUIDs to
           | the requests. This has the advantage of working for all kinds
           | of mutating requests, not just creates, but requires that you
           | store these request UUIDs for some period of time as well.
        
             | jongjong wrote:
             | It's still not as elegant because it doesn't work in a
             | multi-process or multi-server back end; if the request
             | fails, you need to make sure that the retry request will
             | hit the same server process (assuming you're keeping the
             | UUIDs in memory on the server). You can use some in-memory
             | data store like Redis to share the request UUIDs across
             | multiple processes/hosts but that can add a lot more
             | complexity, latency and risks and is just not worth it IMO.
             | 
             | Even in a single-host setup, it can be a problem because
             | what happens if your process crashes and restarts just
             | after a resource was created in the database but before the
             | ID was sent to the client (with success message)? You would
             | end up with a duplicate record in your DB after the retry
             | since your newly restarted server would not have the UUID
             | in its memory (even though the resource was in fact already
             | created on the server a few milliseconds before the last
             | crash).
             | 
             | With the Redis (or similar) solution, you need to make sure
             | that the request UUIDs expire after some time and are
             | cleaned up to avoid memory bloat which is a pain... I mean
             | that complex solution probably uses up a lot more resources
             | than just using UUIDs in the database as IDs.
        
               | ramchip wrote:
               | You can use the DB for this as well, just make a table
               | e.g. "requests" with two columns, the user and the
               | request token. Old entries can be purged on a cron job.
               | 
               | If you add a column to store params as well then you can
               | also do better validation:
               | 
               | > Responding when a customer changes request parameters
               | on a subsequent call where the client request ID stays
               | the same
               | 
               | > We design our APIs to allow our customers to explicitly
               | state their intent. Take the situation where we receive a
               | unique client request token that we have seen before, but
               | there is a parameter combination that is different from
               | the earlier request. We find that it is safest to assume
               | that the customer intended a different outcome, and that
               | this might not be the same request. In response to this
               | situation, we return a validation error indicating a
               | parameter mismatch between idempotent requests. To
               | support this deep validation, we also store the
               | parameters used to make the initial request along with
               | the client request identifier.
               | 
               | https://aws.amazon.com/builders-library/making-retries-
               | safe-...
        
               | jongjong wrote:
               | You can but you add an additional database lookup step
               | which will cost you more in performance (and latency)
               | than just using UUIDs as the ID directly with a single
               | table.
        
               | ramchip wrote:
               | It's only at creation time. When you lookup the record
               | after that you can use the ID directly, and then you
               | don't hit the locality problems from the article.
        
         | whack wrote:
         | > _The moral of the story is that data locality matters, and it
         | can pop up in the most surprising of places. Using random is
         | typically the worst thing you can do for locality, so if you
         | want to use UUID's, try to use a sequential variant. UUID v7 is
         | a good option_
         | 
         | The author isn't suggesting abandoning UUIDs, he is suggesting
         | using something like UUIDv7 which preserves locality. For most
         | use-cases, this seems like a very reasonable recommendation
         | with no significant downsides I can think of.
        
           | [deleted]
        
           | fauigerzigerk wrote:
           | I agree that UUIDv7 will usually be the best choice. The one
           | downside I can think of is that it may sometimes be necessary
           | to let people know the identity of a thing without also
           | telling them exactly when that thing was created.
        
             | remram wrote:
             | Encrypt it. Block ciphers are bijective so they will turn
             | sequential IDs into unique random IDs in a way you can
             | easily reverse.
             | 
             | You can also do that to get opaque identifiers from an
             | auto-increment primary key.
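The bijection argument can be illustrated with a toy 4-round Feistel network over 64-bit values: sequential IDs go in, opaque-looking but fully reversible IDs come out. This is only a sketch; in production you would use AES on a 128-bit block with a proper secret key, as suggested above.

```python
import hashlib

KEY = b"demo-key"  # illustrative secret; never hard-code a real key

def _round(half: int, r: int) -> int:
    # Keyed round function: any pseudorandom function of (key, round, half).
    data = KEY + r.to_bytes(1, "big") + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def encrypt_id(n: int) -> int:
    left, right = n >> 32, n & 0xFFFFFFFF
    for r in range(4):
        left, right = right, left ^ _round(right, r)
    return (left << 32) | right

def decrypt_id(n: int) -> int:
    left, right = n >> 32, n & 0xFFFFFFFF
    for r in reversed(range(4)):
        left, right = right ^ _round(left, r), left
    return (left << 32) | right

# Sequential in, scattered out -- and the mapping is exactly invertible,
# so no collision handling or lookup table is ever needed.
opaque = [encrypt_id(i) for i in range(5)]
assert all(decrypt_id(encrypt_id(i)) == i for i in range(1000))
```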
        
               | fauigerzigerk wrote:
               | Seems awfully complicated to me.
        
             | pkilgore wrote:
             | COMB UUIDs can help here.
        
               | fauigerzigerk wrote:
               | Why do COMB UUIDs not have this problem? Do they not
               | contain a timestamp?
        
         | [deleted]
        
         | dewey wrote:
         | > sacrifice the beauty and elegance of UUIDs
         | 
         | Out of curiosity, what makes UUIDs more elegant or beautiful
         | than just plain old integers?
        
           | maxgee wrote:
           | Nothing. Integers are simpler and faster. They're also
           | incorrect about COUNTs requiring a full record scan.
        
             | xkqd wrote:
             | How are integers simpler? Even generation of unique
             | integers is more complicated, let alone migrating them or
             | moving them around.
        
             | catlifeonmars wrote:
              | 1. You have to handle race conditions if there is ever
              | more than one process writing to the DB at one time.
              | 
              | 2. It's ambiguous whether a number is an ID, e.g. when
              | looking at logs.
              | 
              | 3. Numbers leak information about table size, row age,
              | and crucially _future_ row IDs.
        
             | phito wrote:
             | Nothing? Really? Have you ever thought about that?
        
           | ceeam wrote:
           | Their uniqueness of course.
           | 
           | > To be able to generate keys independently of the database
           | 
           | > To move sets of related records between different databases
           | without having to deal with renumbering everything
        
             | yes_man wrote:
              | And greppability of logs in debugging situations. You can
              | just search the logs of any of your systems for the UUID
              | and find exactly the records you are dealing with, whereas
              | with integers you will get all kinds of unrelated hits.
        
               | phkahler wrote:
               | Because UUIDs are _universally_ Unique IDentifiers.
        
             | jandrewrogers wrote:
             | Unfortunately, UUIDs aren't as universally unique in
             | practice as people think. The standard flavors are banned
             | in quite a few organizations that care a lot about data
             | integrity for valid reasons. Standard UUIDs work for _some_
             | applications but definitely not all.
             | 
             | For most applications you are better off using a sensible
             | structured key, like incrementing an integer or similar,
             | and encrypting it if you want to obscure it. Encrypting a
             | 128-bit value is approximately free on modern CPUs.
        
           | kogus wrote:
           | "5" might be the key for a thousand different records in your
           | database. A uuid is a key for exactly one. Integer keys
           | permit wrong joins to have the appearance of working.
        
             | FroshKiller wrote:
             | Crucially, "5" might also be the key to a thousand
             | different records in dozens of other development & QA
             | databases for applications with integrations configured
             | incorrectly. Integers can match coincidentally. UUIDs
             | won't.
        
             | magicalhippo wrote:
             | While wrong joins is a weak argument IMHO, our customers
             | somewhat frequently ask to merge data, either from test to
             | prod or from one company to another (after acquisition
             | f.ex.).
             | 
             | In a lot of cases that can be a pain due to overlapping
             | integer primary keys and large parent-child table sets.
        
             | antifa wrote:
             | You can tell all your tables to use the same sequence.
        
             | totalhack wrote:
             | You can incorrectly join on any columns you want. That's
             | not really the fault of the column types IMO, that's a
             | problem in the layers above.
        
               | dpcx wrote:
               | True, but you can't incorrectly join on UUID due to its
               | anti-collision nature.
        
               | ibash wrote:
                | Yes you can; you just won't get the correct result set.
        
               | OJFord wrote:
               | As sibling says, you can make exactly the same mistake
               | and it not colliding just means no join - but if it's an
               | outer join and one of many joins or in a nested query it
               | might not be immediately obvious it's happened, or what
               | the issue is even if it is.
        
             | smolder wrote:
             | We have some app generated internal IDs which have a 1
             | character prefix by type and then base64 sequential stuff
             | after that. I'll admit it's nice knowing the type of an
             | identifier by looking at it. I don't really like that they
             | are strings, however. The original architect made some
             | weird choices as far as deciding where and where not to
             | optimize for performance.
        
             | Phil_Latio wrote:
             | [dead]
        
               | allknowingfrog wrote:
               | That's not a very constructive response. UUIDs do provide
               | global uniqueness, which can occasionally make mistakes
               | more obvious. Do you have a counter argument to support
               | your strong objection?
        
           | Raz2 wrote:
           | > To be able to generate keys independently of the database
           | 
           | This is a must if you don't want to couple your domain logic
           | and DB. I want to generate a record with an ID inside my
            | domain layer without depending on the DB and doing an awkward
            | DB save then DB read round-trip.
        
         | pgaddict wrote:
         | It is not a matter of a couple milliseconds.
         | 
         | The loss of locality for reads is bad, especially for data sets
         | that don't fit into cache / RAM (while the active set would).
         | 
         | Where it really bites you is writes, because it can trigger
         | pretty massive write amplification.
         | 
         | Imagine you have 128 GB index on UUID column, that's ~16M pages
          | (8kB) and insert 1M random values. Congrats! You've probably
          | just written 8GB to the WAL, because of FPW and stuff like
          | that.
         | With serial IDs we'd write a fraction of that. It doesn't take
         | much to hit max_wal_size and trigger a checkpoint, starting a
         | new cycle with FPWs. Got a replica? Well, now you need to send
          | the WAL over the network. Is the bandwidth limited (another
          | DC?), sorry to hear that. Is the replica synchronous, so you
          | have to wait? Well, that's unfortunate.
         | 
         | In other words, the lack of locality seems like a detail but at
         | scale it's actually a damn huge deal.
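The back-of-envelope numbers above check out: with uniformly random inserts, the expected number of distinct leaf pages touched is pages * (1 - e^(-inserts/pages)), and each first touch after a checkpoint costs a full-page image in the WAL.

```python
import math

pages = (128 * 1024**3) // 8192   # 128 GB index / 8kB pages ~= 16.8M pages
inserts = 1_000_000               # random UUID inserts in one checkpoint cycle

# Expected number of distinct pages hit by uniformly random inserts
# (classic occupancy / "balls into bins" approximation).
touched = pages * (1 - math.exp(-inserts / pages))

fpi_gib = touched * 8192 / 1024**3
print(round(fpi_gib, 1))  # ~7.4 GiB of full-page images alone
```

With sequential keys the same million inserts land on a few thousand adjacent leaf pages, which is why the WAL volume drops to a small fraction.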
        
           | qaq wrote:
           | this sounds horrible until you realise that a modern SSD will
           | write that in about 1.5 sec.
        
             | pgaddict wrote:
             | That was just an example calculation, to illustrate the
             | write amplification factor, of course. You can scale it up
             | pretty arbitrarily.
             | 
              | I mentioned only WAL for simplicity, but it also has to
              | modify the index pages themselves and eventually write
              | them out. And that's going to be mostly random
             | I/O. Flash storage is good at handling that, ofc, but if
             | things are adding up like this ...
             | 
             | Not to mention you still have to copy the WAL over network
             | to replica, or perhaps to multiple replicas. And if you
             | have physical backups with PITR, you gotta keep all the WAL
             | somewhere too.
        
               | qaq wrote:
                | sure but not many workloads are writing out a million
               | inserts per second.
        
           | [deleted]
        
           | ok123456 wrote:
           | Then just make a generated column that's a 32bit integer hash
           | of the uuid for this particular case and create an index on
           | that? Use that during expensive queries that blow up your
           | cache locality if it matters.
        
             | insanitybit wrote:
             | A 32bit integer hash won't have locality either.
        
               | ok123456 wrote:
               | Ok. Pick a resolution where you won't get collisions and
               | that is a native datatype.
        
               | patrec wrote:
               | I feel that you don't understand what locality is about:
               | the property of IDs that were generated closely together
               | in time to be close together numerically. You don't get
               | that by "hashing" the UUID (which makes no sense anyway,
               | since you might as well just take some truncation of the
               | UUID). In fact the whole idea behind a hash is to destroy
               | this property of the input data. The reason the numerical
               | locality matters is that it is much more efficient to in-
               | sequence-insert several numbers that are close together
               | into an index than to in-sequence-insert the identical
               | amount of randomly distributed numbers.
               | 
               | The GP was talking about the first graph here:
               | https://www.2ndquadrant.com/en/blog/on-the-impact-of-
               | full-pa...
        
               | ok123456 wrote:
               | I thought you were talking about cache locality within
               | the CPU. Real cache locality.
               | 
               | If you care about the insert location, why not just add a
               | brin index on a timestamp field and use that instead of
               | assuming that the index is sequential.
        
               | patrec wrote:
               | How would adding a BRIN index help in any way with
               | reducing the discussed problems, such as write
               | amplification?
        
               | pgaddict wrote:
               | This is a bit confusing, as it mixes two things - BRIN
               | index and index on a timestamp.
               | 
               | The main source of write amplification comes from
               | updating random pages of the btree index. Imagine
               | inserting 10 random UUID values into a large index - it's
               | pretty likely those will go into 10 different leaf pages.
               | And every first update of a page after a checkpoint
               | (which typically happens every 30 minutes or so), we have
               | to write a FPI (i.e. the whole 8kB page) to WAL. So
               | because btrees are based on ordering, random values end
               | up on random leaf pages, causing write amplification.
               | 
               | If you have BRIN index on UUID column, this does not
               | happen, because the index is not based on ordering but
               | location in the table. If the 10 rows get appended to the
               | same table page, that'll be just 1 write, with one FPI.
               | 
               | This is why BRIN does not have the write amplification
               | issue. But it's also a bit pointless, because BRIN on
               | random data is pretty useless for querying (Well, at
               | least the minmax indexes, are. Let's ignore BRIN bloom
               | indexes here.)
               | 
               | If you create BRIN on timestamp, that's not going to have
               | write amplification problem, and it'll be good for
               | querying. The thing is - BTREE would not have write
               | amplification problem either, because the timestamps are
               | going to be sequential (hence no updates to random leaf
               | pages).
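The leaf-page spread pgaddict describes can be sketched with a toy model (a made-up fanout of 200 keys per page, with random floats standing in for UUIDs; this illustrates the effect, it is not Postgres internals):

```python
import bisect
import random

KEYS_PER_LEAF = 200  # made-up fanout: keys per 8kB leaf page

def leaf_pages_touched(existing_sorted, new_keys):
    """Count the distinct leaf pages a batch of inserts would dirty."""
    return len({bisect.bisect_left(existing_sorted, k) // KEYS_PER_LEAF
                for k in new_keys})

rng = random.Random(42)
existing = sorted(rng.random() for _ in range(1_000_000))  # ~5000 leaves

# 10 random keys (uuid4-like): almost certainly 10 different leaf pages,
# so up to 10 full-page images go to WAL after a checkpoint.
random_pages = leaf_pages_touched(existing, [rng.random() for _ in range(10)])

# 10 ascending keys (timestamp-like): all land on the rightmost leaf.
ascending_pages = leaf_pages_touched(existing, [1.0 + i for i in range(10)])

print(random_pages, ascending_pages)
```

Every distinct leaf page dirtied for the first time after a checkpoint costs a full 8kB page image in WAL, which is the write amplification being discussed.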
        
               | saltcured wrote:
               | Right, you just need to avoid having the btree index on
               | the UUID field. Similarly, don't pose queries to sort by
               | the randomized UUID field either. This is where the often
               | maligned hash index type could be useful, to allow lookup
               | of individual rows by UUID without all the expense of a
               | btree index maintenance. Use other fields for ordering
               | that have better write locality.
               | 
               | What this means in practice, of course, is that you
               | shouldn't expect to do application driven pagination with
               | UUID keys either. You would need to expose some other
               | boundary marker with a total order that works well with
               | btrees. And this could bring you back to "leaking"
               | predictable key material that you were trying to hide by
               | adopting UUIDs...
        
               | insanitybit wrote:
               | CPU cache locality is not "real cache locality", it is
               | just another place where caches exist and locality is
               | optimized for. Your solution wouldn't work for CPUs
               | either, other than that you could fit more data into a
               | cache line - but that's like saying "increase the cache
               | size to solve this problem", which obviously can help but
               | is not addressing the inherent issue.
        
               | insanitybit wrote:
               | Hashes evenly distribute information regardless of the
               | input. You will never have locality unless your dataset
               | is so small that all of it can fit into a single cache
               | line.
               | 
               | Your solution just doesn't really make any sense. The
               | time ordered UUID suggested in the blog post makes way
               | more sense.
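The point about hashes destroying locality is easy to demonstrate: numerically adjacent keys map to unrelated 64-bit values (a sketch; `h64` is a made-up helper):

```python
import hashlib

def h64(n: int) -> int:
    """Hash a key with SHA-256 and keep the top 64 bits of the digest."""
    return int.from_bytes(hashlib.sha256(str(n).encode()).digest()[:8], "big")

# Keys allocated close together in time are numerically adjacent...
keys = [1000, 1001, 1002]
hashes = [h64(k) for k in keys]
# ...but a good hash scatters them across the whole 64-bit space,
# destroying exactly the locality a btree insert path could exploit.
print(hashes)
```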
        
           | unshavedyak wrote:
           | Out of my scope, but why are UUIDs even discussed? ULIDs
           | ~~(and i think Nanoids?)~~ don't suffer these same problems.
           | Locality and ordering alone make me[1] think ordered ULIDs
           | (and friends) are the only thing worth discussing.
           | 
           | Is there some value to UUIDs over ULIDs that make these
           | discussions largely revolve around Autoincrement vs UUIDs
           | rather than Autoincrement vs ULIDs(and friends)?
           | 
           | [1]: Again, totally out of my wheel house.
           | 
           |  _edit_ : Apparently UUIDv7 exists, which is similar to ULID,
           | so my question pertains to UUIDv5 and below, i think.
           | 
            |  _edit2_ : I think Nanoids _do_ suffer the same problem.
            | They're just small... i think.

        
             | ljm wrote:
             | I've worked with ULIDs a bit but honestly haven't operated
             | at a scale where you might run into issues. And the main
             | reason for choosing ULID was because I could only find
             | experimental support for UUIDv6 or UUIDv7, and KSUID
             | (another alternative) only had time precision down to a
             | second.
             | 
             | And the reason for that was to have lexicographically
             | sortable IDs (even if not monotonic, which would require an
             | extra server) so we didn't have to index a timestamp column
             | for chronological ordering. The alternative was to have a
             | traditional integer sequence but it's not always ideal to
             | expose those on your API.
        
               | cratermoon wrote:
               | Out of curiosity, why do your keys need time precision
               | less than a second? Are they carrying some secondary
               | expectations beyond uniqueness and k-sortable?
        
             | letitbeirie wrote:
             | Postgres supports UUIDs natively
        
               | unshavedyak wrote:
               | Is that worth the pain of dealing with randomized
               | inserts? I guess i just don't mind creating a ULID (or i
               | guess UUIDv7 is newly proposed and sortable) and
               | inserting that.
               | 
               | Native DB support is irrelevant to me for randomized bits
               | unless it affects storage, sorting, paging, etc. Does it?
        
               | code-e wrote:
               | It does affect storage and sorting. A native UUID type
               | uses 16 bytes. The alternative is text encoding (32 bytes
               | for hex), or maybe a raw BYTEA. Postgres also has
                | SortSupport for the UUID type: comparisons use an
                | abbreviated key built from the first 8 bytes, and the
                | remaining 8 bytes only need to be compared when those
                | first bytes tie. Combine that with a ULID, where the
                | first half is basically a timestamp, and you'll get
                | performance close to using a single 8-byte BIGSERIAL.
               | 
               | You can also write a plpgsql function to generate these
               | ULIDs in the database.
        
               | lolinder wrote:
               | ULIDs are byte-compatible with UUIDs, so the only thing
               | Postgres's native support gives you over ULIDs is that
               | Postgres can generate new UUIDs for you instead of having
               | to do it in the application before insertion.
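That byte-compatibility can be sketched like this (`new_ulid_bytes` is a hypothetical helper following the ULID layout: a 48-bit millisecond timestamp plus 80 random bits):

```python
import os
import time
import uuid

def new_ulid_bytes() -> bytes:
    """A ULID is a 48-bit millisecond timestamp followed by 80 random
    bits: 16 bytes, the same width as a UUID. (Sketch only; a full ULID
    library also defines the Crockford Base32 text form.)"""
    ms = time.time_ns() // 1_000_000
    return ms.to_bytes(6, "big") + os.urandom(10)

raw = new_ulid_bytes()
as_uuid = uuid.UUID(bytes=raw)  # drop straight into a native uuid column
assert as_uuid.bytes == raw     # round-trips losslessly
print(as_uuid)
```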
        
               | letitbeirie wrote:
               | It also buys you that support in the driver libraries.
               | 
               | I ran into this last year - storing them as UUIDs works
               | great, unless all of the drivers for the language you're
               | using (Go, in my case) try to cast/validate those bytes
               | _as_ a UUID before you can access them.
        
             | xahrepap wrote:
             | ULIDs have made our MySql caches breathe a sigh of relief.
             | 
             | One place we're avoiding ULIDs (and other counters) is in
             | publicly-facing IDs. Preferring random to help keep them
             | unguessable. (say what you will about security-through-
             | obscurity).
             | 
             | So we do ULIDs for private IDs. Random UUIDs for public
             | IDs. Seems to work well.
        
               | ilyt wrote:
               | I wonder if just encrypting them for public usage would
               | be enough here. Then application can convert between
               | public and private representation at will.
        
               | smaddox wrote:
                | KSUIDs have temporal-lexicographical order plus 128
                | bits of entropy, which is more than UUIDv4.
               | 
               | https://github.com/segmentio/ksuid
        
           | catlifeonmars wrote:
           | At scale you're probably sharded across multiple DBs and
           | you're already operating through replicas. Point being you're
           | less likely to hit a warm cache as you scale up anyway as
           | your application layer gets load balanced to different DB
           | endpoints.
        
             | magicalhippo wrote:
             | Reminds me of a talk about ZFS performance, where they
             | presented some benchmark results showing that as the number
              | of concurrent clients doing pure sequential IO increases,
             | the more the load appears as random IO to the filesystem.
             | 
             | So with high enough concurrent load it's effectively all
             | random IO and that's the primary thing worth optimizing
             | for.
        
             | jongjong wrote:
             | Yes exactly, with proper sharding, raw performance is not
             | as important; to some extent, you trade it away for
             | improved concurrency. In fact, I struggle to see how one
             | would implement sharding with auto-incrementing integers
             | (you would get ID collisions for different resources across
             | different shards/database instances); there needs to be a
             | way to uniquely refer to resources across potentially
             | multiple databases and UUIDs are one of the best ways to
             | achieve that.
             | 
             | Auto-incrementing IDs simply don't scale beyond a single
             | host so I don't see how they can be good for scalability.
             | Too many devs conflate raw performance with scalability.
             | They are not the same at all - In fact, high scalability
             | often incurs a performance overhead.
        
               | pgaddict wrote:
               | I respectfully disagree with the notion that sharding
               | makes resource usage somehow less important. Sure, it
               | allows you to overcome the limits that would apply to a
               | single node, but if you stop caring about using resources
               | efficiently (e.g. memory / disk / network bandwidth),
               | it's not very different from building huge and expensive
               | single boxes.
               | 
               | Also, I don't think the blog (or me) suggested going back
               | to using autoincrement IDs. There are other (better)
               | options.
        
               | coredog64 wrote:
               | We did auto-increment integers with multiple servers like
               | 20 years ago: The caveat is that you have to know how
               | many servers are in the set in advance. Each server
               | increments by the population size, and their starting
               | number is their position within the pool.
               | 
               | Not hyper scale, but good enough for failover or a 3-5
               | node setup.
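The scheme can be sketched as follows (`NodeSequence` is a made-up name; MySQL exposes the same idea via its `auto_increment_increment` and `auto_increment_offset` settings):

```python
class NodeSequence:
    """Interleaved auto-increment, as described above: node i out of a
    pool of n servers hands out i, i+n, i+2n, ..., so two nodes can
    never allocate the same ID."""

    def __init__(self, node_id: int, pool_size: int):
        assert 1 <= node_id <= pool_size
        self.next_id = node_id
        self.step = pool_size

    def allocate(self) -> int:
        value = self.next_id
        self.next_id += self.step
        return value

node1 = NodeSequence(node_id=1, pool_size=3)
node2 = NodeSequence(node_id=2, pool_size=3)
print([node1.allocate() for _ in range(3)])  # [1, 4, 7]
print([node2.allocate() for _ in range(3)])  # [2, 5, 8]
```

The caveat from the comment is visible in the code: `pool_size` is baked in, so growing the pool means renumbering or, as suggested below, over-provisioning the step from the start.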
        
               | viraptor wrote:
               | Or increment by more than you need. If you think you'll
               | only ever have 5 nodes, increment by 20. Lots of space
               | for expansion and in practice you'll hit the datatype
               | limits just a bit earlier - you'd need to work around
               | them anyways.
        
               | ilyt wrote:
                | IIRC that's what MySQL Galera is doing by default:
                | 
                |     mysql> CREATE TABLE animals (
                |         ->   id MEDIUMINT NOT NULL AUTO_INCREMENT,
                |         ->   name CHAR(30) NOT NULL,
                |         ->   PRIMARY KEY (id)
                |         -> );
                |     Query OK, 0 rows affected (0.34 sec)
                | 
                |     mysql> INSERT INTO animals (name) VALUES
                |         -> ('dog'),('cat'),('penguin'),
                |         -> ('lax'),('whale'),('ostrich');
                |     Query OK, 6 rows affected (0.01 sec)
                |     Records: 6  Duplicates: 0  Warnings: 0
                | 
                |     mysql> SELECT * FROM animals;
                |     +----+---------+
                |     | id | name    |
                |     +----+---------+
                |     |  3 | dog     |
                |     |  6 | cat     |
                |     |  9 | penguin |
                |     | 12 | lax     |
                |     | 15 | whale   |
                |     | 18 | ostrich |
                |     +----+---------+
                |     6 rows in set (0.00 sec)
        
               | egeozcan wrote:
               | Hi/Lo algorithm also works fine for most cases
        
           | uppiiii765 wrote:
           | Mmmhhh should that not be highly efficient for flash based
           | storage?
        
           | nine_k wrote:
           | If you want locality, speed, simplicity, etc above all, use
           | an incremented integer and be done with it.
           | 
            | UUIDs belong where you _can't afford_ that simplicity. Where
           | you e.g. cannot coordinate the creation of your primary keys.
           | Or where you cannot allow them to be predictable. There you
           | pay the price.
           | 
           | In practice I noticed that the size of PKs and their poor
           | locality start to play a role only after a huge basket of
           | lower-hanging fruit has been collected. There are relatively
           | few places where their role is dramatic.
        
             | marcosdumay wrote:
             | > Where you e.g. cannot coordinate the creation of your
             | primary keys.
             | 
             | You are writing your data into a DBMS. Coordinating the
             | creation of primary keys is one of the cheapest tasks
             | around, if you can't do that, how is your database still
             | online?
             | 
             | > Or where you cannot allow them to be predictable.
             | 
             | You don't need to export your PKs for the rest of the
             | world. You can have non-predictable data outside of your
             | PK. Yes, a different column will still have some of the
             | problems with index maintenance, but it becomes a much
             | smaller problem if only one table cares about the value.
        
               | nine_k wrote:
               | You can almost always opt for artificial PKs that are
               | more performant.
               | 
               | But sometimes you have to make these keys public, e.g. as
               | user or other resource IDs. You want to make them UUIDs
               | so they won't be predictable. Not having to join
               | everywhere with the UUID-to-artificial-PK table may be a
               | bigger performance win than the losses from larger size
               | of UUIDs.
               | 
               | Sometimes you have a distributed / sharded system, and
               | don't want the keys to clash, and also avoid assigning
               | ranges. Sometimes you have to accept someone else's ID,
               | not originating in your system. In cases like that, large
               | random numbers, e.g. UUID v4, work reasonably well.
               | 
               | Of course when you just have one DB, and a relative slow
               | stream of new rows, it's easy to fully control PK
               | creation. And this covers the majority of practical
               | cases.
        
               | jgraettinger1 wrote:
               | We use a macaddr8 that embeds a wall-clock timestamp (so
               | they're ascending order, achieving data locality) with
               | some additional shard and sequence-number bits. It's
               | worked really well for us:
               | 
               | https://github.com/estuary/flow/blob/master/supabase/migr
               | ati...
               | 
               | we use macaddr8 instead of bigint, because it has a
                | postgres serialization / JSON encoding which losslessly
                | round-trips with browsers and it works well with
               | PostgREST. The same CANNOT be said for bigint, which is a
               | huge footgun.
        
             | gregmac wrote:
             | > UUIDs belong where you can't afford that simplicity.
             | Where you e.g. cannot coordinate the creation of your
             | primary keys. Or where you cannot allow them to be
             | predictable. There you pay the price.
             | 
              | You often don't realize you have non-simple needs until
             | your application is reasonably mature, and in production.
             | If you already picked integer keys, you now either forever
             | deal with the issues caused by not using UUIDs, or you deal
             | with the unknown-but-non-zero pain of converting to UUIDs.
        
       | bfrog wrote:
       | Snowflake ids fit this sort of thing better, and maintain
       | temporal locality.
       | 
       | I ran into this exact issue when building out a graph like
       | database in Postgres many moons ago.
       | 
        | It's probably less of an issue with fast NVMe SSDs now, but
       | on mechanical drives ordering and locality are massive.
        
       | jamil7 wrote:
       | Can UUID v7 still be created independently, on a client for
       | instance? Or do they ideally need to be generated in the same
       | place?
        
         | jayknight wrote:
         | They can be generated anywhere, otherwise they wouldn't be
         | universally unique.
        
           | jamil7 wrote:
           | I'm not asking about the uniqueness but about ordering across
           | different clients.
        
             | roenxi wrote:
             | There is probably an impossibility theory somewhere for two
             | clients generating ordered ID without communicating with
             | each other. What if one client was travelling near the
             | speed of light for example? It isn't obvious what order
             | would be correct since the two clients would have wildly
             | different perceptions of time.
        
               | ahoka wrote:
               | Yes, it's impossible in theory. Same as the two generals
               | problem, if I'm not mistaken.
        
             | sibit wrote:
             | If they're generated on the client you would need to keep
             | clock drift[0] in mind since the leading bits are time
             | based.
             | 
             | [0]: https://youtu.be/mAyW-4LeXZo (Clock Synchronization in
             | Distributed Systems by Martin Kleppmann)
        
         | stevesimmons wrote:
         | Yes, UUID v7 can be generated independently. The UUID has three
          | parts, which together maintain strict chronological ordering
          | for UUIDs generated in one process, and ordering to within
          | clock skew when UUIDs are generated on different systems.
         | 
         | The three parts are:
         | 
         | - time-based leading bits.
         | 
         | - sequential counter, so that multiple UUID 7s generated very
         | rapidly within the same process will be monotonic even if the
         | time counter does not increment.
         | 
         | - enough random bits to ensure no collisions and UUIDs cannot
         | be guessed.
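A simplified sketch of those three parts (this follows the draft's general layout of a 48-bit millisecond timestamp, version/variant bits, a 12-bit counter, and a random tail, but it is an illustration, not a compliant implementation):

```python
import os
import time
import uuid

_counter = 0
_last_ms = 0

def uuid7_sketch() -> uuid.UUID:
    """Build a UUIDv7-style value: time bits, then a counter that keeps
    same-millisecond IDs monotonic, then random bits."""
    global _counter, _last_ms
    ms = time.time_ns() // 1_000_000
    if ms == _last_ms:
        _counter = (_counter + 1) & 0xFFF  # monotonic within the same ms
    else:
        _counter, _last_ms = 0, ms
    rand = int.from_bytes(os.urandom(8), "big") & 0x3FFFFFFFFFFFFFFF
    value = (ms & 0xFFFFFFFFFFFF) << 80    # 48-bit unix-ms timestamp
    value |= 0x7 << 76                     # version 7 nibble
    value |= _counter << 64                # counter in the rand_a field
    value |= 0b10 << 62                    # RFC 4122 variant bits
    value |= rand                          # 62 random bits
    return uuid.UUID(int=value)

print(uuid7_sketch(), uuid7_sketch())
```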
        
           | jamil7 wrote:
           | Thank you.
        
           | westurner wrote:
           | > _There are new UUID formats that are timestamp-sortable;
            | for when blockchain cryptographic hashes aren't enough
           | entropy._
           | 
           | Note that multiple rounds of cryptographic hashing is not
           | considered sufficient anymore; PBKDF2 and Argon2 are Key
           | Derivation Functions, and those are used instead of hash
           | functions.
           | 
           | "New UUID Formats - IETF Draft"
           | https://news.ycombinator.com/item?id=28088213
           | 
           | draft-peabody-dispatch-new-uuid-format-04 Internet-Draft "New
           | UUID Formats" https://datatracker.ietf.org/doc/html/draft-
           | peabody-dispatch... ; UUID6, UUID7, UUID8
        
           | jongjong wrote:
           | I don't have a problem with UUIDv7 but I wish they had
           | created it as a completely separate standard after UUIDv4...
           | The version number increases imply that previous versions
           | have been superseded but in fact, they just have different
           | priorities...
           | 
           | And TBH I'm concerned that UUIDv1 had a timestamp, then it
           | was removed completely in UUIDv4 and now the timestamp
           | concept is being added back to UUIDv7... There are legitimate
           | use cases where you simply don't want to have timestamps in
           | your IDs.
        
             | ahoka wrote:
             | These are not versions in the sense that one supersedes the
             | other. More like variants.
        
               | jongjong wrote:
               | I hope devs and security consultants will see it that
               | way. I can easily envision a future where security
               | consultants would write reports flagging systems as
               | 'unsecure' on the grounds that they use 'an outdated UUID
               | version'.
               | 
               | Security consultants can be quite blunt in their approach
               | and companies will often yield to their every demand for
               | the sake of easy compliance and to avoid having to
               | explain stuff.
        
               | remram wrote:
               | I wish they could feel that way about IP versions...
        
       | NicoJuicy wrote:
        | Tbh. I use UUIDs a lot; perhaps useful to know that
        | deterministic UUIDs exist,
       | 
       | eg. DeterministicGuid.Create(tenantNamespace, "my-tenant-name");
       | 
       | Note: https://github.com/Informatievlaanderen/deterministic-
       | guid-g... for dotnet.
       | 
       | Additionally, know that there are different versions of UUID's.
       | If you want to create chronologic UUID's so that queries are more
       | easily ordered, use the correct type.
        
       | BiteCode_dev wrote:
       | That's... not unexpected. That's been the known technical
       | compromise of UUID for years, and why ULID and UUID 7 are born.
       | 
       | And usually it is still worth it.
        
         | Tehnix wrote:
         | The main problem is probably that you will not find a "regular
         | Engineer" knowing this unless they've had to deal with scaling
         | issues on a DB (e.g. high IOPS etc).
         | 
         | The knowledge is definitely out there, but almost all articles
         | will be reaching for UUIDs, so this is understandably what most
         | people end up with :/
        
       | Timpy wrote:
       | For me, the visual noise is a downside for UUIDs. A lot of time
       | investigating data issues means glancing at query results and
       | deciding if something looks unexpected. I just can't parse a UUID
       | with my eyes that quick. I've come up against this at my current
       | job a little too often and I'm cursing the decision to switch to
       | UUIDs. I know, it's a balancing act of competing concerns. But
       | for us moving to UUIDs was future proofing for a problem that
       | never arose, and the senior dev who made us switch left the
       | company a few years ago. I just don't like the dev experience of
       | UUIDs in a world where you do have to get hands (and eyes) on the
       | data all the time.
        
         | Pet_Ant wrote:
         | Something I saw a longtime ago was hashing UUIDs into a RGB
         | value and then colouring the background of the cell (at least
         | when viewing in a DB IDE). That way you can quickly at a glance
         | tell if it's worth going over character by character.
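A minimal sketch of the idea (`uuid_to_rgb` is a made-up helper; any stable hash works, the colours just need to be deterministic):

```python
import hashlib
import uuid

def uuid_to_rgb(u: uuid.UUID):
    """Derive a stable colour from a UUID: hash its bytes and use the
    first three digest bytes as R, G, B. Equal UUIDs always get the same
    colour, so a mismatch jumps out before you read a single character."""
    digest = hashlib.md5(u.bytes).digest()
    return (digest[0], digest[1], digest[2])

u1 = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
u2 = uuid.UUID("123e4567-e89b-12d3-a456-426614174001")
print(uuid_to_rgb(u1), uuid_to_rgb(u2))
```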
        
           | adverbly wrote:
           | Holy crap this is brilliant. I'd just change the text color
           | of the uuid itself though - no need to change the
            | background (although you'd need to make sure that the hash
           | function avoided your existing background color to maintain a
           | high enough contrast ratio for readability).
           | 
           | I want this as a vscode extension.
           | 
           | Or maybe as a terminal plugin or something? Is that possible?
           | Could tmux do it maybe?
        
           | Timpy wrote:
           | Wow I really love that idea. Next time I get stuck on a fix
           | tinkering with my setup I'm going to check this out.
        
         | michaelsalim wrote:
         | Interesting take. I do agree there's some visual noise there,
         | but not how you described it. UUID just takes so much screen
         | space compared to its integer counterpart. If you don't have a
         | good UI sorting things out it's quite annoying. Otherwise it's
         | pretty great.
        
           | apelapan wrote:
            | Visual noise and screen space hogging annoy me a lot.
           | Especially when you end up with multiple UUID:s on the same
           | log line.
           | 
           | If one doesn't _truly_ need distributed creation of globally
           | unique identifiers, it is so much nicer with base35-encoded
           | integer sequences. Preferably loosely based on a timestamp.
        
           | arp242 wrote:
           | It's kind of a tooling problem; no reason tooling can't do
           | the same as git and display "1509af9" instead of
           | "1509af9af2634d16f8d9b98e01a0166a49185474".
           | 
           | Also I wish these sort of things would get encoded in base-36
           | (0-9 a-z) instead of base-16; that would help too.
        
             | ComputerGuru wrote:
              | ULID is an ordered UUID canonically rendered not in
              | Base36 but in Crockford's Base32 (to make it easier to
              | read, write, read aloud, and error-correct).
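For reference, a minimal sketch of that encoding (just the base alphabet; the optional check-symbol handling from Crockford's scheme is omitted):

```python
# Crockford's Base32 alphabet: digits plus letters, skipping I, L, O and U
# so transcription mix-ups like "1" vs "I" or "0" vs "O" can't happen.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def crockford32(data: bytes) -> str:
    """Encode bytes in Crockford Base32, 5 bits per output character."""
    n = int.from_bytes(data, "big")
    chars = []
    for _ in range(-(-len(data) * 8 // 5)):  # ceil(bits / 5) digits
        chars.append(ALPHABET[n & 0x1F])
        n >>= 5
    return "".join(reversed(chars))

# A 16-byte ULID encodes to 26 characters, vs 36 for hyphenated hex.
print(crockford32(b"\xff" * 16))  # 7ZZZZZZZZZZZZZZZZZZZZZZZZZ
```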
        
         | SkyPuncher wrote:
         | Interestingly, my experience has been the opposite. It's
         | incredibly easy to search and filter logs with UUIDs. You can
         | search for just the UUID without any other information.
        
         | [deleted]
        
       | londons_explore wrote:
       | There are downsides to anything ascending/descending in a
       | database too...
       | 
       | They mess with table statistics. If you do a query like "SELECT *
       | from users where creation_date > NOW-1h", the query analyzer
       | doesn't know that there might be thousands of users created in
       | the last hour. It is probably working from day-old statistics
       | that say all users have a creation_date between 2008 and
       | 2023-06-21.
       | 
       | That makes it sometimes pick an exceptionally poor query plan.
       | Ie. instead of your query taking 50 milliseconds, it might take
       | 50 hours and involve an n^2 scan of all data in your database.
        
       | willvarfar wrote:
       | > if you want to use UUID's, try to use a sequential variant.
       | UUID v7 is a good option. Hopefully it's coming to PostgreSQL 17
       | 
       | Yeap, UUIDv7 directly addresses the issues in the article. Warmly
       | recommended.
       | 
       | (There is the issue that exposing ULIDs to users may 'leak'
       | information about when things were created etc, but that is
       | usually not a problem.)
        
         | rollcat wrote:
         | I also like UUIDv8. It's broadly similar to v7 but the vendor
         | is free to define how the timestamps are represented, how many
         | bits to split between timestamps/randomness, define own
         | packing, etc.
         | 
         | So we've started making UUIDs that encode the current ISO8601
         | date+time in a human-readable format:
         | YYYYMMDD-HHMM-VRRR-RRRR-RRRRRRRRR
         | 
         | This is especially useful for things you have few of (no more
         | than a couple per minute), that you regularly need to cross-
         | reference with other systems (e.g. files on S3).
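A sketch of such a scheme (the exact packing here is invented for illustration; v8 deliberately leaves the layout to the vendor, subject only to the version and variant bits):

```python
import datetime
import secrets
import uuid

def readable_uuid8() -> uuid.UUID:
    """Build a v8-style UUID whose text form leads with a readable
    timestamp: YYYYMMDD-HHMM-8RRR-VRRR-RRRRRRRRRRRR."""
    now = datetime.datetime.now(datetime.timezone.utc)
    head = now.strftime("%Y%m%d%H%M")  # 12 decimal digits, all hex-safe
    rand = secrets.token_hex(9)        # 18 random hex digits
    variant = secrets.choice("89ab")   # keep the RFC variant bits valid
    # 12 timestamp digits + version nibble '8' + 19 remaining digits = 32
    return uuid.UUID(hex=head + "8" + rand[:3] + variant + rand[3:])

print(readable_uuid8())  # e.g. 20230622-2302-8f3a-a1b2-...
```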
        
           | pgaddict wrote:
            | The one problem with these "perfectly" sequential UUIDs is
            | that they can easily lead to index bloat. Imagine you have such
           | sequential UUIDs generated over a year, for example. And then
           | you delete e.g. 99% of old data (say, everything except some
           | records that you're required to keep for audit purposes or
           | whatever).
           | 
           | If there was an index, the "old" part will be 99% empty. For
           | regular UUIDs this would be fine, because new entries would
           | get routed to this part of the index and the space would be
           | reused. Not so for sequential UUIDs (v7/v8).
           | 
            | This is mostly why a year ago I wrote the "sequential-uuids"
            | extension, doing roughly what v7/v8 do, but wrapping the
           | timestamp once in a while.
           | 
           | Of course, if you don't delete data, this is not an issue and
           | v7/v8 will work fine.
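The wrapping-prefix idea can be sketched like this (the constants and names are made up for illustration, not the extension's actual parameters):

```python
import os
import time

BLOCK_SECONDS = 60    # move to a new prefix every minute (made-up value)
PREFIX_SPACE = 65536  # ...and wrap around after 64k blocks (~45 days)

def wrapping_sequential_id() -> bytes:
    """A small time-based prefix that wraps, followed by random bytes.
    New inserts eventually cycle back into old regions of the index, so
    pages left sparse by deletions get refilled instead of staying
    bloated, while inserts within one block still cluster together."""
    prefix = (int(time.time()) // BLOCK_SECONDS) % PREFIX_SPACE
    return prefix.to_bytes(2, "big") + os.urandom(14)

print(wrapping_sequential_id().hex())
```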
        
             | rollcat wrote:
             | Different UUID generators for different use cases! I'm also
             | a fan of v5 for generating stable identifiers, when
             | tracking things such as DNS names or URLs. Useful in web
             | crawlers, host inventories, etc.
        
           | JdeBP wrote:
           | Enough iterations in this direction, and people will come
           | full circle to how message IDs operated on Usenet.
           | <datetime.processID.sequencenumber@domain> for example.
        
       | pfooti wrote:
       | Weirdly, and this may be a side effect of writing code that uses
       | Cloud Spanner, but I tend to think of uuidv4's random
       | distribution through keyspace as a _good_ thing.
       | 
       | My instinct is to not use sequential-ish indices/primary keys,
       | because I don't want to hotspot one part of the storage with all
       | my writes for today in the same tablet.
        
       ___________________________________________________________________
       (page generated 2023-06-22 23:02 UTC)