[HN Gopher] PostgreSQL UUID vs. Serial vs. Identity
       ___________________________________________________________________
        
       PostgreSQL UUID vs. Serial vs. Identity
        
       Author : lhenk
       Score  : 133 points
       Date   : 2021-05-31 16:38 UTC (6 hours ago)
        
 (HTM) web link (www.cybertec-postgresql.com)
 (TXT) w3m dump (www.cybertec-postgresql.com)
        
       | pmontra wrote:
       | Meta: this company wrote an impressive number of articles about
       | PostgreSQL since 2013. List at https://www.cybertec-
       | postgresql.com/en/tag/postgresql/
        
         | aidos wrote:
         | I just had to do a double take as I was reading a stack
         | overflow post at the same time and recognised it as the same
         | author.
        
           | lhenk wrote:
           | Laurenz (the author) was Postgres person of the week not too
           | long ago: https://postgresql.life/post/laurenz_albe/
        
         | lhenk wrote:
         | Also, here's a list of blog posts from Laurenz Albe (the author
         | of the OP post): https://www.cybertec-
         | postgresql.com/en/author/cybertec_albe/ His blog posts are a
         | great read, I'd recommend checking them out!
        
       | barrkel wrote:
       | Another point: if there's any temporal locality to your future
       | access patterns - if you're more likely to access multiple rows
       | which were inserted at roughly the same time - then allocating
       | sequential identifiers brings those entries closer together in
       | the primary key index.
       | 
       | I used to work on a reconciliation system which inserted all its
       | results into the database. Only the most recent results were
       | heavily queried, with a long tail of occasional lookups into
       | older results. We never had a problem with primary key indexes
       | (though this was in MySQL, which uses a clustered index on the
       | primary key for row storage, so it's an even bigger benefit); the
       | MD5 column used for identifying repeating data, on the other
       | hand, would blow out the cache on large customers' instances.
        
       | 3pt14159 wrote:
       | > Now, sometimes a table has a natural primary key, for example
       | the social security number of a country's citizens.
       | 
        | You know, you think that, but it's never that simple. The field
        | was added incorrectly and nobody noticed until the value was in
        | countless tables that now need to be updated simultaneously; or
        | the value is supposed to be semi-secret, so low-level support
        | staff can't reference the row when dealing with a request; or the
        | table's requirements change and now you need to track two
        | different kinds of data, or data that is missing the field.
       | 
       | Me, I always just have the table make its own ID. It is just
       | simpler, even when you _think_ it is overkill.
        
       | vbsteven wrote:
       | I'm currently prototyping a little database+api+cli todo app and
       | I want identifiers that can be abbreviated in the same way as
       | partial git commit hashes can be used on the command line. What
       | should I use?
       | 
       | I was thinking of generating random character strings and simply
       | retry when the db throws duplicate key error on insert. No
       | sharding is necessary and I'd like to have efficient foreign
       | keys. Any thoughts?
        
         | alexis2b wrote:
         | You could try NanoID[0]? Seems available in many languages.
         | 
         | [0] https://blog.bibekkakati.me/nanoid-alternative-to-uuid
        
         | mutatio wrote:
         | You could use a serial int and just hex-encode when interacting
         | with the CLI? You could then use range queries to match short
          | hashes by zeroing out the remaining bytes and using >=.
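          | 
          | Roughly, with a made-up todos table: show the id as zero-padded
          | hex, and resolve abbreviations with a prefix match on an
          | expression index (equivalent to the >= range approach).
          | 
          |     -- display: git-style short ids
          |     SELECT lpad(to_hex(id), 8, '0') AS short_id, title
          |     FROM todos;
          | 
          |     -- resolve an abbreviation such as 'a3f'
          |     CREATE INDEX todos_hex_idx
          |         ON todos ((lpad(to_hex(id), 8, '0')) text_pattern_ops);
          | 
          |     SELECT * FROM todos
          |     WHERE lpad(to_hex(id), 8, '0') LIKE 'a3f%';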
        
         | adav wrote:
         | Check out linear congruential generators or other pseudorandom
         | number generators. Then map the resulting number to letters.
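          | 
          | A rough sketch of one step of an LCG in SQL: multiplying by an
          | odd constant and adding an offset is a bijection mod 2^31, so
          | each serial id maps to a unique, random-looking value (which
          | you could then encode as letters).
          | 
          |     SELECT (id::bigint * 1103515245 + 12345) % 2147483648
          |         AS obfuscated_id
          |     FROM todos;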
        
       | conradfr wrote:
       | UUIDs are great when you use the id "publicly" but using an
       | incremental value would be too revealing for different reasons.
       | 
        | So it's good to know that performance is not bad.
        
         | hu3 wrote:
         | On the other hand, if you can get away with incremental ids it
         | makes debugging much easier during development.
        
           | kaliszad wrote:
            | Not really. You should develop better tooling to visualize
            | debugging information. Today's serious systems (which in my
            | opinion include e.g. collaborative rich text editors) are
            | just too complicated to eyeball. Pavel, a colleague of mine,
            | is developing a new collaborative rich text editor for
            | OrgPad, and here is how we do some testing currently:
            | https://www.youtube.com/watch?v=VeVcNmNFzmc We use UUIDs for
            | basically everything. It is simple, and we have good tools
            | for working with UUIDs. When we present any IDs to users,
            | they are URL-safe Base64-encoded UUIDs. In some places we
            | have "short" links that are about half the length - they
            | cannot really be guessed but are much shorter to type, copy &
            | paste, and visually compare for our customers, who are not
            | always super computer literate. :-)
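            | 
            | Not our actual code, but in Postgres alone such an encoding
            | could look roughly like this (gen_random_uuid() is built in
            | since v13, pgcrypto before that):
            | 
            |     -- 22 URL-safe characters from a 16-byte UUID
            |     SELECT translate(
            |              rtrim(encode(uuid_send(gen_random_uuid()),
            |                           'base64'),
            |                    '='),
            |              '+/', '-_') AS short_id;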
        
             | throwawayboise wrote:
             | You're not wrong, but (as I suspect is the case with a lot
             | of us) the vast majority of my work is CRUD and I don't
             | reach for heavyweight debugging tools unless printf() or
             | the equivalent fails me (which is rare). Integer IDs work
             | great in this situation.
        
         | staticassertion wrote:
         | I think ideally your primary key is whatever makes sense for
         | your performance/data model, and then if you want to delegate
         | authority with UUIDs you do that via a separate mapping.
         | 
         | By separating that out you can get a lot:
         | 
         | 1. You can extend your delegate system by modifying the
         | delegate table, rather than having to muddy your data model
         | with authority information
         | 
         | 2. You can TTL the mappings in the delegate table
         | 
         | 3. You can have many mappings to the same data, but each one
         | can have its own restrictions.
         | 
          | It's a bit more complex, but you end up with a system that
          | really homes in on the benefit that you're bringing up.
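          | 
          | A rough sketch of such a delegate table (names made up, keyed
          | to the points above):
          | 
          |     CREATE TABLE public_handles (
          |         public_id  uuid PRIMARY KEY DEFAULT gen_random_uuid(),
          |         item_id    bigint NOT NULL REFERENCES items (id),
          |         expires_at timestamptz,            -- 2: TTL
          |         read_only  boolean NOT NULL DEFAULT true  -- 3: limits
          |     );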
        
         | tehlike wrote:
          | There's another benefit to UUID - you can generate them
          | anywhere, including the application side.
          | 
          | Doing this on the application side has tremendous benefits for
          | batching, or for inserting objects with relationships at the
          | same time (vs. waiting for the first insert to return an ID to
          | be used in the FK).
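          | 
          | For example (made-up tables, placeholder UUIDs), both inserts
          | can go out in a single batch because the FK value is already
          | known:
          | 
          |     INSERT INTO orders (id, customer)
          |     VALUES ('11111111-1111-1111-1111-111111111111', 'acme');
          | 
          |     INSERT INTO order_lines (id, order_id, sku)
          |     VALUES ('22222222-2222-2222-2222-222222222222',
          |             '11111111-1111-1111-1111-111111111111', 'SKU-1');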
        
         | dbbk wrote:
         | I've stuck with incremental values internally but use Hashid to
         | convert them when exposed publicly. Seems to work well.
        
         | throwawayboise wrote:
          | I don't think much of the argument that integer IDs reveal too
          | much.
         | 
         | Yes, they are guessable but your application should not rely
         | solely on the "secrecy" of the ID to authorize access to a
         | record. If you are worried about someone crawling your public
         | API with wget or curl and an incrementing counter you should
         | re-think whether your data are really public or not, or maybe
         | rate-limit anonymous users, etc.
         | 
         | They also reveal something about the total number of records in
         | your database, I guess that could matter in some contexts but
         | it's never really been an issue in practice for me.
         | 
         | I have definitely used the technique of defining non-
         | overlapping sequences in different shards (with Oracle, not
         | Postgres, but they are very similar in this regard). It worked
         | very well and was easy to reason about.
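          | 
          | In Postgres the same trick is one line per shard (made-up
          | sequence name): shard 0 hands out 1, 3, 5, ... and shard 1
          | hands out 2, 4, 6, ...
          | 
          |     CREATE SEQUENCE users_id_seq START 1 INCREMENT 2;  -- shard 0
          |     CREATE SEQUENCE users_id_seq START 2 INCREMENT 2;  -- shard 1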
         | 
         | As a developer, the big issue I have with UUIDs is that they
         | are impossible to read or type. You have to copy/paste and it
         | isn't easy to look at two UUIDs and tell if they are different.
         | 
         | I use integers in general unless the specific use case demands
         | something else.
        
           | friend-monoid wrote:
           | I had this issue in a case I think is interesting; a customer
           | had a database with incremental IDs of a certain product they
           | sold. On a web platform, the product owner in turn could log
           | in and view a list of their products and their status. The id
           | of the product was part of the URL; /product/851. Of course,
           | the product owners could not get any information on IDs they
           | didn't own, but the numbers gave away info on how many
           | devices existed before them. And they wanted to hide that
           | information.
           | 
           | Of course, there are many ways to solve that situation, but
            | UUIDs are one.
        
           | tgv wrote:
           | So we run surveys among general and specialized audiences
           | (among other things), and these surveys link to custom
           | scripting, images, videos, etc. The URLs have to be freely
           | accessible, but if they are sequential, anyone can simply try
           | to guess what's in other surveys, potentially getting
           | information about their competitors.
        
           | sombremesa wrote:
           | > Yes, they are guessable but your application should not
           | rely solely on the "secrecy" of the ID to authorize access to
           | a record
           | 
           |  _Any_ information you give to a potentially malicious actor
           | can help them attack you. If you have a choice between
           | leaking information and not leaking information, I can't
           | imagine why you would ever intentionally go with the former,
           | unless you didn't actually have a choice (feasibility, etc.).
           | 
           | As an example, maybe I needed to CSRF someone but needed
           | their ID (say, in a b2b app where IDs are not generally
           | public) - with sequential IDs I have a decent chance of
           | cracking it with a small number of requests, especially if
           | it's a small userbase. Sure, the CSRF was the main issue, but
           | this is a contrived example to illustrate the point.
           | 
           | Admittedly, IDs are oftentimes public information by
           | necessity - but there's no need to allow them to leak extra
           | information.
        
       | cratermoon wrote:
        | Postgres (and other relational DBs) really need to implement
        | something like Snowflake[1] or KSUID[2].
       | 
       | 1 https://blog.twitter.com/engineering/en_us/a/2010/announcing...
       | 
       | 2 https://segment.com/blog/a-brief-history-of-the-uuid/
        
       | staticassertion wrote:
       | Another benefit of using sequential integers is that you can
       | leverage a number of optimizations.
       | 
       | For one thing you can represent a range of data more efficiently
       | by just storing offsets. This means that instead of having to
       | store a 'start' and 'end' at 8 + 8 bytes you can store something
       | like 'start' and 'offset', where offset could be based on your
       | window size, like 2 bytes.
       | 
       | You can leverage those offsets in metadata too. For example, I
       | could cache something like 'rows (N..N+Offset) all have field X
       | set to null' or some such thing. Now I can query my cache for a
       | given value and avoid the db lookup, but I can also store way
       | more data in the cache since I can encode ranges. Obviously which
       | things you cache are going to be data dependent.
       | 
       | Sequential ints make great external indexes for this reason.
       | Maybe I tombstone rows in big chunks to some other data store -
       | again, I can just encode that as a range, and then given a lookup
       | within that range I know to look in the other datastore. With a
       | uuid approach I'd have to tombstone each row individually.
       | 
       | These aren't universal optimizations but if you _can_ leverage
       | them they can be significant.
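        | 
        | A sketch of the tombstone-range idea under those assumptions
        | (names made up):
        | 
        |     CREATE TABLE tombstoned_chunks (
        |         start_id bigint   PRIMARY KEY,
        |         width    smallint NOT NULL  -- start_id .. start_id+width-1
        |     );
        | 
        |     -- does id 123456 live in the other datastore?
        |     SELECT EXISTS (
        |         SELECT 1 FROM tombstoned_chunks
        |         WHERE start_id <= 123456 AND 123456 < start_id + width
        |     );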
        
         | lgas wrote:
         | Doesn't the offset approach run into trouble when sequence
         | values get skipped due to rollbacks?
        
           | staticassertion wrote:
           | It's going to be an optimization that assumes some
           | constraints on how you interact with your database.
        
       | ainar-g wrote:
       | I don't think I've ever seen this mentioned anywhere, but if you
       | need a unique ID for an entity with not a lot of records planned
       | (<=10,000,000), why not use a random int64 with a simple for loop
       | on the application side to catch the occasional collisions? Are
       | there any downsides besides making the application side a tiny
       | bit more complex?
        
         | stouset wrote:
          | If you have this situation, it's not even worth the time or
          | effort to avoid UUIDs - just use them. You're adding
          | complication for no gain whatsoever.
        
         | staticassertion wrote:
         | Is the goal here to save space?
        
           | ainar-g wrote:
            | The goal is to have a non-sequential ID (for example, to hide
            | the information about the actual size of the data set) in a
            | situation where one needs to support a storage backend that
            | doesn't have native support for UUIDs (beyond just "convert
            | to TEXT") and reuse as much of the schema as possible without
            | [ab]using ORMs. For example, an application that has to
            | support SQLite as well as PostgreSQL and maybe some other
            | storage backends.
        
         | dragonwriter wrote:
         | > if you need a unique ID for an entity with not a lot of
         | records planned (<=10,000,000), why not use a random int64 with
         | a simple for loop on the application side to catch the
         | occasional collisions?
         | 
         | What's the use case for this where UUIDv4 or sequential ID
         | isn't better? Because it sounds like a solution in search of a
         | problem.
         | 
         | > Are there any downsides besides making the application side a
         | tiny bit more complex?
         | 
         | Are there any upsides to warrant the complexity?
        
           | nirvdrum wrote:
           | I wouldn't say it's a particularly great upside, but I've
           | found tooling doesn't tend to work as well with UUID keys as
           | it does integer keys. E.g., Hasura won't map UUID keys to the
           | GraphQL ID type [1], which makes working with them
           | unnecessarily difficult. Arguably, the issue should be fixed
            | with the tool (or a different tool chosen), but there are
            | only so many battles to pick and sometimes that decision is
            | out of your control.
           | 
           | [1] -- https://github.com/hasura/graphql-engine/issues/3578
        
           | kortex wrote:
           | UUIDv4 has performance implications. But I agree, if you are
           | already coupled to the DB (due to the check loop),
           | _generally_ you might as well use sequential.
           | 
           | Seems too easy to screw up.
        
         | jeff-davis wrote:
         | You can also do the loop on the server side using PL/pgSQL:
         | 
         | https://www.postgresql.org/docs/current/plpgsql-control-stru...
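          | 
          | A minimal sketch of that kind of loop (made-up table; it leans
          | on the unique constraint to catch collisions):
          | 
          |     CREATE FUNCTION insert_item(p_name text) RETURNS bigint
          |     LANGUAGE plpgsql AS $$
          |     DECLARE
          |         new_id bigint;
          |     BEGIN
          |         LOOP
          |             -- random positive 63-bit-ish id
          |             new_id := floor(random() * 9e18)::bigint;
          |             BEGIN
          |                 INSERT INTO items (id, name)
          |                 VALUES (new_id, p_name);
          |                 RETURN new_id;
          |             EXCEPTION WHEN unique_violation THEN
          |                 NULL;  -- collision: try another id
          |             END;
          |         END LOOP;
          |     END;
          |     $$;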
        
           | megous wrote:
           | That's not that trivial. You can't just loop to get a unique
           | ID. Maybe if you lock the whole table for reads first, which
           | is quite drastic.
        
         | NicoJuicy wrote:
         | The point of guids/uuids is no collision... Ever... Including
         | for large datasets. Even when you merge multiple together in 10
         | years.
         | 
         | For your use-case you could use incremental IDs.
        
         | kstrauser wrote:
          | That's the UUID approach, but worse. According to the birthday
          | problem[1], you're 50% likely to get a collision in 64-bit
          | numbers after about 5 billion insertions. That's not an awful
         | lot. Replace that with a 128-bit UUID and you'd have to insert
         | 22,000,000,000,000,000,000 rows to get a 50% chance. That's
         | probably less likely than a cosmic ray flipping a random bit in
         | RAM and corrupting the index that way.
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...
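          | 
          | The usual approximation behind those figures, for n random
          | values drawn from a space of size N:
          | 
          |     p(n) \approx 1 - e^{-n^2 / (2N)}, \qquad
          |     n_{50\%} \approx 1.177 \sqrt{N}
          | 
          | so N = 2^64 gives roughly 5 billion and N = 2^128 gives roughly
          | 2.2 x 10^19.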
        
           | jeff-davis wrote:
            | The post qualified it as <= 10,000,000 total records. For
            | that number of records, there's a chance of about 0.00001
            | that you get a collision, assuming good randomness.
        
             | kstrauser wrote:
             | Sure, but stuff always grows, and the experiment gets run a
             | bunch of times. Why not go with the built-in solution and
             | then not have to worry about it?
        
               | jeff-davis wrote:
               | I'm just answering the poster's question directly; but in
               | the general case, I agree with you. The cognitive
                | overhead of dealing with the various "what ifs" usually
                | isn't worth the couple of bytes or cycles that you could
                | save.
        
           | deckard1 wrote:
           | for what it's worth, YouTube still uses 11 character base64
           | strings for their video ids, which are assumed to be 64-bit
           | ints. They also allow unlisted videos, which people usually
           | take to mean "semi-private".
           | 
           | It's an interesting tradeoff. The UX of the smaller YouTube
           | video id links is probably of some benefit to them. Plus they
           | have private videos for when you really don't want your video
           | to be viewed, with unlisted being the middle ground of easy
           | sharing but also keeping it exclusive.
        
             | kstrauser wrote:
             | Sure, and it makes sense there: write a service that
             | returns unique 64 bit ints and encapsulate the complexity
             | inside that one location. That's easier than making every
             | `insert` in your app code have to do a `while not unique`
             | loop.
        
           | oconnore wrote:
           | Getting a collision with this approach doesn't matter -- the
           | whole point is to loop if you do get a collision. The only
           | issue is getting a long string of sequential collisions,
           | which is highly unlikely.
        
             | kstrauser wrote:
              | But now you've traded code complexity for a few bytes of
              | storage. That's just not worth it.
        
               | sa46 wrote:
                | 8 extra bytes per row and per foreign key reference
                | relative to an int64 can add up quickly, especially if
                | the row is small. I agree it's not typically the right
                | trade-off, but it's not as absolute as you claim.
        
               | throwawayboise wrote:
               | This is why we need 2TB drives now, when we used to get
               | by with 2GB.
        
         | leephillips wrote:
         | I have a web service where I need to generate access tokens of
         | five characters. I just generate a random one and check to see
         | if it's already in use. Been working fine for years.
        
       | [deleted]
        
       | simonw wrote:
       | Something I really like about integer incrementing IDs is that
        | you can run ad hoc "select * from table order by id desc limit
       | 10" queries to see the most recently inserted elements. I end up
       | doing this a lot when I'm trying to figure out how my
       | applications are currently being used.
       | 
       | Strictly incrementing UUIDs can offer the same benefit.
        
       | topspin wrote:
       | I just started a little side project and chose to use UUID for
       | Postgresql keys. The schema is highly generic and I anticipate
       | the possibility of merging instances. UUID precludes collisions
       | in such a case.
        
         | RedShift1 wrote:
         | That includes foreign keys?
        
       | magicpointer wrote:
       | About UUID as Primary Key and performance, the following article
       | has some insights and benchmarks as well:
       | https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...
       | 
        | Essentially, they observed sizeable performance improvements by
        | using UUID generators that are tweaked to produce more sequential
        | results, which leads to better indexes. The article compares
        | sequences, random UUIDs and two kinds of sequential-ish UUID
        | generators.
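        | 
        | Not the generators from the article, but the idea can be sketched
        | in plain SQL: keep most of the UUID random and overwrite the
        | first two bytes with a slowly advancing, wrapping counter so that
        | rows inserted around the same time share an index prefix.
        | 
        |     CREATE FUNCTION seqish_uuid() RETURNS uuid
        |     LANGUAGE sql AS $$
        |         -- the 2-byte prefix advances once a minute and wraps
        |         -- after ~45 days
        |         SELECT encode(set_byte(set_byte(u, 0, b / 256),
        |                                1, b % 256),
        |                       'hex')::uuid
        |         FROM (SELECT uuid_send(gen_random_uuid()) AS u,
        |                      floor(extract(epoch FROM clock_timestamp())
        |                            / 60)::int % 65536 AS b) AS s
        |     $$;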
        
         | Matrixik wrote:
         | There is also CUID:
         | 
         | https://github.com/ericelliott/cuid
        
         | tehlike wrote:
         | Mentioned this in a sibling comment:
         | 
          | There's another benefit to UUID - you can generate them
          | anywhere, including the application side. Doing this on the
          | application side has tremendous benefits for batching, or for
          | inserting objects with relationships at the same time (vs.
          | waiting for the first insert to return an ID to be used in the
          | FK).
        
           | zepolen wrote:
           | Another benefit to UUIDs is merging tables/databases without
           | key conflict.
        
           | jeltz wrote:
           | You can just use PostgreSQL's writeable CTEs to get the same
           | batching benefits plus the benefits from using serials. So,
           | no, I do not think batching is a good reason for using UUIDs.
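            | 
            | For example (made-up tables), one statement inserts a parent
            | and its children, with the serial id flowing through
            | RETURNING:
            | 
            |     WITH new_order AS (
            |         INSERT INTO orders (customer) VALUES ('acme')
            |         RETURNING id
            |     )
            |     INSERT INTO order_lines (order_id, sku)
            |     SELECT new_order.id, v.sku
            |     FROM new_order, (VALUES ('SKU-1'), ('SKU-2')) AS v(sku);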
        
           | Benjammer wrote:
           | I think you can probably get the same "batching benefits" if
           | you use a global ID generation service of some sort, with
           | sequential IDs to improve indexing. Using sequential IDs
           | doesn't necessitate using _auto-generated_ sequential IDs.
        
             | tehlike wrote:
              | Correct, but a global service needs local caching. Some
              | "HiLo" type identity generation could work. Sequential
              | UUIDs simplify it as you don't need that service, so
              | they're preferable.
              | 
              | HiLo explanation below:
              | 
              | A client simply gets a range of ids to be used, exclusive
              | to it. It can then use them without any round trip to the
              | server.
              | 
              | This can be achieved with a "hi" stored on the service, and
              | a batchSize that's constant in the system.
              | 
              | Each time a "hi" is requested, it's incremented by 1. Now
              | the client can generate ids in the range (hi * batchSize,
              | (hi+1) * batchSize - 1).
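              | 
              | In Postgres the "hi" part can just be a sequence; a sketch
              | with batchSize = 1000:
              | 
              |     CREATE SEQUENCE hi_seq;
              | 
              |     -- one round trip reserves ids
              |     -- hi*1000 .. hi*1000 + 999 for this client
              |     SELECT nextval('hi_seq') AS hi;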
        
           | megous wrote:
           | There's nothing that prevents you from fetching a batch of N
           | IDs from the server. Server just does +N on the sequence and
           | you can do whatever you like with the IDs client side.
           | 
           | You can also use one sequence for everything on the server,
           | and then you can also pre-create ID based relationships
           | client side.
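            | 
            | For example, one round trip that reserves 100 ids from an
            | ordinary sequence (made-up sequence name):
            | 
            |     SELECT nextval('items_id_seq')
            |     FROM generate_series(1, 100);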
        
           | inopinatus wrote:
           | Caveat programmer: this could be problematic, not in the
           | sense it doesn't work, but in the sense that someone working
           | on backend code may have a preconceived expectation that
           | UUIDs are also effectively a keyspace i.e. they're hard to
           | guess. The validity of that is already challenged by variants
           | defining temporal or logical order, and evaporates completely
           | if you let clients declare their own that you accept at face
           | value. Applications may have potentially guessable/gameable
           | object identifiers sloshing around inside as a consequence,
           | which is modestly ironic given that one benefit many folks
           | expect from adopting UUIDs in the first place is hardening up
           | the attack surface of trivially enumerable sequences.
           | 
           | There are a few mitigations but my favourite is the "casino
           | chips" approach: pregenerate them server side, and allocate
           | to clients on demand, including en masse if need be ("here
           | kid, have a few million UUIDs to get you started"). Verify
           | with whatever simple signature scheme comes with your
           | application server framework, or at small scale just toss
           | them in a crude LRU store.
           | 
           | Or, remember where the UUID came from, and apply their
           | organisational scope to any trust you place upon it. This
           | might work particularly for multi-tenanted SaaS. However it
           | requires that all usage is tenant-bounded end-through-end
           | throughout your application. This may be in conflict with a)
           | your framework, b) your zenlike contemplation of simplicity
           | in data management, or c) programmers in a hurry forgetting
           | to scope their queries properly.
           | 
           | Ultimately, relying on UUIDs as intrinsically unguessable
           | security tokens is probably not a great idea, but it's one
           | that remains thoroughly embedded in the programming
           | zeitgeist. As ever, nothing useful comes without a
           | compromise. Keep your eyes open to the systemic consequences
           | of design choices, and don't leave traps for your fellow
           | developers.
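            | 
            | The "casino chips" approach sketched as a pool table (names
            | made up): pregenerate ids and allocate them to clients on
            | demand.
            | 
            |     CREATE TABLE uuid_pool (
            |         id         uuid PRIMARY KEY,
            |         claimed_by text
            |     );
            | 
            |     INSERT INTO uuid_pool (id)
            |     SELECT gen_random_uuid()
            |     FROM generate_series(1, 100000);
            | 
            |     -- allocate one to a client
            |     UPDATE uuid_pool SET claimed_by = 'client-42'
            |     WHERE id = (SELECT id FROM uuid_pool
            |                 WHERE claimed_by IS NULL LIMIT 1)
            |     RETURNING id;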
        
             | NicoJuicy wrote:
             | He's not saying clients can create their own ids. The
             | applications can.
             | 
              | The concepts he's talking about are required for CQRS,
              | which is a popular pattern mostly applied with DDD or
              | microservices.
        
               | inopinatus wrote:
               | I getcha, but these days the ambit reach of "application"
               | extends to Javascript executing client-side in an
               | environment that's basically overrun with
               | lions/tigers/bears, and I'll suggest that's particularly
               | a consideration when the front-end is a SPA participating
               | in a CQRS/event-sourced overall application architecture.
        
               | lazide wrote:
                | There definitely are people out there in this thread
                | proposing that clients be able to provide UUIDs. I've
                | seen it elsewhere too.
               | 
               | I've also personally experienced UUID collisions due to
               | badly set up VM environments under Windows. It isn't a
               | good idea to blindly trust any value - and that includes
                | supposedly 'never collide' IDs like UUIDs.
               | 
               | For what it's worth, I also had the joy of debugging
               | someone's distributed hash table that was using md5 as
               | the hash bucket key (this was... 2 decades ago?) and had
               | no way to handle collisions because obviously that is
               | impossible.
        
             | kortex wrote:
             | Should you ever use a plain token (where you just check if
             | it exists in some authed_users table) vs, I dunno, some
             | sort of signed/HMAC type thing, where you have to call some
             | function on it? I genuinely don't know but I know enough to
             | generally leave authentication up to those that do know.
             | 
             | Maybe I'm just thinking of OAuth where there are multiple
             | hops involved?
        
               | inopinatus wrote:
               | The comparison to OAuth is quite reasonable. Perhaps the
               | most obvious parallel is the use of a state parameter
               | during the three-legged exchange, without which it's
               | exposed to a CSRF clickjacking attack.
        
         | Scarbutt wrote:
         | Careful with leaking sensitive information with semi-sequential
         | UUIDs though.
        
           | magicpointer wrote:
           | To alleviate the issue of having a sequential part, they make
           | it wrap around so that you cannot tell the order between two
           | UUIDs. It's already some protection, and the random part is
           | still large.
        
         | hermanradtke wrote:
         | I use a ulid[1] as a uuidv4 replacement:
         | 
         | https://github.com/ulid/spec
        
           | edoceo wrote:
            | I was hoping someone would mention these. It's really quite
            | nice: 128-bit, sortable, client- or server-generated, no
            | collisions (well, almost none: 80 bits are random per
            | millisecond).
        
           | j-pb wrote:
            | They almost got it right; a better implementation would
            | overflow regularly to make use of the entire key space, and
            | would counter-intuitively be more resistant to overflows.
            | 
            | Clocks aren't reliable enough for timestamps anyway, so
            | garbage collection is the only thing you kinda wanna rely on
            | them for.
            | 
            | A good sweet spot seems to be 32-bit milliseconds + 96 bits
            | of entropy. This overflows approximately every 50 days,
            | allowing for a 50-day rolling data retention.
        
           | fatsdomino001 wrote:
           | I was debating between using ULID vs just using the same
           | sequential bigint sharding strategy that Instagram
           | uses[1][2].
           | 
            | I ended up deciding to use sharded bigints because it enables
            | better indexing, even though there are drawbacks when first
            | writing the instances; the benefit of faster search was more
            | important for me.
           | 
           | [1]: https://instagram-engineering.com/sharding-ids-at-
           | instagram-...
           | 
           | [2]: http://www.livinginthepast.org/2016/02/24/sharding-into-
           | bigi...
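            | 
            | The id layout from the Instagram post, condensed: 41 bits of
            | milliseconds since a custom epoch, 13 bits of shard id, 10
            | bits of a per-shard sequence.
            | 
            |     CREATE SEQUENCE table_id_seq;
            | 
            |     CREATE FUNCTION next_id(OUT result bigint) AS $$
            |     DECLARE
            |         our_epoch  bigint := 1314220021721;
            |         seq_id     bigint;
            |         now_millis bigint;
            |         shard_id   int := 5;  -- fixed per logical shard
            |     BEGIN
            |         SELECT nextval('table_id_seq') % 1024 INTO seq_id;
            |         SELECT floor(extract(epoch FROM clock_timestamp())
            |                      * 1000) INTO now_millis;
            |         result := (now_millis - our_epoch) << 23;
            |         result := result | (shard_id << 10);
            |         result := result | (seq_id);
            |     END;
            |     $$ LANGUAGE plpgsql;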
        
           | aidos wrote:
           | They seem like what I'm looking for. I was looking at moving
           | to something more random because our current UUIDs (uuid1)
           | are too easy to mistake for one another at a glance.
        
       | rossmohax wrote:
        | Another alternative is ULID, which can be stored as a UUID on the
        | Postgres side but is more b-tree friendly.
        
       | zzzeek wrote:
       | > You are well advised to choose a primary key that is not only
       | unique, but also never changes during the lifetime of a table
       | row. This is because foreign key constraints typically reference
       | primary keys, and changing a primary key that is referenced
       | elsewhere causes trouble or unnecessary work.
       | 
        | In one sense I agree with the author that things are generally
        | just easier when you use surrogate primary keys; however, they
        | really should note here that the FOREIGN KEY constraint itself is
        | not a problem at all, as you can just use ON UPDATE CASCADE.
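        | 
        | For example (made-up tables), fixing a mistyped natural key then
        | rewrites the referencing rows automatically:
        | 
        |     CREATE TABLE citizens (
        |         ssn text PRIMARY KEY
        |     );
        | 
        |     CREATE TABLE tax_returns (
        |         id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        |         ssn text NOT NULL REFERENCES citizens ON UPDATE CASCADE
        |     );
        | 
        |     -- matching tax_returns.ssn values follow along
        |     UPDATE citizens SET ssn = '123-45-6789'
        |     WHERE ssn = '123-45-6780';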
        
       ___________________________________________________________________
       (page generated 2021-05-31 23:00 UTC)