[HN Gopher] PostgreSQL UUID vs. Serial vs. Identity
___________________________________________________________________
PostgreSQL UUID vs. Serial vs. Identity
Author : lhenk
Score : 133 points
Date : 2021-05-31 16:38 UTC (6 hours ago)
(HTM) web link (www.cybertec-postgresql.com)
(TXT) w3m dump (www.cybertec-postgresql.com)
| pmontra wrote:
| Meta: this company has written an impressive number of articles
| about PostgreSQL since 2013. List at https://www.cybertec-
| postgresql.com/en/tag/postgresql/
| aidos wrote:
| I just had to do a double take as I was reading a stack
| overflow post at the same time and recognised it as the same
| author.
| lhenk wrote:
| Laurenz (the author) was Postgres person of the week not too
| long ago: https://postgresql.life/post/laurenz_albe/
| lhenk wrote:
| Also, here's a list of blog posts from Laurenz Albe (the author
| of the OP post): https://www.cybertec-
| postgresql.com/en/author/cybertec_albe/ His blog posts are a
| great read, I'd recommend checking them out!
| barrkel wrote:
| Another point: if there's any temporal locality to your future
| access patterns - if you're more likely to access multiple rows
| which were inserted at roughly the same time - then allocating
| sequential identifiers brings those entries closer together in
| the primary key index.
|
| I used to work on a reconciliation system which inserted all its
| results into the database. Only the most recent results were
| heavily queried, with a long tail of occasional lookups into
| older results. We never had a problem with primary key indexes
| (though this was in MySQL, which uses a clustered index on the
| primary key for row storage, so it's an even bigger benefit); the
| MD5 column used for identifying repeating data, on the other
| hand, would blow out the cache on large customers' instances.
| 3pt14159 wrote:
| > Now, sometimes a table has a natural primary key, for example
| the social security number of a country's citizens.
|
| You know, you think that, but it's never that simple. The field
| gets added incorrectly and nobody notices until the value is in
| countless tables that you now need to update simultaneously. Or
| the value is supposed to be semi-secret, so now low-level
| support staff can't reference the row when dealing with a
| request. Or the table's requirements change and now you need to
| track two different kinds of data, or data that is missing the
| field.
|
| Me, I always just have the table make its own ID. It is just
| simpler, even when you _think_ it is overkill.
| vbsteven wrote:
| I'm currently prototyping a little database+api+cli todo app and
| I want identifiers that can be abbreviated in the same way as
| partial git commit hashes can be used on the command line. What
| should I use?
|
| I was thinking of generating random character strings and simply
| retry when the db throws duplicate key error on insert. No
| sharding is necessary and I'd like to have efficient foreign
| keys. Any thoughts?
| alexis2b wrote:
| You could try NanoID[0]? Seems available in many languages.
|
| [0] https://blog.bibekkakati.me/nanoid-alternative-to-uuid
| mutatio wrote:
| You could use a serial int and just hex-encode when interacting
| with the CLI? You could then use range queries to match short
| hashes by zeroing out the remaining bytes and using >=
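A minimal Python sketch of the range-query trick described above (the query shown in the comment is paraphrased, not a real schema): given a short hex prefix, compute the inclusive integer bounds it covers, then match with BETWEEN.

```python
def prefix_range(prefix: str, width: int = 16) -> tuple[int, int]:
    """Inclusive integer bounds covered by a short hex prefix.

    `width` is the full id length in hex digits (16 digits = 64-bit ids).
    """
    pad = width - len(prefix)
    low = int(prefix + "0" * pad, 16)   # remaining digits zeroed
    high = int(prefix + "f" * pad, 16)  # remaining digits maxed
    return low, high

# "a3" matches every 64-bit id whose hex form starts with "a3";
# the lookup then becomes: SELECT ... WHERE id BETWEEN %s AND %s
low, high = prefix_range("a3")
```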
| adav wrote:
| Check out linear congruential generators or other pseudorandom
| number generators. Then map the resulting number to letters.
| conradfr wrote:
| UUIDs are great when you use the id "publicly" but using an
| incremental value would be too revealing for different reasons.
|
| So it's good to know that performances are not bad.
| hu3 wrote:
| On the other hand, if you can get away with incremental ids it
| makes debugging much easier during development.
| kaliszad wrote:
| Not really. You should develop better tooling to visualize
| debugging information. Today's serious systems (which in my
| opinion include e.g. collaborative rich text editors) are just
| too complicated to eyeball. Pavel, a colleague of mine, is
| developing a new collaborative rich text editor for OrgPad, and
| here is how we do some testing currently:
| https://www.youtube.com/watch?v=VeVcNmNFzmc We use UUIDs for
| basically everything. It is simple, and we have good tools for
| working with UUIDs. When we present any IDs to users, those are
| URL-safe base64-encoded UUIDs. In some places, we have "short"
| links that are about half the length - they cannot really be
| guessed but are much shorter to type, copy-paste, or visually
| compare for our customers, who are not always super computer
| literate. :-)
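A sketch of the encoding described above, using only Python's standard library: a UUID's 16 raw bytes become 22 URL-safe base64 characters once the `==` padding is stripped, well under the 36-character hex form.

```python
import base64
import uuid

def short_uuid(u: uuid.UUID) -> str:
    # 16 raw bytes -> 22 URL-safe characters (strip the "==" padding)
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()

def parse_short_uuid(s: str) -> uuid.UUID:
    # re-add the padding that was stripped on encode
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))
```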
| throwawayboise wrote:
| You're not wrong, but (as I suspect is the case with a lot
| of us) the vast majority of my work is CRUD and I don't
| reach for heavyweight debugging tools unless printf() or
| the equivalent fails me (which is rare). Integer IDs work
| great in this situation.
| staticassertion wrote:
| I think ideally your primary key is whatever makes sense for
| your performance/data model, and then if you want to delegate
| authority with UUIDs you do that via a separate mapping.
|
| By separating that out you can get a lot:
|
| 1. You can extend your delegate system by modifying the
| delegate table, rather than having to muddy your data model
| with authority information
|
| 2. You can TTL the mappings in the delegate table
|
| 3. You can have many mappings to the same data, but each one
| can have its own restrictions.
|
| It's a bit more complex but you end up with a system that
| really hones in on the benefit that you're bringing up.
| tehlike wrote:
| There's another benefit to UUID - You can generate them
| anywhere including application side.
|
| Doing this on application side would have tremendous batching
| benefits or inserting objects with relationships at the same
| time (Vs waiting first insert to return an ID to be used in the
| FK).
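A sketch of that batching idea, with hypothetical orders/order-items rows: because the parent's UUID is minted client-side, parent and child rows can be built and sent together without waiting for an INSERT ... RETURNING round trip.

```python
import uuid

# mint the parent id locally; no round trip needed before building children
order_id = uuid.uuid4()

order_row = (order_id, "open")
item_rows = [
    (uuid.uuid4(), order_id, "widget"),  # (id, order_id FK, sku)
    (uuid.uuid4(), order_id, "gadget"),
]
# both can now go to the database in a single batched statement
```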
| dbbk wrote:
| I've stuck with incremental values internally but use Hashid to
| convert them when exposed publicly. Seems to work well.
| throwawayboise wrote:
| I don't think much of the argument that integer IDs reveal too
| much.
|
| Yes, they are guessable but your application should not rely
| solely on the "secrecy" of the ID to authorize access to a
| record. If you are worried about someone crawling your public
| API with wget or curl and an incrementing counter you should
| re-think whether your data are really public or not, or maybe
| rate-limit anonymous users, etc.
|
| They also reveal something about the total number of records in
| your database, I guess that could matter in some contexts but
| it's never really been an issue in practice for me.
|
| I have definitely used the technique of defining non-
| overlapping sequences in different shards (with Oracle, not
| Postgres, but they are very similar in this regard). It worked
| very well and was easy to reason about.
|
| As a developer, the big issue I have with UUIDs is that they
| are impossible to read or type. You have to copy/paste and it
| isn't easy to look at two UUIDs and tell if they are different.
|
| I use integers in general unless the specific use case demands
| something else.
| friend-monoid wrote:
| I had this issue in a case I think is interesting; a customer
| had a database with incremental IDs of a certain product they
| sold. On a web platform, the product owner in turn could log
| in and view a list of their products and their status. The id
| of the product was part of the URL; /product/851. Of course,
| the product owners could not get any information on IDs they
| didn't own, but the numbers gave away info on how many
| devices existed before them. And they wanted to hide that
| information.
|
| Of course, there are many ways to solve that situation, but
| using UUIDs is one.
| tgv wrote:
| So we run surveys among general and specialized audiences
| (among other things), and these surveys link to custom
| scripting, images, videos, etc. The URLs have to be freely
| accessible, but if they are sequential, anyone can simply try
| to guess what's in other surveys, potentially getting
| information about their competitors.
| sombremesa wrote:
| > Yes, they are guessable but your application should not
| rely solely on the "secrecy" of the ID to authorize access to
| a record
|
| _Any_ information you give to a potentially malicious actor
| can help them attack you. If you have a choice between
| leaking information and not leaking information, I can't
| imagine why you would ever intentionally go with the former,
| unless you didn't actually have a choice (feasibility, etc.).
|
| As an example, maybe I needed to CSRF someone but needed
| their ID (say, in a b2b app where IDs are not generally
| public) - with sequential IDs I have a decent chance of
| cracking it with a small number of requests, especially if
| it's a small userbase. Sure, the CSRF was the main issue, but
| this is a contrived example to illustrate the point.
|
| Admittedly, IDs are oftentimes public information by
| necessity - but there's no need to allow them to leak extra
| information.
| cratermoon wrote:
| Postgres (and other relational DBs) really need to implement
| something like snowflake[1] or ksuid[2]
|
| 1 https://blog.twitter.com/engineering/en_us/a/2010/announcing...
|
| 2 https://segment.com/blog/a-brief-history-of-the-uuid/
| staticassertion wrote:
| Another benefit of using sequential integers is that you can
| leverage a number of optimizations.
|
| For one thing you can represent a range of data more efficiently
| by just storing offsets. This means that instead of having to
| store a 'start' and 'end' at 8 + 8 bytes you can store something
| like 'start' and 'offset', where offset could be based on your
| window size, like 2 bytes.
|
| You can leverage those offsets in metadata too. For example, I
| could cache something like 'rows (N..N+Offset) all have field X
| set to null' or some such thing. Now I can query my cache for a
| given value and avoid the db lookup, but I can also store way
| more data in the cache since I can encode ranges. Obviously which
| things you cache are going to be data dependent.
|
| Sequential ints make great external indexes for this reason.
| Maybe I tombstone rows in big chunks to some other data store -
| again, I can just encode that as a range, and then given a lookup
| within that range I know to look in the other datastore. With a
| uuid approach I'd have to tombstone each row individually.
|
| These aren't universal optimizations but if you _can_ leverage
| them they can be significant.
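The range encoding described above can be sketched like this (a simplified illustration, not anyone's production scheme): sorted sequential ids collapse into (start, offset) pairs, so a run of any length costs two numbers.

```python
def encode_ranges(ids):
    """Collapse sorted integer ids into (start, offset) pairs."""
    out = []
    for i in sorted(ids):
        if out and i == out[-1][0] + out[-1][1] + 1:
            # extends the current run: just bump the offset
            out[-1] = (out[-1][0], out[-1][1] + 1)
        else:
            out.append((i, 0))  # start a new run
    return out

def covers(ranges, i):
    """True if id `i` falls inside any encoded range."""
    return any(start <= i <= start + off for start, off in ranges)
```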
| lgas wrote:
| Doesn't the offset approach run into trouble when sequence
| values get skipped due to rollbacks?
| staticassertion wrote:
| It's going to be an optimization that assumes some
| constraints on how you interact with your database.
| ainar-g wrote:
| I don't think I've ever seen this mentioned anywhere, but if you
| need a unique ID for an entity with not a lot of records planned
| (<=10,000,000), why not use a random int64 with a simple for loop
| on the application side to catch the occasional collisions? Are
| there any downsides besides making the application side a tiny
| bit more complex?
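A sketch of this idea in Python, with a set standing in for the database's unique index (a real version would catch the driver's duplicate-key error instead of KeyError):

```python
import secrets

def insert_with_random_id(insert, max_tries: int = 10) -> int:
    """Pick a random 63-bit id, retrying on duplicate-key failure."""
    for _ in range(max_tries):
        candidate = secrets.randbits(63)  # stays in BIGINT's positive range
        try:
            insert(candidate)
            return candidate
        except KeyError:  # collision: vanishingly rare at <= 10M rows
            continue
    raise RuntimeError("could not allocate a unique id")

# toy stand-in for a table with a unique index on id
seen: set[int] = set()
def fake_insert(i: int) -> None:
    if i in seen:
        raise KeyError(i)
    seen.add(i)
```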
| stouset wrote:
| If you're in this situation, it's not worth the time or effort
| to avoid UUIDs. You're adding complication for no gain
| whatsoever.
| staticassertion wrote:
| Is the goal here to save space?
| ainar-g wrote:
| The goal is to have a non-sequential ID (for example, to hide
| the information about the actual size of the data set) in a
| situation where one needs to support a storage that doesn't
| have a native support for UUIDs (that isn't just "convert to
| TEXT") and reuse as much of the schema as possible without
| [ab]using ORMs. For example, an application that has to
| support SQLite as well as PostgreSQL and maybe some other
| storages.
| dragonwriter wrote:
| > if you need a unique ID for an entity with not a lot of
| records planned (<=10,000,000), why not use a random int64 with
| a simple for loop on the application side to catch the
| occasional collisions?
|
| What's the use case for this where UUIDv4 or sequential ID
| isn't better? Because it sounds like a solution in search of a
| problem.
|
| > Are there any downsides besides making the application side a
| tiny bit more complex?
|
| Are there any upsides to warrant the complexity?
| nirvdrum wrote:
| I wouldn't say it's a particularly great upside, but I've
| found tooling doesn't tend to work as well with UUID keys as
| it does integer keys. E.g., Hasura won't map UUID keys to the
| GraphQL ID type [1], which makes working with them
| unnecessarily difficult. Arguably, the issue should be fixed
| with the tool (or a different tool chosen), but there's only
| so many battles to pick and sometimes that decision is out of
| your control.
|
| [1] -- https://github.com/hasura/graphql-engine/issues/3578
| kortex wrote:
| UUIDv4 has performance implications. But I agree, if you are
| already coupled to the DB (due to the check loop),
| _generally_ you might as well use sequential.
|
| Seems too easy to screw up.
| jeff-davis wrote:
| You can also do the loop on the server side using PL/pgSQL:
|
| https://www.postgresql.org/docs/current/plpgsql-control-stru...
| megous wrote:
| That's not that trivial. You can't just loop to get a unique
| ID. Maybe if you lock the whole table for reads first, which
| is quite drastic.
| NicoJuicy wrote:
| The point of guids/uuids is no collision... Ever... Including
| for large datasets. Even when you merge multiple together in 10
| years.
|
| For your use-case you could use incremental IDs.
| kstrauser wrote:
| That's the UUID approach, but worse. According to the birthday
| problem[1], you're 50% likely to get a collision among 64-bit
| numbers after about 5 billion insertions. That's not an awful
| lot. Replace that with a 128-bit UUID and you'd have to insert
| 22,000,000,000,000,000,000 rows to get a 50% chance. That's
| probably less likely than a cosmic ray flipping a random bit in
| RAM and corrupting the index that way.
|
| [1]
| https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...
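The figures above follow from the standard birthday approximation p ≈ 1 − e^(−n²/2N) with N = 2^bits; a quick check in Python:

```python
import math

def collision_probability(n: float, bits: int) -> float:
    """Birthday approximation: chance of any collision among n random ids."""
    return 1.0 - math.exp(-(n * n) / 2 ** (bits + 1))

p64 = collision_probability(5.1e9, 64)     # ~0.5 for 64-bit ids
p128 = collision_probability(2.2e19, 128)  # ~0.5 for 128-bit UUIDs
```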
| jeff-davis wrote:
| The post qualified it as <= 10,000,000 total records. For that
| number of records, there's a chance of about 0.00001 that you
| get a collision, assuming good randomness.
| kstrauser wrote:
| Sure, but stuff always grows, and the experiment gets run a
| bunch of times. Why not go with the built-in solution and
| then not have to worry about it?
| jeff-davis wrote:
| I'm just answering the poster's question directly; but in
| the general case, I agree with you. The cognitive
| overhead of dealing with the various "what ifs" usually
| aren't worth the couple bytes or cycles that you could
| save.
| deckard1 wrote:
| for what it's worth, YouTube still uses 11 character base64
| strings for their video ids, which are assumed to be 64-bit
| ints. They also allow unlisted videos, which people usually
| take to mean "semi-private".
|
| It's an interesting tradeoff. The UX of the smaller YouTube
| video id links is probably of some benefit to them. Plus they
| have private videos for when you really don't want your video
| to be viewed, with unlisted being the middle ground of easy
| sharing but also keeping it exclusive.
| kstrauser wrote:
| Sure, and it makes sense there: write a service that
| returns unique 64 bit ints and encapsulate the complexity
| inside that one location. That's easier than making every
| `insert` in your app code have to do a `while not unique`
| loop.
| oconnore wrote:
| Getting a collision with this approach doesn't matter -- the
| whole point is to loop if you do get a collision. The only
| issue is getting a long string of sequential collisions,
| which is highly unlikely.
| kstrauser wrote:
| But now you've traded code complexity for a few bytes of
| storage. That's just not worth it.
| sa46 wrote:
| 8 extra bytes per row and per foreign key reference
| relative to an int64 can add up quickly especially if the
| row is small. I agree it's not typically the right trade
| off but it's not as absolute as you claim.
| throwawayboise wrote:
| This is why we need 2TB drives now, when we used to get
| by with 2GB.
| leephillips wrote:
| I have a web service where I need to generate access tokens of
| five characters. I just generate a random one and check to see
| if it's already in use. Been working fine for years.
| [deleted]
| simonw wrote:
| Something I really like about integer incrementing IDs is that
| you can run ad-hoc "select * from table order by id desc limit
| 10" queries to see the most recently inserted elements. I end up
| doing this a lot when I'm trying to figure out how my
| applications are currently being used.
|
| Strictly incrementing UUIDs can offer the same benefit.
| topspin wrote:
| I just started a little side project and chose to use UUID for
| Postgresql keys. The schema is highly generic and I anticipate
| the possibility of merging instances. UUID precludes collisions
| in such a case.
| RedShift1 wrote:
| That includes foreign keys?
| magicpointer wrote:
| About UUID as Primary Key and performance, the following article
| has some insights and benchmarks as well:
| https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...
|
| Essentially, they observed sizeable performance improvements by
| using UUID generators that are tweaked to produce more
| sequential results, which leads to better indexes. The article
| compares sequences, random UUIDs, and two kinds of
| sequential-ish UUID generators.
| Matrixik wrote:
| There is also CUID:
|
| https://github.com/ericelliott/cuid
| tehlike wrote:
| Mentioned this in a sibling comment:
|
| There's another benefit to UUID - You can generate them
| anywhere including application side. Doing this on application
| side would have tremendous batching benefits or inserting
| objects with relationships at the same time (Vs waiting first
| insert to return an ID to be used in the FK).
| zepolen wrote:
| Another benefit to UUIDs is merging tables/databases without
| key conflict.
| jeltz wrote:
| You can just use PostgreSQL's writeable CTEs to get the same
| batching benefits plus the benefits from using serials. So,
| no, I do not think batching is a good reason for using UUIDs.
| Benjammer wrote:
| I think you can probably get the same "batching benefits" if
| you use a global ID generation service of some sort, with
| sequential IDs to improve indexing. Using sequential IDs
| doesn't necessitate using _auto-generated_ sequential IDs.
| tehlike wrote:
| Correct, but a global service needs local caching. Some "HiLo"
| style identity generation could work. A sequential UUID
| simplifies this because you don't need that service, so it's
| preferable.
|
| HiLo explanation below:
|
| A client simply gets a range of ids to be used, exclusive
| to them. Then can use it without any roundtrip to server.
|
| This can be achieved with a "hi" stored on service, and a
| batchSize that's constant in the system.
|
| Each time a "hi" is requested, it's incremented by 1. Now
| the client can generate (hi * batchsize, (hi+1) * batchsize
| - 1).
| megous wrote:
| There's nothing that prevents you from fetching a batch of N
| IDs from the server. Server just does +N on the sequence and
| you can do whatever you like with the IDs client side.
|
| You can also use one sequence for everything on the server,
| and then you can also pre-create ID based relationships
| client side.
| inopinatus wrote:
| Caveat programmer: this could be problematic, not in the
| sense it doesn't work, but in the sense that someone working
| on backend code may have a preconceived expectation that
| UUIDs are also effectively a keyspace i.e. they're hard to
| guess. The validity of that is already challenged by variants
| defining temporal or logical order, and evaporates completely
| if you let clients declare their own that you accept at face
| value. Applications may have potentially guessable/gameable
| object identifiers sloshing around inside as a consequence,
| which is modestly ironic given that one benefit many folks
| expect from adopting UUIDs in the first place is hardening up
| the attack surface of trivially enumerable sequences.
|
| There are a few mitigations but my favourite is the "casino
| chips" approach: pregenerate them server side, and allocate
| to clients on demand, including en masse if need be ("here
| kid, have a few million UUIDs to get you started"). Verify
| with whatever simple signature scheme comes with your
| application server framework, or at small scale just toss
| them in a crude LRU store.
|
| Or, remember where the UUID came from, and apply their
| organisational scope to any trust you place upon it. This
| might work particularly for multi-tenanted SaaS. However it
| requires that all usage is tenant-bounded end-through-end
| throughout your application. This may be in conflict with a)
| your framework, b) your zenlike contemplation of simplicity
| in data management, or c) programmers in a hurry forgetting
| to scope their queries properly.
|
| Ultimately, relying on UUIDs as intrinsically unguessable
| security tokens is probably not a great idea, but it's one
| that remains thoroughly embedded in the programming
| zeitgeist. As ever, nothing useful comes without a
| compromise. Keep your eyes open to the systemic consequences
| of design choices, and don't leave traps for your fellow
| developers.
| NicoJuicy wrote:
| He's not saying clients can create their own ids. The
| applications can.
|
| The concepts he's talking about are required for cqrs.
| Which is a popular concept applied with mostly DDD or
| microservices.
| inopinatus wrote:
| I getcha, but these days the ambit reach of "application"
| extends to Javascript executing client-side in an
| environment that's basically overrun with
| lions/tigers/bears, and I'll suggest that's particularly
| a consideration when the front-end is a SPA participating
| in a CQRS/event-sourced overall application architecture.
| lazide wrote:
| There definitely are people out there in this thread
| proposing clients be able provide UUIDs. I've seen it
| elsewhere too.
|
| I've also personally experienced UUID collisions due to
| badly set up VM environments under Windows. It isn't a
| good idea to blindly trust any value - and that includes
| supposedly 'never collide' id's like UUID.
|
| For what it's worth, I also had the joy of debugging
| someone's distributed hash table that was using md5 as
| the hash bucket key (this was... 2 decades ago?) and had
| no way to handle collisions because obviously that is
| impossible.
| kortex wrote:
| Should you ever use a plain token (where you just check if
| it exists in some authed_users table) vs, I dunno, some
| sort of signed/HMAC type thing, where you have to call some
| function on it? I genuinely don't know but I know enough to
| generally leave authentication up to those that do know.
|
| Maybe I'm just thinking of OAuth where there are multiple
| hops involved?
| inopinatus wrote:
| The comparison to OAuth is quite reasonable. Perhaps the
| most obvious parallel is the use of a state parameter
| during the three-legged exchange, without which it's
| exposed to a CSRF clickjacking attack.
| Scarbutt wrote:
| Careful with leaking sensitive information with semi-sequential
| UUIDs though.
| magicpointer wrote:
| To alleviate the issue of having a sequential part, they make
| it wrap around so that you cannot tell the order between two
| UUIDs. It's already some protection, and the random part is
| still large.
| hermanradtke wrote:
| I use a ulid[1] as a uuidv4 replacement:
|
| https://github.com/ulid/spec
| edoceo wrote:
| I was hoping someone would mention these. Its really quite
| nice. 128bit, sortable, client or server generated, no
| collision (well almost zero: 80bits are random per
| millisecond)
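A bare-bones illustration of that layout (not the full ULID spec, which also defines Crockford base32 encoding and per-millisecond monotonicity): 48 bits of millisecond timestamp followed by 80 random bits, so byte-wise ordering roughly follows creation time.

```python
import os
import time

def ulid_like() -> bytes:
    """128-bit id: 48-bit millisecond timestamp + 80 random bits."""
    ms = int(time.time() * 1000)
    return ms.to_bytes(6, "big") + os.urandom(10)

a = ulid_like()
b = ulid_like()
# because the timestamp leads, ids created later sort no earlier
# (modulo clock adjustments between calls)
```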
| j-pb wrote:
| They almost got it right, a better implementation would
| overflow regularly to make use of the entire key space, and
| counter untuitively more resistant to overflows.
|
| Clocks aren't reliable enough for timestamps anyways so
| garbage collection is the only thing you kinda wanna rely on
| them for.
|
| A good sweet spot seems to be, 32bit milliseconds + 96bit of
| entropy. This overflows appeoximately every 50 days, allowing
| for 50 day rolling data retention.
| fatsdomino001 wrote:
| I was debating between using ULID vs just using the same
| sequential bigint sharding strategy that Instagram
| uses[1][2].
|
| I ended up deciding to use sharded bigints because it enables
| better indexing, even though there are drawbacks when first
| writing the instances; The benefit of faster search was more
| important for me.
|
| [1]: https://instagram-engineering.com/sharding-ids-at-
| instagram-...
|
| [2]: http://www.livinginthepast.org/2016/02/24/sharding-into-
| bigi...
| aidos wrote:
| They seem like what I'm looking for. I was looking at moving
| to something more random because our current UUIDs (uuid1)
| are too easy to mistake for one another at a glance.
| rossmohax wrote:
| Another alternative is ULID, which can be stored as a UUID on
| the Postgres side but is more b-tree friendly.
| zzzeek wrote:
| > You are well advised to choose a primary key that is not only
| unique, but also never changes during the lifetime of a table
| row. This is because foreign key constraints typically reference
| primary keys, and changing a primary key that is referenced
| elsewhere causes trouble or unnecessary work.
|
| in one sense I agree with the author that things are generally
| just easier when you use surrogate primary keys, however they
| really should note here that the FOREIGN KEY constraint itself is
| not a problem at all as you can just use ON UPDATE CASCADE.
___________________________________________________________________
(page generated 2021-05-31 23:00 UTC)