[HN Gopher] Unexpected downsides of UUID keys in PostgreSQL
___________________________________________________________________
Unexpected downsides of UUID keys in PostgreSQL
Author : alexzeitler
Score : 227 points
Date : 2023-06-22 10:10 UTC (12 hours ago)
(HTM) web link (www.cybertec-postgresql.com)
(TXT) w3m dump (www.cybertec-postgresql.com)
| robertlagrant wrote:
| Is this a downside of UUID keys in general, or of using them as
| indexes? Would it be possible to create a primary index on a
| combination of inserted date and uuid and get opaque UUIDs _and_
| good indexing?
| exyi wrote:
| Yes, it's a downside of using them in b-tree indexes. However,
| the purpose of having an ID is usually to use it for lookup
| (from foreign keys, query string arguments, ...). If you create
| an index on two columns, you won't be able to look up rows using
| only the second value, at least not in Postgres.
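|
| For example, a sketch with made-up names: a two-column index can
| serve lookups on its leading column, but typically not on the
| second column alone:
|
|     CREATE TABLE records (id uuid NOT NULL, created_at timestamptz NOT NULL);
|     CREATE INDEX ON records (created_at, id);
|
|     -- Can use the index: the leading column is constrained.
|     SELECT * FROM records
|     WHERE created_at = '2023-06-22' AND id = '123e4567-e89b-12d3-a456-426614174000';
|
|     -- Typically a full scan: only the second column is constrained.
|     SELECT * FROM records
|     WHERE id = '123e4567-e89b-12d3-a456-426614174000';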
| robertlagrant wrote:
| That's interesting - I don't know enough about this! If it's
| lookup from a foreign key, won't it just go straight to the
| record? What's the index for there?
| bandrami wrote:
| The alternative is having two required-unique columns only one
| of which is definitionally unique. It solves some problems and
| introduces others.
| rjh29 wrote:
| You can use UUIDv7 or ULID for that. One value with both the
| timestamp and random data inside.
| bandrami wrote:
| Sorry, right; I meant that's the collation alternative to a
| multi-field record like v7.
| MPSimmons wrote:
| I don't find myself dealing with databases in any depth very
| often, so this was news to me, and I thought it was well written,
| for what it's worth. Thanks for posting.
| zero_iq wrote:
| Article should be called "Totally Expected Downsides..."
|
| If you want temporal locality, use ULIDs instead.
| rjh29 wrote:
| I was going to use UUID with the time portion at the start
| (also known as UUIDv7) but this looks better.
| gmhafiz wrote:
| One potential downside is that ULID does not have an RFC,
| unlike UUIDv7.
| samokhvalov wrote:
| Hopefully, it will soon have it:
| https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc...
| OJFord wrote:
| It's only mentioned there in 'motivation', a study of
| trends?
| heipei wrote:
| I've been thinking about converting a legacy application
| which uses UUIDv4 to use ULIDs instead going forward, but
| then to represent these ULIDs in a format that is compliant
| with a UUIDv4. I have not thought through the possible
| downsides, but I think it should be a pretty straightforward
| change. Of course old records will remain true random
| UUIDv4s, but at least new records will be time-order and as
| such will create less stress on the B-Tree index when writing
| them.
| ilyt wrote:
| How is any of that "unexpected"? What did the author expect?
| daneel_w wrote:
| Nothing unexpected about it at all. PostgreSQL's default index in
| this case is a B-tree, and B-trees don't index disorderly data
| very well, which is the nature of all UUID versions.
|
| On the topic of how important it is for B-trees to be "orderly":
| https://news.ycombinator.com/item?id=34404641
|
| PostgreSQL can do (among other types) 32-bit hash indices which
| work out better for certain use cases. Personally I would avoid
| B-trees for any UUID unless I really had to do partial or ranged
| scans on them.
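|
| For equality-only lookups, that might look like (a sketch,
| table/column names made up):
|
|     -- No ordering or range support, but fine for WHERE id = ?
|     CREATE INDEX records_id_hash ON records USING hash (id);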
| marginalia_nu wrote:
| > On the topic of how important it is for B-trees to be
| "orderly": https://news.ycombinator.com/item?id=34404641
|
| I don't think this is the same phenomenon at all, if anything
| that looks like an implementation problem rather than something
| that's somehow inherent in B-trees.
|
| Depending on how you construct the tree, it's however still
| possible to end up with something that's very fragmented and
| inefficient, but you can always construct a dense b-tree with a
| space complexity of O(N/(B-1)). To see why you can just lay out
| the data in a list and manually create the index layers. Each
| layer will be bounded by N/B, N/B^2, N/B^3, ... (and the sum of
| 1/B^n over all n >= 1 is 1/(B-1)).
|
| Considering an actual tree, for N=10, B=3:
|
|     1 2 3 4 5 6 7 8 9 10    data
|     <3 | <6 | <9 | <10      index 1
|     <6 | <10                index 2
|     <10                     root
|
| Looking at this, I think it should be apparent that if you
| change this ordered sequence 1-10 to ten UUIDs ordered in the
| same way, there would be no change in the structure of the
| tree.
|
| It would be a bit bigger on disk because UUIDs are bigger than
| integers, and you may end up with an additional data layer
| because of that (because you want to align with the disk block
| size).
|
| One reason you might see a difference even when the
| implementation makes the appropriate assumptions is that
| B-trees can do very clever things with somewhat sequential data
| when it comes to joins, where you can get nearly linear
| runtimes for the operation. But as mentioned, this only works
| with relatively ordered data.
| croes wrote:
| Can't you simply add an additional timestamp column for
| time-based sorting? No need to raise the collision probability.
| plagiarist wrote:
| If size isn't an issue it seems so. I know of one
| implementation that uses wall clock to get a "close enough"
| sorting https://github.com/segmentio/ksuid
| damethos wrote:
| Why not just use auto-increment integers and expose them with
| something like this: https://www.npmjs.com/package/hashids (The
| linked library is just an example). Yes, you will need to
| transform the IDs before returning them to an external caller
| which is a little bit annoying, but it's probably not that big
| of a deal.
|
| Also, another approach I've seen, which I personally find a bit
| complicated, is to use auto-increment primary keys for the
| internal system and UUIDs for the public-facing interactions.
| insanitybit wrote:
| Another (potentially!) significant problem with uuids is that
| they're inefficient to store. Compared to an 8 byte value
| obviously they double the storage size. But it's actually worse -
| if you have loosely ordered integers, even with gaps, you can
| compress those down to even less, like practically 1 byte on
| average. That makes uuids ~16x worse for disk storage, ~2x worse
| for memory storage (cache).
|
| That, plus losing locality, can be, for some workloads, a
| significant loss. Where UUIDs reign supreme for performance is
| in terms of generation - if you have a super high write load and
| really tight latency requirements it may not be viable to have a
| single integer counter.
|
| Although, in my own testing, I've found Postgres is more than
| capable of hundreds of thousands of increments per second and if
| you're willing to allow for gaps/interleaving in your counter
| (slight hit to locality) you're really unlikely to hit a
| bottleneck.
|
| Sequential UUIDs help and are desirable, but they still can't
| compete on storage :)
| fabian2k wrote:
| I'm not aware of any situation where Postgres (without
| extensions) would compress ints that way? That is more of a
| column-oriented or time-series DB thing to me.
|
| Are there any cases in Postgres where this actually plays a
| role and sequential ints get compressed?
| insanitybit wrote:
| Sorry, let me clarify. I used Postgres to generate the
| counters as external IDs into other data stores (so instead
| of `give me a new uuid` I said `hey postgres give me a new
| counter`). Postgres was only used to maintain the counters
| themselves in a tiny database where compression isn't
| relevant anyway.
|
| The other data stores where I used this were a combination of
| Parquet on S3 (where you get compression) and ScyllaDB (where
| you also get compression).
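|
| A minimal sketch of that pattern (sequence name made up):
|
|     CREATE SEQUENCE external_ids CACHE 100;
|
|     -- each caller gets a unique, loosely ordered 64-bit value
|     SELECT nextval('external_ids');
|
| The CACHE clause lets each backend grab a block of values, which
| introduces the gaps/interleaving mentioned upthread but avoids a
| single-counter bottleneck.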
| insanitybit wrote:
| Oh yeah, this also makes pagination trivial. You can basically
| just give your clients a literal number indicating where they
| are. So good, you basically get paging APIs for free.
| osigurdson wrote:
| On the one hand, a serial-based id like
| https://example.com/user/items/55 seems less secure than a
| UUID-based approach, as if all else fails the UUID is still hard
| to guess. On the other, some systems (GitHub private repos, for
| example) don't seem to worry about how easily a URL can be
| guessed. Is it perhaps the case that, in the final analysis, the
| probability of a successful attack is not materially reduced by
| a hard-to-guess URL?
| moi2388 wrote:
| int key for internal use, guid for external purposes
| seanhunter wrote:
| I'll genuinely never understand why people benchmark in this way.
| They are doing a thing that is slow and inefficient in a
| fundamental way and then producing a blogpost about it being slow
| and inefficient in a much more trivial respect. If you care about
| performance,
|
|     select count(uuid) from records;
|
| makes absolutely no sense. You know the uuid is unique, so you are
| deliberately selecting a value and then throwing it away just to
| count it.
|
|     select count(1) from records;
|
| would give the same value and could be answered from the index
| without requiring a scan of the table.
|
| This is true of any unique column in the table no matter what
| type (it doesn't have to be a uuid). The fact that uuids are a
| bit slower than other types of keys when you scan the entire
| table unnecessarily seems beside the point.
| insanitybit wrote:
| I can't imagine that the author is saying "don't count uuids
| this way!" but instead is trying to explain how locality works
| in btree indices, using a trivial example to demonstrate it.
| seanhunter wrote:
| I should add for people who are not aware, the same concept is
| very important to bear in mind if there is a WHERE EXISTS or
| WHERE NOT EXISTS subclause to your query. Something like
|
|     select u.username as good_user
|     from users u
|     where not exists (
|         select 1 from naughty_users n where n.id = u.id
|     )
|
| ...is generally going to be much, much faster than what a lot of
| people instinctively do, which is to select the username or user
| id from the inner query. Since you're just checking for
| (non)existence, it doesn't matter what the inner query returns,
| so select 1 will answer the query from the index (assuming the
| id is indexed, which if it isn't you have bigger problems).
|
| If you spend a while looking at explain plans for your queries,
| you will spot common patterns like this where you can often
| avoid a table scan.
| neurostimulant wrote:
| CouchDB uses sequential UUID by default:
| https://docs.couchdb.org/en/stable/config/misc.html#uuids-co...
|
| > Impact of UUID choices: the choice of UUID has a significant
| impact on the layout of the B-tree, prior to compaction.
|
| > For example, using a sequential UUID algorithm while uploading
| a large batch of documents will avoid the need to rewrite many
| intermediate B-tree nodes. A random UUID algorithm may require
| rewriting intermediate nodes on a regular basis, resulting in
| significantly decreased throughput and wasted disk space due
| to the append-only B-tree design.
|
| > It is generally recommended to set your own UUIDs, or use the
| sequential algorithm unless you have a specific need and take
| into account the likely need for compaction to re-balance the
| B-tree and reclaim wasted space.
| hn_throwaway_99 wrote:
| The new/upcoming UUID v7 really solves pretty much all of these
| problems:
|
| 1. They're ordered by a timestamp, so they preserve DB locality
|
| 2. They include a total of 74 bits of random data. This isn't
| enough to be used as unguessable keys, but it _does_ offer some
| good protection if you have other bugs that could otherwise lead
| to IDOR vulnerabilities.
|
| 3. They're still 16 bytes, but IMO any minor hit to storage/time
| is completely worth it for the benefits they provide, especially
| since the other option is usually to have an integer primary key
| AND another "public ID" column that stores a UUID.
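|
| Postgres has no built-in generator for v7 yet, but a v7-style
| value can be sketched in plain SQL - a 48-bit millisecond
| timestamp overlaid on a random UUID (illustrative only, adapted
| from a common community recipe):
|
|     CREATE FUNCTION uuid_generate_v7() RETURNS uuid AS $$
|       SELECT encode(
|         set_bit(
|           set_bit(
|             -- first 6 bytes := Unix epoch time in milliseconds
|             overlay(uuid_send(gen_random_uuid()) placing
|               substring(int8send((extract(epoch from clock_timestamp()) * 1000)::bigint) from 3)
|               from 1 for 6),
|             52, 1),  -- flip the version nibble from 4 (0100)...
|           53, 1),    -- ...to 7 (0111)
|         'hex')::uuid
|     $$ LANGUAGE sql VOLATILE;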
| [deleted]
| [deleted]
| VWWHFSfQ wrote:
| My go-to pattern for many years now is to use a plain bigint
| autoincrement column for internal database relations and then a
| uuid for application-level identifiers and natural keys.
| Basically never use the uuid as the actual primary key, because
| they're enormous and now the DBMS has to copy that gigantic
| number to every side of the relation. Don't do it.
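|
| A minimal sketch of that pattern (names made up):
|
|     CREATE TABLE users (
|         id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
|         public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,
|         name      text NOT NULL
|     );
|
| Foreign keys reference the 8-byte id; only the application/API
| layer ever sees public_id.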
| sumtechguy wrote:
| This is a good answer. You want your primary/cluster key to
| usually be something that makes sense in your data. If you use
| something like UUID, the placement can be basically random, which
| is usually not a desirable trait in SQL. If you use it for a
| primary key you are basically saying you are fine with a memory
| cmp vs a register compare for finding the right data. That
| can be undesirable too. UUIDs do have very desirable traits
| such as uniqueness. But that does come at a cost, and usually
| you can fix it by deciding which process is the source of
| truth in a different way.
| zzzeek wrote:
| Same, I'm surprised nobody is mentioning this. All the foreign
| key columns become uuids also and the DB becomes miserable to
| deal with.
| londons_explore wrote:
| Note that if you use something like spanner, you _never want data
| locality_ , because it will prevent your data being spread
| effectively between nodes.
|
| So the IDs of your top-level entities should be something
| random.
| samokhvalov wrote:
| We discussed some of these problems here during a Postgres.tv
| session, and developed a patch to support ULID / UUIDv7:
| https://www.youtube.com/watch?v=YPq_hiOE-N8
|
| And then Andrey proposed a patch in the pgsql-hackers mailing
| list: https://www.postgresql.org/message-id/flat/CAAhFRxitJv%3DyoG...,
| https://commitfest.postgresql.org/43/4388/
|
| Everyone who can help (test, discuss, etc.) - please participate
| in that discussion.
|
| The standard is not finalized yet, but there are some
| expectations that it will be; if that happens, it would be great
| to have this in a future Postgres 17.
| coding123 wrote:
| Use flakeids
| penciltwirler wrote:
| Have a look at ULIDs
| pachico wrote:
| How can this be unexpected? Isn't this what Percona discussed for
| MySQL already several years ago?
|
| Or am I missing something?
| Tehnix wrote:
| The number of people with actual experience scaling databases
| is _very_ small compared to the number of e.g. Backend
| Engineers who end up implementing the solutions.
|
| I think a root cause is that there's almost never a case for
| preferring UUIDv4 over ULID/UUIDv7, but we still see almost all
| learning material reach for the former. So people only find out
| once they have a table that exhibits weird/low performance and
| suddenly need to google for specific performance problems
| they've never encountered before.
|
| Similarly, efficiently using indexes is also a very common
| thing I see people not know much about, since with SQL
| "everything performs well until it doesn't" (i.e. you reach
| scale) :)
| indymike wrote:
| > How can this be unexpected?
|
| If you routinely work with junior developers this will come up
| because UUID4 seems miraculous at first. There's always a bit
| of surprise when the question gets asked about why we have so
| many sequential keys when we could just use UUID4...
| bguebert wrote:
| They suggest using UUIDv7 at the end for more sequential
| keys. Maybe that's the new part, since v7 is relatively recent.
| mattmanser wrote:
| MSSQL has had sequential GUIDs for over a decade to solve
| this very problem, surprised it wasn't done sooner.
| evntdrvn wrote:
| their format is pretty funky though lol
| frodowtf wrote:
| With some basic knowledge about how a db stores data, there is
| nothing really "unexpected" about this behavior. It's weird
| though how little devs actually know or care about it.
| JoeAltmaier wrote:
| The correlation of locality and sequential keys was a
| coincidence. It's unfortunate that folks rely on this coincidence
| for performance. Databases can be made with better algorithms now
| that UUIDs are common. It would be a mistake to enshrine
| sequential indexes as some special optimization, just to be blown
| away later when better algorithms get used.
| aarondf wrote:
| I did a video covering choosing a primary key type for MySQL [1],
| including UUIDs.
|
| Long story short: you should probably use a bigint unless you
| have a pretty good reason not to. Good reasons exist! If you do
| use a UUID, make sure you use a version that is time-sorted.
|
| 1: https://planetscale.com/courses/mysql-for-developers/indexes...
| xmdx wrote:
| That's what created_at and updated_at are for surely
| sam_lowry_ wrote:
| From the ActiveRecord fame ;-)
| yesimahuman wrote:
| I was going to ask, when would you only rely on a sequential id
| for temporal locality, wouldn't most people be indexing and
| sorting on `created_at` or something equivalent?
| Tehnix wrote:
| You don't need temporal elements; the simple fact
| that UUIDs are a bad fit for B-trees means that simply
| inserting a bunch of data that relies on a UUIDv4 for its
| unique ID will also run into these same problems.
|
| You'd typically see this as either lower performance of a
| table than you expect, or higher IOPS usage of your database
| (which gets expensive at scale).
| gregwebs wrote:
| UUID is also used to avoid leaking information about the
| underlying system. This includes temporal information that could
| be used to infer the size of the dataset for all users.
|
| If this isn't a concern, then using a timestamp based approach as
| recommended in this article is a good approach. That is the
| default in MongoDB.
|
| If it is a concern, one approach is to use random UUIDs to give
| to end users but then internally to have something like auto
| increment ids.
| gmhafiz wrote:
| Expanding on temporal information leak
|
| 1. Size: If a client receives a record with id=10004578, they
| can guess that 4578 orders have been made (assuming the sequence
| started at 10000000).
|
| 2. Rate of growth: Comparing the IDs of two orders received at
| different times reveals the rate of record insertion.
|
| And also
|
| Iteration attack: If your API endpoints do not have
| authorization, an attacker can try to access with GET
| /api/users/1, GET /api/users/2, GET /api/users/3, etc. UUID
| makes this next to impossible.
| WeAddValue wrote:
| Or use hashids.org to obfuscate your integer auto-increment
| ids anywhere a user might see them.
| VWWHFSfQ wrote:
| It's a good option purely for cosmetics but don't rely on it
| for any kind of serious obscurity since it's trivially
| reversed. I've used it to great effect in the past to encode
| multiple integer values like start/end ID, sort order (0/1),
| etc. for cursor-based pagination. But that's only because
| there's nothing secret in those numbers. Just purely for
| convenience.
| willvarfar wrote:
| I once heard of some fancy data science that inferred that
| the company was the supplier's major customer from the gaps
| in the sequential ids they saw when using that supplier's
| API. That was useful in negotiations.
|
| And so the company that did that data science realised they
| too were susceptible to exactly the same 'attack'. So they
| created a system to obscure the ids they were themselves
| exposing to their customers, using some cheap cut-down 64-bit
| TEA encryption, IIRC. My memory is it never went live, though.
| theluketaylor wrote:
| Not that fancy (or at least not that novel). In WWII the
| allies used sequential serial numbers on destroyed and
| captured tanks to have a frighteningly accurate picture
| of German industrial output.
|
| https://en.wikipedia.org/wiki/German_tank_problem
| code_biologist wrote:
| Salting your hash should work?
|
| You could also use a 32 or 64 bit block cipher like skip32
| if you want to prevent reversal. Or at least, it makes
| reversing non-trivial.
| VWWHFSfQ wrote:
| This is a common misconception. hashids is _not_ an
| encryption algorithm. It's just an encoding with a tiny
| bit of obscurity layered in. hashids.org:
|
| > Do not encode sensitive data. This includes sensitive
| integers, like numeric passwords or PIN numbers. This is
| not a true encryption algorithm. There are people that
| dedicate their lives to cryptography and there are plenty
| of more appropriate algorithms: bcrypt, md5, aes, sha1,
| blowfish.
| babbledabbler wrote:
| I actually have been using a combination of a numeric integer
| ID as the PK and a UUID as a lookup field to do routing lookups
| etc. for this purpose in a Postgres-backed app I'm working on.
|
| I found this approach to be more trouble than it's worth and
| plan on switching to a UUID PK key and doing away with the
| integer sequence.
|
| Here are the complications I ran into:
|
| The libraries I'm using for the ORM and API are designed to
| work with a primary key for single-record get access. For
| example, they want you to do `resource.get(ID)` where ID is the
| primary key; however, I now have to do `resource.find({ where:
| { uuid: 'myuuid' }})`. This is for all resources on all pages.
|
| In Postgres, the integer PK sequence has a state that keeps
| track of what number it's at. In certain circumstances it can
| get out of sequence and this can trip up migrations.
|
| We already have created_at and updated_at fields and these are
| probably better for ordering than the sequencing.
|
| If I had to do it again, I would just use UUIDv4 until it ran
| into issues, and add either date fields or a sequence where
| necessary. If anyone has better ideas I would be most grateful,
| as this is something I go back and forth on.
| bavell wrote:
| As someone who has been on the fence about this for awhile
| now, thank you for your valuable anecdote!
| moonchrome wrote:
| That sounds like a very simplistic framework, and I'm sure you
| could do some metaprogramming to abstract the boilerplate. For
| instance, you couldn't do database multitenancy with those
| constraints.
|
| I've used the int keys and UUID public keys on multiple
| projects - it wasn't an issue for EF core or RoR
| babbledabbler wrote:
| I'm simplifying a bit for brevity and we can do some
| abstraction to handle it so it's not that the framework is
| simplistic. I'm just having trouble justifying adding this
| complexity with two types of IDs.
| gregwebs wrote:
| I personally find the numeric id extremely valuable for
| internal data analysis and sharing. I can refer to rows by
| the numeric id, including a range of rows, and seeing the ids
| gives exactly that intuitive information about their relation
| within the set that we are hiding from end users. Numeric ids
| can also be used for the same reason in an admin-only UI.
|
| On the efficiency side, joining and querying by id is
| generally more efficient on CPU usage for querying, but you
| do have to pay the cost of having the additional column and
| index.
| babbledabbler wrote:
| This was generally the reason I went with numeric ID as PK
| originally. It makes working with and analyzing the data as
| well as cross referencing relations easier.
|
| For all my tables I have a base schema that looks something
| like this:
|
|     id: integer sequence PK
|     uuid: uuidV4
|     created_at: datetime
|     updated_at: datetime
|
| The concern I have is when I have to distribute my system
| when scaling. Those numeric IDs will have to be replaced
| with the UUIDs so I figure I might as well do it now.
| gregwebs wrote:
| Everything breaks at scale. In my experience most tables
| don't end up with more than a few million rows and will
| work fine with this. If you did want to transition a
| large table to be UUID only, the nice thing about this
| approach is that you could do it with no down time. If
| you are using a DB that only scales writes vertically
| though (most DBs, including distributed DBs) then how are
| you actually going to scale the DB layer horizontally?
| Pretty much just CRDB (PG) or TiDB (MySQL) are the
| options there - look at their docs for how to set up your
| ids.
| babbledabbler wrote:
| I'm not so much concerned with figuring out scaling in
| terms of volume as I expect to be able to handle millions
| of rows in a single DB and that would be an
| implementation detail and fine tuning. I'm more concerned
| about scaling in terms of complexity and keeping the
| system easy to reason about when more people, tech are
| involved.
|
| Let's say I have a <CAR>-[1:N]-<TRIP> relation in two tables in a
| relational DB. This works fine at first even for millions
| of rows as you say.
|
| At some point in the future it makes sense to have these
| two entities managed by different teams/services/DBs. Let's
| say TRIP becomes a whole feature-laden thing with fares,
| hotels, itineraries, dates. So I need to take this local
| relation and move it to different services and different
| DB.
|
| If I had been using an integer PK/FK this would be a more
| complicated migration than if I used UUIDs.
|
| My assumption is that we would not want to have a
| sequenced integer key used in a distributed system.
|
| In other words, it seems a safer bet, if there's a
| possibility of needing to move to a distributed system, to
| use a UUID for the key from the beginning.
| gregwebs wrote:
| I think switching this with zero downtime to do foreign
| key references with UUIDs will be easier than any of the
| pain you would deal with from having to do cross-DB
| joins.
|
| What specific issues are you worried about with the
| integer key? Usually the issue is dumping data into
| something like a staging or development environment
| rather than a production concern. If you attempt to dump
| 2 datasets into one db you will have a conflict. Or if
| you write to an environment and then dump into it, you
| will have a conflict.
| babbledabbler wrote:
| Mainly portability of data and options for the future.
| I'm all on one postgres instance right now and don't plan
| on breaking it up until necessary. If at some point I
| need to take a table and move it to another type of
| database I want that migration process to be
| straightforward. If I have integer keys with sequencing
| behavior I anticipate having to do that porting. That
| internal key would then become external to do the lookup
| and if it's an external key I want it to be UUID for
| security as well. Integers as IDs are guessable so I want
| to keep them internal.
| damethos wrote:
| > Had I to do it again, I would just use UUIDv4 until it runs
| into issues and either date fields or a sequence where
| necessary
|
| How soon is soon? And how would you handle it if you already
| had all PKs as UUIDs?
| babbledabbler wrote:
| Probably when there is some downtime or as the opportunity
| presents.
|
| I would use a date field to do ordering or a sequenced
| field when needed.
| 3pt14159 wrote:
| Well, I'll give another take.
|
| If you do not have two separate forms of identifier AND you
| have a "public" API (including basically any client apps,
| frontend JS, or anything found in query params) then you are
| making compliance with European regulators a massive headache
| when it comes to erasure of PII, since shared identifiers
| must be destroyed one way or another. Trying to merely delete
| the records is complicated by the fact that you need legal
| holds to comply with various federal laws.
|
| Between the work involved in using two identifiers, one for
| joins and one for external lookups, versus the work involved
| in manually coding up all sorts of erasure work arounds,
| something I was in charge of in the past, I would strongly
| consider just using two IDs.
|
| If your ORM gets in the way, just modify the ORM. This is
| easier than you'd think. For example, in Django just make a
| helpers module with something like this:
|
|     class OurModel(Model):
|         def get(...):
|             ...
|
| And have some sort of programmatic way (lint, etc.) of ensuring
| that your models.py doesn't use the stock class. It's
| simpler. It ruffles some feathers at first, but if your
| framework is getting in the way of a real use case, just
| change the framework and don't worry about it.
| babbledabbler wrote:
| I think you raise some good questions around IDs and PII
| and we definitely will be tackling GDPR sooner or later.
|
| I don't quite follow the European regulation issues
| raised by using a UUID in a route when that UUID is the PK
| of the record.
|
| I know you should not expose PII or any information that
| can be used to identify a person, however, in our case any
| route is behind an authed login on an SSL connection which
| encrypts the path (we don't use query params).
|
| The only place that contains data that ties a UUID to a
| person is in the database. This would be the case whether
| we used a PK as an integer or not.
|
| Could you elaborate or share any resources around dual IDs
| DB design for PII compliance? That would be super helpful.
|
| Regarding framework hacking or workarounds, I have a
| principle to not go against the grain of a framework. The
| reason for this is that modifying/hacking adds complexity
| when building on top of it or onboarding other software
| engineers. If necessary I'll do it as a last resort.
| 3pt14159 wrote:
| > any route is behind an authed login on an SSL
| connection which encrypts the path
|
| If your application only services a single user with
| their own resources then you have nothing to fear. But
| few applications meet this definition. If, for example,
| you're running an invoicing application, then at some
| point you'll want to share some resource, say an invoice
| or an expense or a time sheet, with another party. If
| your API exposes the identifiers from one resource to
| another, or even a user's id when potentially adding them
| to a team, then these identifiers are considered PII
| according to European regulators.
|
| I understand that this is frustrating, but it comes from
| a posture that prioritizes right-to-be-forgotten over
| programmer ergonomics. Imagine, for example, API crawlers
| that hit your /search endpoint with email=[some
| predetermined list of emails] and harvest user ids to
| match with future data.
|
| In the end, the best thing you can do is keep join keys
| internal and API keys separated. There are other
| workarounds, but they're so much trouble that they aren't
| really viable alternatives. Now, whether you use UUIDs
| for both identifiers or UUID for external and integer ids
| for join keys is up to you and your performance and
| scaling requirements. Personally, I prefer integer keys
| for internal unless I really expect the database to grow
| to more than 200m rows before the company hits 1000
| people, since int ids mean you do not need secondary
| indexes on things like the created_at fields, but even
| there, it's not such a big deal to have an extra index on
| every table.
|
| > I have a principle to not go against the grain of a
| > framework.
| > hacking adds complexity
|
| Here we essentially agree, but with the right integration
| tests, upgrading and onboarding is a lot easier than
| feared. That said, do not add to the framework unless the
| benefit is worth it.
| jandrewrogers wrote:
| Organizations that care about leaking information encrypt their
| identifiers as UUIDs. UUIDs conveniently have the same size as
| a single AES block, for which there is dedicated silicon on all
| modern CPUs. There is almost no overhead.
|
| Every other issue with random UUIDs etc., which is ignored
| here, is solved by encrypting your identifier. Random UUIDs
| (i.e. UUIDv4) are banned in many places for good reasons.
| hbrn wrote:
| Key type is not the right layer for solving information leak
| (and incorrect joins, mentioned above).
|
| Use hashids to avoid the leak, and as a bonus your
| client-facing keys will be short and easily copy-pastable.
| jmull wrote:
| In my experience internal auto-increment ids and external
| random UUID works well.
| CGamesPlay wrote:
| How could temporal information be used to infer the size of the
| dataset for all users? Specifically, more accurately than "I
| now know that this database was created in 2022".
| iwwr wrote:
| Sequential serial numbers reveal information. See:
| https://en.m.wikipedia.org/wiki/German_tank_problem
| tgv wrote:
| I use uuids in links. Using sequential ids would give people
| access to information they shouldn't see. Unfortunately, the
| link has to be publicly accessible. UUIDs then become a very
| practical key for your records.
|
| That said, the number of times those records are accessed is
| low, so there are no performance considerations.
| jongjong wrote:
| I can't think of many use cases where I would sacrifice the
| beauty and elegance of UUIDs to optimize access times by a
| millisecond or two. UUID is totally worth the cost.
|
| UUID actually performs much better than I thought based on the
| author's example with a COUNT query. COUNT queries aren't very
| efficient because typically, all records are traversed; here
| we're talking about a 50% slowdown on 10 million records... How
| often do you need to access 10 million contiguous records? Most
| of the time, for user-facing apps, you'll be accessing 100
| contiguous records at most and, in fact, most queries will access
| a single record... I suspect that the average per-query
| performance loss for a typical app is probably less than 5%.
| Also, you could simply index by a separate date/timestamp field
| if you need records to be ordered by time and probably won't
| incur any performance cost.
|
| IMO, unless you're building an app for high-frequency trading,
| auto-incrementing IDs aren't worth the pain and lack of
| flexibility.
| danpalmer wrote:
| > I can't think of many use cases where I would sacrifice the
| beauty and elegance of UUIDs...
|
| I find auto-incrementing IDs far more elegant in many ways.
| They are much easier to make sense of for users, they give you
| some loose metadata (ordering, sometimes a rough time range)
| which can be handy in debugging stuff.
|
| At my previous workplace most things were auto-incrementing
| IDs, and while we got bitten by them a few times, there was
| significant debugging value in seeing a user ID as an integer.
| Internally, order IDs and customer support ticket IDs were also
| auto-incrementing, but we offset them so that even numbers were
| orders and odd numbers were support tickets, and this helped
| quite a bit with debugging or even just non-technical users
| sending each other references. Not that this can't be done with
| prefixes on UUIDs with a bit of extra work.
|
| I wouldn't necessarily recommend auto-incrementing IDs or UUIDs
| over the other, but I don't think one is more elegant at all.
|
| > to optimize access times by a millisecond or two. UUID is
| totally worth the cost.
|
| 2ms * 20 queries is 40ms, which could take a typical page load
| for a web app from 150ms to 190ms, which is quite a regression.
| biztos wrote:
| I'm a big fan of SERIAL and BIGSERIAL, for
| aesthetic/cognitive reasons more than performance, but it can
| be really nice to have a data-generating process generate the
| primary ID of a piece of data, which you can then log before
| you put it in the database and look up the same ID everywhere
| when debugging.
|
| Not to mention the distributed database situation.
|
| In fact I like that enough that I sometimes use ULIDs for
| secondary unique keys just so I have something to log, even
| if the primary key is numeric.
|
| It's nice that you can store ULIDs in Postgres natively as
| UUIDs, and it's really nice to have a timestamp embedded in
| the ULID if you ever need it... but it's also really tempting
| to use it in public-facing stuff and thus leak your creation
| timestamp.
| danpalmer wrote:
| Absolutely, these are also good things to take into
| account. There are really situations for all of these
| things, and this is why I don't like the approach of trying
| to "solve" this problem. Lots of people say "UUIDs solve
| IDs", or "UUIDs cause too many issues to use", or "ULIDs
| solve UUID issues", and in reality this is just another
| place where there are trade-offs, and what works for one
| company/project/dbms/table may not work for another, or may
| even cease to work over time as requirements or scale
| change.
| jongjong wrote:
| Recently, I wrote a client-side wallet app for a cryptocurrency
| use case which required transactions to be created and signed
| entirely on the client side, so the transaction UUID had to be
| created on the client side (before signing), and I was
| surprised at how elegant and simple the application's logic
| turned out to be (both on the front end and back end).
|
| One of the best things about UUIDs is that if your front end
| tries to create a resource and fails due to a bad connection,
| you can simply resend that exact same resource for creation
| again (with the same UUID) and you don't need to worry about
| duplicate records being inserted into the database since you
| would get an ID collision; it results in much simpler/cleaner
| code which is idempotent and deterministic by default.
|
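| Sketched in SQL (table and column names made up), the
| retry-safe insert can be expressed as:
|
|     INSERT INTO transactions (id, payload)
|     VALUES ('f81d4fae-7dec-11d0-a765-00a0c91e6bf6', '...')
|     ON CONFLICT (id) DO NOTHING;  -- resending the same UUID is a no-op
|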
| With auto-incrementing IDs, if there is a connection issue or
| other failure, it's not possible to cleanly figure out if a
| resource was already created or not since only the database
| knows what the ID of that resource was until that ID is sent
| back and reaches the client across the network.
|
| Just because the client did not receive a successful response
| from the server, it does not mean that the resource was not
| created; and if it was, you cannot know since the client has no
| way of referencing that resource in a reliable way. It just
| seems completely wrong that the client (which created the
| resource) cannot reference it until the database has inserted
| it.
|
| It's a hack to pretend that resource creation starts in the
| database, when in fact, it starts on the front end.
| [deleted]
| gradys wrote:
| You can achieve this without UUID keys if you attach UUIDs to
| the requests. This has the advantage of working for all kinds
| of mutating requests, not just creates, but requires that you
| store these request UUIDs for some period of time as well.
| jongjong wrote:
| It's still not as elegant because it doesn't work in a
| multi-process or multi-server back end; if the request
| fails, you need to make sure that the retry request will
| hit the same server process (assuming you're keeping the
| UUIDs in memory on the server). You can use some in-memory
| data store like Redis to share the request UUIDs across
| multiple processes/hosts but that can add a lot more
| complexity, latency and risks and is just not worth it IMO.
|
| Even in a single-host setup, it can be a problem because
| what happens if your process crashes and restarts just
| after a resource was created in the database but before the
| ID was sent to the client (with success message)? You would
| end up with a duplicate record in your DB after the retry
| since your newly restarted server would not have the UUID
| in its memory (even though the resource was in fact already
| created on the server a few milliseconds before the last
| crash).
|
| With the Redis (or similar) solution, you need to make sure
| that the request UUIDs expire after some time and are
| cleaned up to avoid memory bloat which is a pain... I mean
| that complex solution probably uses up a lot more resources
| than just using UUIDs in the database as IDs.
| ramchip wrote:
| You can use the DB for this as well, just make a table
| e.g. "requests" with two columns, the user and the
| request token. Old entries can be purged on a cron job.
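|
| A minimal sketch of such a table (names made up):
|
|     CREATE TABLE requests (
|         user_id    bigint      NOT NULL,
|         token      uuid        NOT NULL,  -- client-supplied request ID
|         created_at timestamptz NOT NULL DEFAULT now(),
|         PRIMARY KEY (user_id, token)
|     );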
|
| If you add a column to store params as well then you can
| also do better validation:
|
| > Responding when a customer changes request parameters
| on a subsequent call where the client request ID stays
| the same
|
| > We design our APIs to allow our customers to explicitly
| state their intent. Take the situation where we receive a
| unique client request token that we have seen before, but
| there is a parameter combination that is different from
| the earlier request. We find that it is safest to assume
| that the customer intended a different outcome, and that
| this might not be the same request. In response to this
| situation, we return a validation error indicating a
| parameter mismatch between idempotent requests. To
| support this deep validation, we also store the
| parameters used to make the initial request along with
| the client request identifier.
|
| https://aws.amazon.com/builders-library/making-retries-safe-...
| jongjong wrote:
| You can but you add an additional database lookup step
| which will cost you more in performance (and latency)
| than just using UUIDs as the ID directly with a single
| table.
| ramchip wrote:
| It's only at creation time. When you lookup the record
| after that you can use the ID directly, and then you
| don't hit the locality problems from the article.
| whack wrote:
| > _The moral of the story is that data locality matters, and it
| can pop up in the most surprising of places. Using random is
| typically the worst thing you can do for locality, so if you
| want to use UUID's, try to use a sequential variant. UUID v7 is
| a good option_
|
| The author isn't suggesting abandoning UUIDs, he is suggesting
| using something like UUIDv7 which preserves locality. For most
| use-cases, this seems like a very reasonable recommendation
| with no significant downsides I can think of.
| [deleted]
| fauigerzigerk wrote:
| I agree that UUIDv7 will usually be the best choice. The one
| downside I can think of is that it may sometimes be necessary
| to let people know the identity of a thing without also
| telling them exactly when that thing was created.
| remram wrote:
| Encrypt it. Block ciphers are bijective so they will turn
| sequential IDs into unique random IDs in a way you can
| easily reverse.
|
| You can also do that to get opaque identifiers from an
| auto-increment primary key.
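|
| With pgcrypto, for instance (a sketch; key handling omitted,
| names made up), a bigint padded to one AES block becomes an
| opaque, reversible 128-bit token:
|
|     CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
|     SELECT encode(
|              encrypt(int8send(0) || int8send(12345),  -- pad the id to 16 bytes
|                      decode('00112233445566778899aabbccddeeff', 'hex'),
|                      'aes-ecb/pad:none'),
|              'hex')::uuid AS opaque_id;
|
| Decrypting with the same key recovers the original id.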
| fauigerzigerk wrote:
| Seems awfully complicated to me.
| pkilgore wrote:
| COMB UUIDs can help here.
| fauigerzigerk wrote:
| Why do COMB UUIDs not have this problem? Do they not
| contain a timestamp?
| [deleted]
| dewey wrote:
| > sacrifice the beauty and elegance of UUIDs
|
| Out of curiosity, what makes UUIDs more elegant or beautiful
| than just plain old integers?
| maxgee wrote:
| Nothing. Integers are simpler and faster. They're also
| incorrect about COUNTs requiring a full record scan.
| xkqd wrote:
| How are integers simpler? Even generation of unique
| integers is more complicated, let alone migrating them or
| moving them around.
| catlifeonmars wrote:
| 1. You have to handle race conditions if there is ever more
| than one process writing to the DB at one time 2. It's
| ambiguous whether a number is an ID. E.g. when looking at
| logs 3. Numbers leak information about table size, row age,
| and crucially _future_ row IDs.
| phito wrote:
| Nothing? Really? Have you ever thought about that?
| ceeam wrote:
| Their uniqueness of course.
|
| > To be able to generate keys independently of the database
|
| > To move sets of related records between different databases
| without having to deal with renumbering everything
| yes_man wrote:
| And the greppability of logs in debugging situations.
| You can just search the logs in any of your systems for the
| UUID and find exactly the hits you are dealing with, whereas
| with integers you will get all kinds of spurious hits.
| phkahler wrote:
| Because UUIDs are _universally_ Unique IDentifiers.
| jandrewrogers wrote:
| Unfortunately, UUIDs aren't as universally unique in
| practice as people think. The standard flavors are banned
| in quite a few organizations that care a lot about data
| integrity for valid reasons. Standard UUIDs work for _some_
| applications but definitely not all.
|
| For most applications you are better off using a sensible
| structured key, like incrementing an integer or similar,
| and encrypting it if you want to obscure it. Encrypting a
| 128-bit value is approximately free on modern CPUs.
| kogus wrote:
| "5" might be the key for a thousand different records in your
| database. A uuid is a key for exactly one. Integer keys
| permit wrong joins to have the appearance of working.
| FroshKiller wrote:
| Crucially, "5" might also be the key to a thousand
| different records in dozens of other development & QA
| databases for applications with integrations configured
| incorrectly. Integers can match coincidentally. UUIDs
| won't.
| magicalhippo wrote:
| While wrong joins is a weak argument IMHO, our customers
| somewhat frequently ask to merge data, either from test to
| prod or from one company to another (after acquisition
| f.ex.).
|
| In a lot of cases that can be a pain due to overlapping
| integer primary keys and large parent-child table sets.
| antifa wrote:
| You can tell all your tables to use the same sequence.
| totalhack wrote:
| You can incorrectly join on any columns you want. That's
| not really the fault of the column types IMO, that's a
| problem in the layers above.
| dpcx wrote:
| True, but you can't incorrectly join on UUID due to its
| anti-collision nature.
| ibash wrote:
| Yes you can, you'll not get the correct result set.
| OJFord wrote:
| As sibling says, you can make exactly the same mistake
| and it not colliding just means no join - but if it's an
| outer join and one of many joins or in a nested query it
| might not be immediately obvious it's happened, or what
| the issue is even if it is.
| smolder wrote:
| We have some app-generated internal IDs which have a
| 1-character prefix by type and then base64 sequential stuff
| after that. I'll admit it's nice knowing the type of an
| identifier by looking at it. I don't really like that they
| are strings, however. The original architect made some
| weird choices as far as deciding where and where not to
| optimize for performance.
| Phil_Latio wrote:
| [dead]
| allknowingfrog wrote:
| That's not a very constructive response. UUIDs do provide
| global uniqueness, which can occasionally make mistakes
| more obvious. Do you have a counter argument to support
| your strong objection?
| Raz2 wrote:
| > To be able to generate keys independently of the database
|
| This is a must if you don't want to couple your domain logic
| and DB. I want to generate a record with an ID inside my
| domain layer without depending on the DB and doing awkward DB
| save - DB read.
| pgaddict wrote:
| It is not a matter of a couple milliseconds.
|
| The loss of locality for reads is bad, especially for data sets
| that don't fit into cache / RAM (while the active set would).
|
| Where it really bites you is writes, because it can trigger
| pretty massive write amplification.
|
| Imagine you have a 128 GB index on a UUID column - that's ~16M
| pages (8kB each) - and you insert 1M random values. Each insert
| likely lands on a different leaf page, and the first update of a
| page after a checkpoint writes a full-page image (the whole 8kB)
| to the WAL: 1M x 8kB = 8GB of WAL. Congrats! With serial IDs
| we'd write a fraction of that, because consecutive inserts hit
| the same few leaf pages. It doesn't take much to hit
| max_wal_size and trigger a checkpoint, starting a new cycle with
| FPWs. Got a replica? Well, now you need to send the WAL over the
| network. Is the bandwidth limited (another DC?), sorry to hear
| that. Is the replica synchronous, so you have to wait? Well,
| that's unfortunate.
|
| In other words, the lack of locality seems like a detail but at
| scale it's actually a damn huge deal.
| qaq wrote:
| this sounds horrible until you realise that a modern SSD will
| write that in about 1.5 sec.
| pgaddict wrote:
| That was just an example calculation, to illustrate the
| write amplification factor, of course. You can scale it up
| pretty arbitrarily.
|
| I mentioned only WAL for simplicity, but it also has to
| modify the index pages themselves and eventually write
| them out. And that's going to be mostly random
| I/O. Flash storage is good at handling that, ofc, but if
| things are adding up like this ...
|
| Not to mention you still have to copy the WAL over network
| to replica, or perhaps to multiple replicas. And if you
| have physical backups with PITR, you gotta keep all the WAL
| somewhere too.
| qaq wrote:
| sure but not many workloads are writing out million
| inserts per second.
| [deleted]
| ok123456 wrote:
| Then just make a generated column that's a 32bit integer hash
| of the uuid for this particular case and create an index on
| that? Use that during expensive queries that blow up your
| cache locality if it matters.
| insanitybit wrote:
| A 32bit integer hash won't have locality either.
| ok123456 wrote:
| Ok. Pick a resolution where you won't get collisions and
| that is a native datatype.
| patrec wrote:
| I feel that you don't understand what locality is about:
| the property of IDs that were generated closely together
| in time to be close together numerically. You don't get
| that by "hashing" the UUID (which makes no sense anyway,
| since you might as well just take some truncation of the
| UUID). In fact the whole idea behind a hash is to destroy
| this property of the input data. The reason the numerical
| locality matters is that it is much more efficient to insert,
| in sequence, several numbers that are close together
| into an index than to insert the same amount of randomly
| distributed numbers.
|
| The GP was talking about the first graph here:
| https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-pa...
| ok123456 wrote:
| I thought you were talking about cache locality within
| the CPU. Real cache locality.
|
| If you care about the insert location, why not just add a
| brin index on a timestamp field and use that instead of
| assuming that the index is sequential.
| patrec wrote:
| How would adding a BRIN index help in any way with
| reducing the discussed problems, such as write
| amplification?
| pgaddict wrote:
| This is a bit confusing, as it mixes two things - BRIN
| index and index on a timestamp.
|
| The main source of write amplification comes from
| updating random pages of the btree index. Imagine
| inserting 10 random UUID values into a large index - it's
| pretty likely those will go into 10 different leaf pages.
| And for every first update of a page after a checkpoint
| (which typically happens every 30 minutes or so), we have
| to write an FPI (i.e. the whole 8kB page) to WAL. So
| because btrees are based on ordering, random values end
| up on random leaf pages, causing write amplification.
|
| If you have BRIN index on UUID column, this does not
| happen, because the index is not based on ordering but
| location in the table. If the 10 rows get appended to the
| same table page, that'll be just 1 write, with one FPI.
|
| This is why BRIN does not have the write amplification
| issue. But it's also a bit pointless, because BRIN on
| random data is pretty useless for querying (Well, at
| least the minmax indexes, are. Let's ignore BRIN bloom
| indexes here.)
|
| If you create BRIN on timestamp, that's not going to have
| write amplification problem, and it'll be good for
| querying. The thing is - BTREE would not have write
| amplification problem either, because the timestamps are
| going to be sequential (hence no updates to random leaf
| pages).
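|
| For reference, the two index shapes being contrasted here
| (names made up):
|
|     CREATE INDEX ON events USING brin (created_at);  -- tracks page ranges; append-friendly
|     CREATE INDEX ON events USING btree (id);         -- ordered; FPI-heavy for random keys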
| saltcured wrote:
| Right, you just need to avoid having the btree index on
| the UUID field. Similarly, don't pose queries to sort by
| the randomized UUID field either. This is where the often
| maligned hash index type could be useful, to allow lookup
| of individual rows by UUID without all the expense of a
| btree index maintenance. Use other fields for ordering
| that have better write locality.
|
| What this means in practice, of course, is that you
| shouldn't expect to do application driven pagination with
| UUID keys either. You would need to expose some other
| boundary marker with a total order that works well with
| btrees. And this could bring you back to "leaking"
| predictable key material that you were trying to hide by
| adopting UUIDs...
| insanitybit wrote:
| CPU cache locality is not "real cache locality", it is
| just another place where caches exist and locality is
| optimized for. Your solution wouldn't work for CPUs
| either, other than that you could fit more data into a
| cache line - but that's like saying "increase the cache
| size to solve this problem", which obviously can help but
| is not addressing the inherent issue.
| insanitybit wrote:
| Hashes evenly distribute information regardless of the
| input. You will never have locality unless your dataset
| is so small that all of it can fit into a single cache
| line.
|
| Your solution just doesn't really make any sense. The
| time-ordered UUID suggested in the blog post makes way
| more sense.
| unshavedyak wrote:
| Out of my scope, but why are UUIDs even discussed? ULIDs
| ~~(and i think Nanoids?)~~ don't suffer these same problems.
| Locality and ordering alone make me[1] think ordered ULIDs
| (and friends) are the only thing worth discussing.
|
| Is there some value to UUIDs over ULIDs that make these
| discussions largely revolve around Autoincrement vs UUIDs
| rather than Autoincrement vs ULIDs(and friends)?
|
| [1]: Again, totally out of my wheelhouse.
|
| _edit_ : Apparently UUIDv7 exists, which is similar to ULID,
| so my question pertains to UUIDv5 and below, i think.
|
| _edit2_ : I think Nanoids _do_ suffer the same problem.
| They're just smaller... I think.
| ljm wrote:
| I've worked with ULIDs a bit but honestly haven't operated
| at a scale where you might run into issues. And the main
| reason for choosing ULID was because I could only find
| experimental support for UUIDv6 or UUIDv7, and KSUID
| (another alternative) only had time precision down to a
| second.
|
| And the reason for that was to have lexicographically
| sortable IDs (even if not monotonic, which would require an
| extra server) so we didn't have to index a timestamp column
| for chronological ordering. The alternative was to have a
| traditional integer sequence but it's not always ideal to
| expose those on your API.
| cratermoon wrote:
| Out of curiosity, why do your keys need time precision
| less than a second? Are they carrying some secondary
| expectations beyond uniqueness and k-sortable?
| letitbeirie wrote:
| Postgres supports UUIDs natively
| unshavedyak wrote:
| Is that worth the pain of dealing with randomized
| inserts? I guess I just don't mind creating a ULID (or I
| guess UUIDv7, which is newly proposed and sortable) and
| inserting that.
|
| Native DB support is irrelevant to me for randomized bits
| unless it affects storage, sorting, paging, etc. Does it?
| code-e wrote:
| It does affect storage and sorting. A native UUID type
| uses 16 bytes. The alternative is text encoding (32 bytes
| for hex), or maybe a raw BYTEA. Postgres also has
| SortSupport for the UUID type, which basically means if
| the first 8 bytes only has one matching row, then the
| remaining 8 bytes can be skipped. Combine that with a
| ULID, where the first half is basically a timestamp, and
| you'll get performance close to using a single 8-byte
| BIGSERIAL.
|
| You can also write a plpgsql function to generate these
| ULIDs in the database.
| lolinder wrote:
| ULIDs are byte-compatible with UUIDs, so the only thing
| Postgres's native support gives you over ULIDs is that
| Postgres can generate new UUIDs for you instead of having
| to do it in the application before insertion.
| letitbeirie wrote:
| It also buys you that support in the driver libraries.
|
| I ran into this last year - storing them as UUIDs works
| great, unless all of the drivers for the language you're
| using (Go, in my case) try to cast/validate those bytes
| _as_ a UUID before you can access them.
| xahrepap wrote:
| ULIDs have made our MySql caches breathe a sigh of relief.
|
| One place we're avoiding ULIDs (and other counters) is in
| publicly-facing IDs, preferring random to help keep them
| unguessable (say what you will about
| security-through-obscurity).
|
| So we do ULIDs for private IDs. Random UUIDs for public
| IDs. Seems to work well.
| ilyt wrote:
| I wonder if just encrypting them for public usage would
| be enough here. Then application can convert between
| public and private representation at will.
| smaddox wrote:
| KSUIDs have temporal-lexicographical order plus 128
| bits of entropy, which is more than UUIDv4.
|
| https://github.com/segmentio/ksuid
| catlifeonmars wrote:
| At scale you're probably sharded across multiple DBs and
| you're already operating through replicas. Point being you're
| less likely to hit a warm cache as you scale up anyway as
| your application layer gets load balanced to different DB
| endpoints.
| magicalhippo wrote:
| Reminds me of a talk about ZFS performance, where they
| presented some benchmark results showing that as the number
| of concurrent clients doing pure sequential IO increases,
| the load increasingly appears as random IO to the filesystem.
|
| So with high enough concurrent load it's effectively all
| random IO and that's the primary thing worth optimizing
| for.
| jongjong wrote:
| Yes exactly, with proper sharding, raw performance is not
| as important; to some extent, you trade it away for
| improved concurrency. In fact, I struggle to see how one
| would implement sharding with auto-incrementing integers
| (you would get ID collisions for different resources across
| different shards/database instances); there needs to be a
| way to uniquely refer to resources across potentially
| multiple databases and UUIDs are one of the best ways to
| achieve that.
|
| Auto-incrementing IDs simply don't scale beyond a single
| host so I don't see how they can be good for scalability.
| Too many devs conflate raw performance with scalability.
| They are not the same at all; in fact, high scalability
| often incurs a performance overhead.
| pgaddict wrote:
| I respectfully disagree with the notion that sharding
| makes resource usage somehow less important. Sure, it
| allows you to overcome the limits that would apply to a
| single node, but if you stop caring about using resources
| efficiently (e.g. memory / disk / network bandwidth),
| it's not very different from building huge and expensive
| single boxes.
|
| Also, I don't think the blog (or I) suggested going back
| to using autoincrement IDs. There are other (better)
| options.
| coredog64 wrote:
| We did auto-increment integers with multiple servers some
| 20 years ago. The caveat is that you have to know how many
| servers are in the set in advance: each server increments
| by the population size, and its starting number is its
| position within the pool.
|
| Not hyper scale, but good enough for failover or a 3-5
| node setup.
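|
| In Postgres terms that's just per-node sequence
| configuration, e.g. for node 2 in a pool of 3 (a sketch;
| the sequence name is illustrative):
|
|   -- each node starts at its own position and strides by the
|   -- pool size, so the nodes never collide
|   CREATE SEQUENCE users_id_seq START WITH 2 INCREMENT BY 3;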
| viraptor wrote:
| Or increment by more than you need. If you think you'll
| only ever have 5 nodes, increment by 20. Lots of space
| for expansion and in practice you'll hit the datatype
| limits just a bit earlier - you'd need to work around
| them anyways.
| ilyt wrote:
| IIRC that's what MySQL Galera is doing by default:
|
|   mysql> CREATE TABLE animals (
|       ->   id MEDIUMINT NOT NULL AUTO_INCREMENT,
|       ->   name CHAR(30) NOT NULL,
|       ->   PRIMARY KEY (id)
|       -> );
|   Query OK, 0 rows affected (0.34 sec)
|
|   mysql> INSERT INTO animals (name) VALUES
|       -> ('dog'),('cat'),('penguin'),
|       -> ('lax'),('whale'),('ostrich');
|   Query OK, 6 rows affected (0.01 sec)
|   Records: 6  Duplicates: 0  Warnings: 0
|
|   mysql> SELECT * FROM animals;
|   +----+---------+
|   | id | name    |
|   +----+---------+
|   |  3 | dog     |
|   |  6 | cat     |
|   |  9 | penguin |
|   | 12 | lax     |
|   | 15 | whale   |
|   | 18 | ostrich |
|   +----+---------+
|   6 rows in set (0.00 sec)
| egeozcan wrote:
| Hi/Lo algorithm also works fine for most cases
| uppiiii765 wrote:
| Mmmhhh, should that not be highly efficient for flash-based
| storage?
| nine_k wrote:
| If you want locality, speed, simplicity, etc above all, use
| an incremented integer and be done with it.
|
| UUIDs belong where you _can't afford_ that simplicity. Where
| you e.g. cannot coordinate the creation of your primary keys.
| Or where you cannot allow them to be predictable. There you
| pay the price.
|
| In practice I noticed that the size of PKs and their poor
| locality start to play a role only after a huge basket of
| lower-hanging fruit has been collected. There are relatively
| few places where their role is dramatic.
| marcosdumay wrote:
| > Where you e.g. cannot coordinate the creation of your
| primary keys.
|
| You are writing your data into a DBMS. Coordinating the
| creation of primary keys is one of the cheapest tasks
| around; if you can't do that, how is your database still
| online?
|
| > Or where you cannot allow them to be predictable.
|
| You don't need to export your PKs for the rest of the
| world. You can have non-predictable data outside of your
| PK. Yes, a different column will still have some of the
| problems with index maintenance, but it becomes a much
| smaller problem if only one table cares about the value.
| nine_k wrote:
| You can almost always opt for artificial PKs that are
| more performant.
|
| But sometimes you have to make these keys public, e.g. as
| user or other resource IDs. You want to make them UUIDs
| so they won't be predictable. Not having to join
| everywhere with the UUID-to-artificial-PK table may be a
| bigger performance win than the losses from larger size
| of UUIDs.
|
| Sometimes you have a distributed / sharded system, and
| don't want the keys to clash, and also avoid assigning
| ranges. Sometimes you have to accept someone else's ID,
| not originating in your system. In cases like that, large
| random numbers, e.g. UUID v4, work reasonably well.
|
| Of course, when you just have one DB and a relatively slow
| stream of new rows, it's easy to fully control PK
| creation. And this covers the majority of practical
| cases.
| jgraettinger1 wrote:
| We use a macaddr8 that embeds a wall-clock timestamp (so
| they're in ascending order, achieving data locality) with
| some additional shard and sequence-number bits. It's worked
| really well for us:
|
| https://github.com/estuary/flow/blob/master/supabase/migrati...
|
| We use macaddr8 instead of bigint because it has a Postgres
| serialization / JSON encoding which losslessly round-trips
| with browsers, and it works well with PostgREST. The same
| CANNOT be said for bigint, which is a huge footgun.
| gregmac wrote:
| > UUIDs belong where you can't afford that simplicity.
| Where you e.g. cannot coordinate the creation of your
| primary keys. Or where you cannot allow them to be
| predictable. There you pay the price.
|
| You often don't realize you have non-simple needs until
| your application is reasonably mature, and in production.
| If you already picked integer keys, you now either forever
| deal with the issues caused by not using UUIDs, or you deal
| with the unknown-but-non-zero pain of converting to UUIDs.
| bfrog wrote:
| Snowflake IDs fit this sort of thing better, and maintain
| temporal locality.
|
| I ran into this exact issue when building out a graph-like
| database in Postgres many moons ago.
|
| It's probably less of an issue with fast NVMe SSDs now, but
| on mechanical drives ordering and locality are massive.
| jamil7 wrote:
| Can UUID v7 still be created independently, on a client for
| instance? Or do they ideally need to be generated in the same
| place?
| jayknight wrote:
| They can be generated anywhere, otherwise they wouldn't be
| universally unique.
| jamil7 wrote:
| I'm not asking about the uniqueness but about ordering across
| different clients.
| roenxi wrote:
| There is probably an impossibility theorem somewhere about
| two clients generating ordered IDs without communicating
| with each other. What if one client was travelling near the
| speed of light for example? It isn't obvious what order
| would be correct since the two clients would have wildly
| different perceptions of time.
| ahoka wrote:
| Yes, it's impossible in theory. Same as the two generals
| problem, if I'm not mistaken.
| sibit wrote:
| If they're generated on the client you would need to keep
| clock drift[0] in mind since the leading bits are time
| based.
|
| [0]: https://youtu.be/mAyW-4LeXZo (Clock Synchronization in
| Distributed Systems by Martin Kleppmann)
| stevesimmons wrote:
| Yes, UUID v7 can be generated independently. The UUID has
| three parts, which maintain strict chronological ordering
| for UUIDs generated in one process, and ordering to within
| clock skew when UUIDs are generated on different systems.
|
| The three parts are:
|
| - time-based leading bits.
|
| - sequential counter, so that multiple UUIDv7s generated very
| rapidly within the same process will be monotonic even if the
| time counter does not increment.
|
| - enough random bits to ensure no collisions and UUIDs cannot
| be guessed.
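|
| Those time-based leading bits can even be pulled back out
| with plain SQL, which shows the structure nicely (a sketch;
| the example UUID below is made up):
|
|   -- first 12 hex digits = 48-bit Unix timestamp in ms
|   SELECT to_timestamp(
|     ('x' || substr(replace(
|        '01890a5d-ac96-774b-bcce-b302099a8057', '-', ''),
|        1, 12))::bit(48)::bigint / 1000.0);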
| jamil7 wrote:
| Thank you.
| westurner wrote:
| > _There are new UUID formats that are timestamp-sortable;
| for when blockchain cryptographic hashes aren't enough
| entropy._
|
| Note that multiple rounds of cryptographic hashing are not
| considered sufficient anymore; PBKDF2 and Argon2 are Key
| Derivation Functions, and those are used instead of plain
| hash functions.
|
| "New UUID Formats - IETF Draft"
| https://news.ycombinator.com/item?id=28088213
|
| draft-peabody-dispatch-new-uuid-format-04 Internet-Draft "New
| UUID Formats":
| https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...
| ; UUID6, UUID7, UUID8
| jongjong wrote:
| I don't have a problem with UUIDv7 but I wish they had
| created it as a completely separate standard after UUIDv4...
| The increasing version numbers imply that previous versions
| have been superseded but in fact, they just have different
| priorities...
|
| And TBH I'm concerned that UUIDv1 had a timestamp, then it
| was removed completely in UUIDv4 and now the timestamp
| concept is being added back to UUIDv7... There are legitimate
| use cases where you simply don't want to have timestamps in
| your IDs.
| ahoka wrote:
| These are not versions in the sense that one supersedes the
| other. More like variants.
| jongjong wrote:
| I hope devs and security consultants will see it that
| way. I can easily envision a future where security
| consultants would write reports flagging systems as
| 'insecure' on the grounds that they use 'an outdated UUID
| version'.
|
| Security consultants can be quite blunt in their approach
| and companies will often yield to their every demand for
| the sake of easy compliance and to avoid having to
| explain stuff.
| remram wrote:
| I wish they could feel that way about IP versions...
| NicoJuicy wrote:
| Tbh I use UUIDs a lot; perhaps it's useful to know that
| deterministic UUIDs exist,
|
| e.g. DeterministicGuid.Create(tenantNamespace, "my-tenant-name");
|
| Note:
| https://github.com/Informatievlaanderen/deterministic-guid-g...
| for dotnet.
|
| Additionally, know that there are different versions of
| UUIDs. If you want to create chronological UUIDs so that
| queries are more easily ordered, use the correct type.
| BiteCode_dev wrote:
| That's... not unexpected. That's been the known technical
| compromise of UUIDs for years, and why ULID and UUIDv7 were
| born.
|
| And usually it is still worth it.
| Tehnix wrote:
| The main problem is probably that you will not find a "regular
| Engineer" knowing this unless they've had to deal with scaling
| issues on a DB (e.g. high IOPS etc).
|
| The knowledge is definitely out there, but almost all articles
| will be reaching for UUIDs, so this is understandably what most
| people end up with :/
| Timpy wrote:
| For me, the visual noise is a downside for UUIDs. A lot of time
| investigating data issues means glancing at query results and
| deciding if something looks unexpected. I just can't parse a UUID
| with my eyes that quick. I've come up against this at my current
| job a little too often and I'm cursing the decision to switch to
| UUIDs. I know, it's a balancing act of competing concerns. But
| for us moving to UUIDs was future proofing for a problem that
| never arose, and the senior dev who made us switch left the
| company a few years ago. I just don't like the dev experience of
| UUIDs in a world where you do have to get hands (and eyes) on the
| data all the time.
| Pet_Ant wrote:
| Something I saw a long time ago was hashing UUIDs into an
| RGB value and then colouring the background of the cell (at
| least when viewing in a DB IDE). That way you can quickly
| tell at a glance whether it's worth going over character by
| character.
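|
| A quick sketch of that trick in SQL, for tools that can
| render a colour per row ("orders" is a hypothetical table):
|
|   -- first 3 bytes of the md5 hash become an RGB hex code
|   SELECT id, '#' || left(md5(id::text), 6) AS row_colour
|   FROM orders;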
| adverbly wrote:
| Holy crap, this is brilliant. I'd just change the text
| color of the UUID itself though - no need to change the
| background (although you'd need to make sure that the hash
| function avoided your existing background color to maintain
| a high enough contrast ratio for readability).
|
| I want this as a vscode extension.
|
| Or maybe as a terminal plugin or something? Is that possible?
| Could tmux do it maybe?
| Timpy wrote:
| Wow I really love that idea. Next time I get stuck on a fix
| tinkering with my setup I'm going to check this out.
| michaelsalim wrote:
| Interesting take. I do agree there's some visual noise there,
| but not in the way you described. UUIDs just take up so much
| screen space compared to their integer counterparts. If you
| don't have a good UI sorting things out, it's quite annoying.
| Otherwise it's pretty great.
| apelapan wrote:
| Visual noise and screen-space hogging annoy me a lot,
| especially when you end up with multiple UUIDs on the same
| log line.
|
| If one doesn't _truly_ need distributed creation of globally
| unique identifiers, it is so much nicer with base35-encoded
| integer sequences. Preferably loosely based on a timestamp.
| arp242 wrote:
| It's kind of a tooling problem; there's no reason tooling
| can't do the same as git and display "1509af9" instead of
| "1509af9af2634d16f8d9b98e01a0166a49185474".
|
| Also, I wish these sorts of things would get encoded in
| base-36 (0-9 a-z) instead of base-16; that would help too.
| ComputerGuru wrote:
| ULID is an ordered UUID canonically rendered not in Base36
| but in Crockford's Base32 (to make it easier to read and
| write, say aloud, parse, and error-correct).
| SkyPuncher wrote:
| Interestingly, my experience has been the opposite. It's
| incredibly easy to search and filter logs with UUIDs. You can
| search for just the UUID without any other information.
| [deleted]
| londons_explore wrote:
| There are downsides to anything ascending/descending in a
| database too...
|
| They mess with table statistics. If you do a query like
|
|   SELECT * FROM users
|   WHERE creation_date > now() - interval '1 hour';
|
| the query analyzer doesn't know that there might be
| thousands of users created in the last hour. It is probably
| working from day-old statistics that say all users have a
| creation_date between 2008 and 2023-06-21.
|
| That makes it sometimes pick an exceptionally poor query
| plan. I.e. instead of your query taking 50 milliseconds, it
| might take 50 hours and involve an n^2 scan of all data in
| your database.
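|
| One common mitigation is to analyze hot tables more
| aggressively, so the planner's histograms cover recently
| inserted rows (a sketch; the table name and threshold are
| illustrative):
|
|   -- re-gather statistics after ~1% of the table changes,
|   -- instead of the 10% default
|   ALTER TABLE users
|     SET (autovacuum_analyze_scale_factor = 0.01);
|   -- or force it by hand after a bulk load
|   ANALYZE users;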
| willvarfar wrote:
| > if you want to use UUID's, try to use a sequential variant.
| UUID v7 is a good option. Hopefully it's coming to PostgreSQL 17
|
| Yeap, UUIDv7 directly addresses the issues in the article. Warmly
| recommended.
|
| (There is the issue that exposing ULIDs to users may 'leak'
| information about when things were created etc, but that is
| usually not a problem.)
| rollcat wrote:
| I also like UUIDv8. It's broadly similar to v7, but the
| vendor is free to define how the timestamps are represented,
| how to split the bits between timestamp and randomness, its
| own packing, etc.
|
| So we've started making UUIDs that encode the current ISO8601
| date+time in a human-readable format:
| YYYYMMDD-HHMM-VRRR-RRRR-RRRRRRRRRRRR
|
| This is especially useful for things you have few of (no more
| than a couple per minute), that you regularly need to cross-
| reference with other systems (e.g. files on S3).
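|
| A plain-SQL sketch of generating such a value (assumes
| pgcrypto; v8 leaves the variant bits to the vendor, and here
| they're simply left random):
|
|   -- date/time prefix, '8' version nibble, random remainder
|   SELECT format('%s-8%s-%s-%s',
|            to_char(now() AT TIME ZONE 'UTC',
|                    'YYYYMMDD-HH24MI'),
|            substr(r, 1, 3), substr(r, 4, 4),
|            substr(r, 8, 12))::uuid
|   FROM (SELECT encode(gen_random_bytes(10), 'hex') AS r) AS s;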
| pgaddict wrote:
| The one problem with these "perfectly" sequential UUIDs is
| that they can easily lead to index bloat. Imagine you have such
| sequential UUIDs generated over a year, for example. And then
| you delete e.g. 99% of old data (say, everything except some
| records that you're required to keep for audit purposes or
| whatever).
|
| If there's an index, the "old" part will be 99% empty. For
| regular UUIDs this would be fine, because new entries would
| get routed to this part of the index and the space would be
| reused. Not so for sequential UUIDs (v7/v8).
|
| This is mostly why, a year ago, I wrote the
| "sequential-uuids" extension, which does roughly what v7/v8
| do, but wraps the timestamp around once in a while.
|
| Of course, if you don't delete data, this is not an issue and
| v7/v8 will work fine.
| rollcat wrote:
| Different UUID generators for different use cases! I'm also
| a fan of v5 for generating stable identifiers, when
| tracking things such as DNS names or URLs. Useful in web
| crawlers, host inventories, etc.
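|
| In Postgres that's available via the uuid-ossp extension (a
| sketch; the URL is just an example):
|
|   CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
|   -- the same namespace + name always yields the same UUID
|   SELECT uuid_generate_v5(uuid_ns_url(),
|                           'https://example.com/some/page');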
| JdeBP wrote:
| Enough iterations in this direction, and people will come
| full circle to how message IDs operated on Usenet.
| <datetime.processID.sequencenumber@domain> for example.
| pfooti wrote:
| Weirdly - and this may be a side effect of writing code that
| uses Cloud Spanner - I tend to think of UUIDv4's random
| distribution through the keyspace as a _good_ thing.
|
| My instinct is to not use sequential-ish indices/primary keys,
| because I don't want to hotspot one part of the storage with all
| my writes for today in the same tablet.
___________________________________________________________________
(page generated 2023-06-22 23:02 UTC)