[HN Gopher] New UUID Formats
___________________________________________________________________
New UUID Formats
Author : swyx
Score : 265 points
Date : 2022-06-12 15:17 UTC (7 hours ago)
(HTM) web link (www.ietf.org)
(TXT) w3m dump (www.ietf.org)
| nixpulvis wrote:
| Maybe I'm being slow right now, but can somewhat help me
| understand why Max UUID is ever specifically useful?
| okr wrote:
| My guess is, that it is just a constant, ready to be used for
| bitwise operations.
| nixpulvis wrote:
| Yep... I was just being slow. I thought the specialized
| variant was a new kind of UUID (being inverse of RFC4122),
| but it's just the inverse of RFC4122's Nil UUID; a single
| value.
|
| Sorry for the silly comment. This value is just a bunch of
| binary 1s.
| lewisl9029 wrote:
| I've been using ULID for a while now, which analogous to UUID v7
| but with a different (better IMHO) string representation. They've
| been awesome for using as sort keys in dynamo for instance, since
| they're lexicographically sortable as strings.
|
| But one thing I'm still wary about is exposing these IDs with
| millisecond-precision time components to end users, since I've
| seen multiple discussions here on HN about the potential for
| timing attacks.
|
| How worried should I really be? Do people have useful heuristics
| on the kinds of data where it's safe/unsafe to expose timing
| information, or should I just only expose a separate UUID v4
| externally across the board just to be safe?
| doliveira wrote:
| I came up with a scheme in which the random part of the ULID is
| a slice of the hash of a UUID, which seems to work fine because
| I'm generating it all server side, I guess? It works very well
| for insertion into NoSQL databases, for instance.
|
| So I'd also like to know the threat model for these timing
| attacks.
| kbumsik wrote:
| UUIDv7 looks interesting, but how is it different from ULID [1]
| in practice? I was considering using ULID for a upcoming new
| project because it is lexicographically sortable but it looks
| like UUIDv7 just can replace that.
|
| [1]:
| https://cran.r-project.org/web/packages/ulid/vignettes/intro...
| rekwah wrote:
| As the author of a popular ULID implementation in python[1],
| the spec has no stewardship anymore. The specification repo[2]
| has plenty of open issues and no real guidance or communication
| beyond language implementation authors discussing corner cases
| and the gaps in the spec. The monotonic functionality is
| ambiguous (at best), doesn't consider distributed id
| generation, and is implemented differently per-language [3].
|
| Functionally, UUIDv7 might be the _same_ but the hope would be
| for a more rigid specification for interoperability.
|
| [1]: https://github.com/ahawker/ulid
|
| [2]: https://github.com/ulid/spec
|
| [3]: https://github.com/ulid/spec/issues/11
| RobertRoberts wrote:
| Thank you, I've been using ULID for a while now, and it
| serves my purposes. But I have long term support concerns.
|
| UUIDv7 really seems like the sweet spot between pure
| INT/BIGINT auto incrementing PKs and universally sortable
| universal ids.
| kortex wrote:
| I've bee using ULIDs in python for about a year now and so
| far have been super happy with them, so a) thank you for
| maintaining this! b) I always felt a bit uneasy about the way
| the spec describes the monotonicity component. Personally I
| just rely on the random aspect as I am fortunate enough to
| say that two events in the same millisecond are effectively
| simultaneous.
|
| At that point, it's basically just UUID7 with Crockford
| base32 encoding, more or less.
|
| IMHO the in-process monotonically increasing feature of ULID
| is misguided. As you mention, distributed ids are a pain. The
| instant you start talking distributed, monotonic counters or
| orderable events (two threads count as distributed in this
| case), you need to talk things like Lamport clocks or other
| hybrid clock strategies. It's better to reach for the right
| tools in this case, vs half-baked monotonic-only-in-this-
| process vague guarantee.
| canadiantim wrote:
| How would one go about trying to use UUID 7 in a Postgres
| database / python codebase now?
| Starlevel001 wrote:
| Postgres UUIDs are treated as an opaque 8-byte array, you can
| use anything as long as it's in the textual format.
| oittaa wrote:
| I made a simple Python test library that extends the standard
| UUID class with UUIDv6 and UUIDv7. You might want to check it
| out. https://github.com/oittaa/uuid6-python
|
| The official UUID Draft repository has also some alternatives
| if you'd like to check those out.
| https://github.com/uuid6/prototypes
|
| Postgres supports UUIDs with any version number natively so you
| can then do something like this with it: create
| table data (id uuid, firstname varchar(100)); insert into
| data (id, firstname) values
| ('017f21cf-d130-7cc3-98c4-dc0c0c07398f', 'John'); select
| * from data;
| amluto wrote:
| Fortunately this is a bit less relevant today as Windows loses
| market share in database and server applications, but:
|
| UUIDs have historically massively screwed up endian handling.
| While this new draft discusses sorting UUIDs as strings of octets
| (bytes) and the _text_ of RFC4122 is fairly explicit about most
| significant bytes coming first, the C UUID structure in RFC 4122
| appendix A is entirely misguided: typedef
| struct { unsigned32 time_low; unsigned16
| time_mid; unsigned16 time_hi_and_version;
| unsigned8 clock_seq_hi_and_reserved; unsigned8
| clock_seq_low; byte node[6]; } uuid_t;
|
| Those aren't bytes -- they're integers of various sizes. (Hint:
| do _not_ use integer types in C code for portable data
| structures. ntohl, etc are a mess. Just use arrays of bytes.)
|
| I don't know the whole history, but MS somehow took this
| structure at face value and caused problems like this:
|
| https://github.com/uuid-rs/uuid/issues/277
|
| So, if you want to do anything (e.g. sorting) that depends on the
| representation of a UUID (or even depends on converting between
| string and binary representations), be aware that UUIDs coming
| from Windows may be little-endian. In my book, this is a Windows
| bug, but opinions may differ here.
| Someone wrote:
| > Hint: do not use integer types in C code for portable data
| structures. ntohl, etc are a mess. Just use arrays of bytes.
|
| I don't see how that helps much. If a developer forgets to call
| _ntohl_ on multi-byte integer fields, I don't trust them to
| correctly convert said integers to arrays of bytes, either.
| kevincox wrote:
| If you write it the naive way it works.
| uint8_t bytes[4]; uint32_t = bytes[0] << 24 +
| bytes[1] << 16 + bytes[2] << 8 + bytes[3];
|
| The endianness is whatever you write in the indexing and will
| be the same across architectures.
| e4m2 wrote:
| Unfortunately, the naive way also turns out to be wrong in
| C. uint8_t gets promoted to a signed int when shifting,
| which in turn causes undefined behavior for specific input.
| One way of fixing this is casting to the desired type
| before the shift, thus avoiding surprising conversions.
|
| On a side note, compiler warnings and sanitizers help with
| this kind of stuff greatly, use them if you have the
| option: https://godbolt.org/z/8oq9GTcze
| quotemstr wrote:
| Little endian won. I don't see the point of maintaining
| theoretical compatibility with big endian systems that are at
| best esoteric right now and soon to be extinct. Likewise, every
| reasonable platform aligns data fields on natural alignment
| these days. It's just a waste of effort to make software
| portable to evolutionary dead ends.
| wolverine876 wrote:
| You're saying it won for this application (UUIDs)?
| Universally?
|
| What is the most common remaining use of big endian?
| tech2 wrote:
| TCP/IP might be a reasonable example. There's a reason
| "network byte order" and "big endian" are the same thing.
| wolverine876 wrote:
| > TCP/IP might be a reasonable example.
|
| That's a pretty significant example of widespread usage
| that isn't going away soon. Perhaps the GGP was only
| referring to UUID applications.
| mort96 wrote:
| This isn't about compatibility with big-endian machines. This
| is about compatibility between different uuid libraries,
| potentially on different operating systems, all on little
| endian CPU architectures.
| ntauthority wrote:
| Every existing library follows the standard which specifies
| host byte order, which usually means little endian. The
| Rust library cited a few levels up this chain ignored that,
| somehow assuming big endian, and then had to correct for
| this mistake.
| [deleted]
| [deleted]
| firebird84 wrote:
| It's actually worse than that. The first 3 groupings
| (textually) of the uuid might be little endian while the other
| 2 are big endian. Learning this cost me more time than I care
| to admit.
|
| https://en.wikipedia.org/wiki/Universally_unique_identifier#...
| cpach wrote:
| What the...?
|
| This fact will haunt me in my dreams :-p
| amluto wrote:
| This is consistent with the misguided structure in the RFC.
| The first three fields (the time fields) are multibyte
| integers. The remainder is just bytes. The dashes in the
| textual representation are just there to confuse you.
| throwaway9870 wrote:
| I had to write some EFI/GPT code earlier this year and was
| dumbfounded to learn this. This is right up there with the
| mork file format.
| formerly_proven wrote:
| Making mixed-endian _even more haunted_ is quite the
| achievement. I congratulate whoever did this at Microsoft for
| their lasting contribution.
| secondcoming wrote:
| I thought Windows uses GUIDs, not UUIDs
| takeda wrote:
| It looks like GUID is synonymous with UUID, but the name GUID
| also implies that it could contain that bug/feature
| mentioned[1]
|
| [1] https://en.wikipedia.org/wiki/Universally_unique_identifi
| er#...
| JonathonW wrote:
| UUIDs (as GUIDs) in Windows predate RFC 4122, so I don't think
| it's that unreasonable that they're not compliant with the spec
| (since the exact contents of a UUID aren't typically that
| important, but consistency in how you produce them _is_ ).
|
| The GUID implementation in Windows derives from the DCE RPC
| specification [1]. That's where the multibyte integers in the
| RFC 4122 specification come from (they're stated the same way
| in the DCE RPC spec), and it doesn't explicitly specify their
| endianness. It does call them "NDR integers", but NDR integers
| can be little endian or big endian depending on implementation.
| DCE specifies a mechanism by which you'd indicate which you're
| using in an RPC call, but that data's not included in the UUID
| format-- you get whatever byte order the system's decided to
| use, which, for Windows, is little-endian.
|
| [1]
| https://pubs.opengroup.org/onlinepubs/9629399/apdxa.htm#tagc...
| blueflow wrote:
| This mishap lives forth in the UUID stored in a machines DMI
| data, as well in GPT partition tables, which are required when
| EFI is used. It would be really cool if we had some replacement
| for EFI that would not harbour these kind of painful legacies.
| dtech wrote:
| On the other hand, this is an easy to implement conversion,
| while changing such a fundamental thing from EFI sounds
| pretty hard, making it not worth it.
| blueflow wrote:
| You cannot fix this with an conversion because you do not
| know if your UUID is correct or needs conversion. DMI data
| for example has inconsistent endian-ness depending on the
| vendor. So if you have a UUID sticker on a new server, you
| still have two options which UUID the machine will send
| during PXE, either the printed UUID in big-endian encoding
| or in the microsoft mixed-endian encoding.
|
| Use BIOS boot instead of EFI, it has less legacy to
| implement: PE executables, FAT file system, Win64 ABI
| operator-name wrote:
| With the introduction of UUIDv8, my fun script to generate vanity
| uuids[0] can finally be spec comfortant!
|
| [0]:https://github.com/operator-name/vanity-uuid
| davidjfelix wrote:
| I've been staring at the supposedly readable and memorable uuid
| for like 4 minutes without any idea what it says.
| operator-name wrote:
| Yeah, looking back I could have done a bit more than simple
| substitution. Since the sentences don't gave meaning,
| interpreting between 1 as i vs l is especially difficult.
|
| 5eedbed5-f05e-b055-ada0-d15ab11171e5
|
| seedbeds-fose-boss-adao-disabilities
|
| "Memorable" was definitely tongue in cheek, as was the spec
| bending.
| gigatexal wrote:
| Uuidv7 looks really interesting. I like that the article mentions
| all the other projects that attempted to fix uuid issues for
| different use cases.
| stevesimmons wrote:
| This had appeared a few times before at earlier stages of the
| draft process...
|
| https://news.ycombinator.com/item?id=28088213 [244 comments]
|
| Personally, I really like UUIDv7, except I transform the UUID to
| a 25-character string which has all the same properties except it
| doesn't look like a UUID. The last thing I want is my index of
| time-sortable UUIDs getting contaminated with with some UUIDv4
| fully random ones. Since UUIDs may be generated and persisted in
| a distributed manner, it's a simple way to at least spot this.
| ikornaselur wrote:
| The UUID version is in the actual ID, so you would be able to
| spot the version as it isn't fully random.
|
| Although you light mean just visually you'd spot the difference
| much easier!
| theptip wrote:
| Can't you just check the "version" bits and reject if it's not
| 4/7? Or are you worried about someone generating a completely
| random (non-compliant) set of bits that happens to parse as a
| v4/7
| duxup wrote:
| I'm always amazed how much work goes into these.
|
| Meanwhile 99% of the time i just call a function when "I just
| need something unique here", -calls function-, and it works!
|
| Thanks everyone!
| GiorgioG wrote:
| I used to be a big proponent of using UUIDs for database PKs but
| I've found them inherently difficult to work with. It's much
| easier to remember/recognize an integer based PK when
| troubleshooting a data problem.
|
| This isn't to say you shouldn't use UUIDs at all, but I much
| prefer to use an "ExternalId" column of UUID type if you don't
| want to expose your integer based PKs externally.
| staticassertion wrote:
| I feel like `serial` has got to be pretty fast for writes and
| gives you some nice properties like sorted ordering based on
| insertion time, a smaller key, and a key that can be compressed
| far better than a uuid.
|
| I get the idea of a ULID/UUID7 encoding a timestamp so you get
| sorted order, but I wonder at what scale that beats just using
| serial.
| stormbrew wrote:
| At any scale where you have to have multiple writers
| generating IDs or need to merge results from multiple
| sources, basically. Synchronizing a serial increment across
| hosts and across time is a pain.
|
| Obviously you can composite a timestamp to an incrementing
| id, but then the serial part of it is kind of useless for
| ordering, so you may as well use a random number and avoid
| needing to synchronize at all. And then you've just
| reinvented a time ordered uuid (but to be fair, without the
| endian compatibility bs mentioned elsewhere).
| staticassertion wrote:
| I'm wondering what that performance actually looks like
| though. The closest benchmark I have is a single postgres
| instance with a uuid pkey and a bigint column,
| transactionally incrementing the integer in the column.
|
| I mean I guess mathing it out, updating an atomic integer
| is ~2-100ns, depending on contention. If you need to
| coordinate the writes you have anywhere from ~250ms-10ms.
|
| We can basically throw away the increment at that point
| since the network is hundreds/thousands of times slower.
|
| So at 10ms, that's 100 op/s. At 250ms more like 40kops/s.
|
| That ignores the fact that your db can perform those writes
| concurrently and then batch the writes off to the other
| database, so long as it doesn't pretend that they're all
| committed at once. psql HOT updates would presumably be a
| thing here idk.
|
| If I had to guess, I'd lean towards the "40kop/s" being
| closer than the "100op/s" but idk! I wish we had benchmarks
| but I can't find anything :\
| tpetry wrote:
| That's in my experience also the best approach. I wrote an
| article a few days ago about the exact thing: An integer auto
| incrementing PK with an UUID you use externally:
|
| https://sqlfordevs.io/uuid-prevent-enumeration-attack
| tylerscott wrote:
| Just seconding this as a sane way to use UUIDs IME. Basically
| the sequential integer PK is the "internal ID". IIRC in
| SQLite regardless of type or existence of PK there is a
| private sequential integer. Super handy pattern to use.
| hn_throwaway_99 wrote:
| That is how I used to do (and currently still do) DB design,
| but honestly I think if DBs start supporting UUID v7s well
| that I would use that as the sole primary DB key as well as
| the external ID:
|
| 1. They are still sorted in increasing timestamp order (at
| millisecond granularity), so they should have good DB index
| characteristics.
|
| 2. At the same time, they contain 62 bits of randomness,
| would should pretty much eliminate IDOR attacks if there is a
| bug elsewhere that isn't doing proper access checks. Not good
| enough for secure tokens, but just good defense against
| access permission check bugs.
|
| That is, you should basically get the best of both worlds:
| ordered keys with enough randomness to make ID-increment
| attacks infeasible.
| RobertRoberts wrote:
| Yep, I want UUID v7, because right now I am using ULID and
| it's fantastic, but I'd like more official and wide support
| as well.
| jeremyjh wrote:
| I like the textual representation of ULIDs as well
| though; I wish they'd just adopted ULID as v7. At least
| it is binary compatible with existing UUID types.
| RobertRoberts wrote:
| Is there not an equivalent text representation for
| UUIDv7?
| jeremyjh wrote:
| Sure you could encode it in Crockford's base-32 but if it
| isn't part of the standard then tools won't implement it
| natively, so you couldn't copy a key from a url and look
| it up in postgres without running it through a conversion
| function, for example.
| kortex wrote:
| > It's much easier to remember/recognize an integer based PK
| when troubleshooting a data problem.
|
| How often are you actually relying on memory of an ID to
| troubleshoot a problem? I mean sure, if you are scanning
| visually, it's good to recognize the same ID over and over
| again, but my ability to do so caps around 4-6 characters. So I
| just look at the last 4 chars regardless when fast scanning.
|
| I use copy-paste for any time I need to transport IDs between
| contexts (that isn't just scripted, which is best). Having a
| copy-paste stack (Alfred, Raycast and others have this feature)
| is a huge game changer here.
| marcos100 wrote:
| I can't see a case where an UUID PK is better than an INT (or
| BIGINT). Why would you do that?
| __s wrote:
| If the id appears in a url, you may not want people to guess
| ids. The information leak exists even if you do
| authentication: maybe you don't want someone to be able to
| guess how many records there are, or how quickly records are
| being generated
| tylerscott wrote:
| If for no other reason than simplified debugging I find there
| is value. Maybe I'm old but if you have more than one UUID
| involved in the debugging I'm more likely to trip up than
| just integers.
| lkrubner wrote:
| There were two problems.
|
| One problem was the style that started around 2004, and was
| very popular with Ruby on Rails and WordPress, and then
| Syfmony and Django, where you expose the PK in the URL. If
| your integer starts with 1 and then increments, you may not
| get to a billion, and you'll never get to a trillion. So it
| became ridiculously easy to for outsiders to scan your site:
|
| http://www.example.com/1
|
| http://www.example.com/2
|
| http://www.example.com/3
|
| ...
|
| http://www.example.com/10000000000
|
| That was one problem. Using UUIDs for PKs means outsiders
| can't simply scan your site.
|
| The other problem was that over the years, everyone ran into
| the problem of moving a database, or needing to combine
| multiple databases, in which case having PKs the start with 1
| and then increment, a collision of the PKs, from different
| databases, is 100% guaranteed. This often happens when
| combining WordPress sites, for instance. If you use UUIDs as
| your PK, then such collisions become unlikely.
| madsbuch wrote:
| Some applications have the need to create IDs in a
| distributed manner, eg. Og clients, and use that identity
| _before_ the database returns it. These systems benefit from
| randomly generated IDs.
|
| You could potentially hist use a random number between 0 and
| 2^128-1 and still use ints, though I haven't seen that in
| action, usually pk with ints are centrally generated and
| consequitive.
| chrismorgan wrote:
| I've been working on a robust scheme for encrypted sequential
| IDs, which is done, including library implementations in Rust,
| JavaScript and Python, pending just a smidgeon more writing
| about it and reviewing a decision on naming. You store an
| integer in the database, then encrypt it with a real block
| cipher, and stringify with Base58. I have three modes: one for
| 32-bit IDs, using Speck32/64 and producing 4-6 character IDs;
| one for 64-bit IDs, using Speck64/128 and producing 8-11
| character IDs; and one hybrid, using the 32-bit mode for IDs
| below 232 and the 64-bit mode above that, providing both a
| forwards-compatibility measure and a way of producing short IDs
| as long as possible. Contact me (see my profile) if you're
| interested, or I'll probably publish it in another day or two.
| Trouble is that I've been getting distracted with other related
| concepts, like optimally-short encoding by using encryption
| domains [0, 581), [581, 582), ..., [5810, 264) (this is format-
| preserving encryption; the main reputable and practical choices
| I've found are Hasty Pudding, which I've just about finished
| implementing but would like test vectors for but they're on a
| dead FTP site, and NIST's FF1 and FF3, which are patent-
| encumbered), and ways of avoiding undesirable patterns (curse
| words and such) by skipping integers from the database's ID
| sequence if they encode to what you don't want, and check
| characters with the Damm algorithm. If I didn't keep getting
| distracted with these things, I'd have published a couple of
| weeks ago.
|
| (I am not aware of any open-source library embodying a scheme
| like what I propose--all that I've found have either reduced
| scope or badly broken encryption; https://github.com/yi-
| jiayu/presents encrypts soundly, but doesn't stringify; Hashids
| is broken almost beyond belief and should not be considered
| encryption; Optimus uses an extremely weak encryption.)
|
| UUIDs are crazy overkill in any situation where you can have
| centralised ID allocation. Fully decentralised? Sure, 128 bits
| of randomness or mixed clock and randomness or similar, knock
| yourself out. But got a master database? Nah, you're just
| generating unreasonably long values that take up unnecessary
| space and make for messy URLs and such.
| mappu wrote:
| I've done something similar to obfuscate private DB IDs in a
| large existing application - just ensure they're all
| Skip32-encoded in all query parameters with an app-wide
| secret.
|
| It works well but you have to be very disciplined to catch
| every case individually. Using GUID PKs from the start just
| removes this entire category of problem.
| cameronh90 wrote:
| UUIDs have a few distinct advantages: you'll never run out, you
| don't need a roundtrip to find out what they are after saving
| them, they often make a good partitioning key and it makes
| things easier if you ever need to combine multiple data sources
| together in migration and recovery type scenarios. I also quite
| like how they're unique across all data sources and tables, so
| if you just encounter a random contextless UUID in the wild,
| for example in a support ticket, you can probably still find
| what it refers to.
|
| They are quite unwieldy though. There are a few compact
| representations you can use in URLs which make it a bit less
| ugly, but they can make your database and logs quite bloated,
| in particular if you've got a large number of small records.
| CharlesW wrote:
| > _There are a few compact representations you can use in
| URLs which make it a bit less ugly..._
|
| Any thoughts on where to find best-practices guidance? I need
| to create an external ID scheme for several million items.
| hashids (hashids.org) seems interesting, but I have anxiety
| about choosing a solution with weaknesses that I can't
| identify given my current level of experience in regards to
| this.
| munawwar wrote:
| I took a shot at the math behind this at
| https://www.codepasta.com/databases/2020/09/10/shorter-
| uniqu...
|
| Using the equation listed in the article I couldn't
| generate a collision so far. Yet, I still check (in code)
| for id collision, and pick new id, just to be 100% sure.
| deepsun wrote:
| The best strategy is to have integer primary keys for internal
| purposes, and some form of uuid for external (think
| permalinks).
| zbuf wrote:
| "Unique IDs" _can_ be super really easy to work with if they're
| not so baffling complicated.
|
| A random string generated using quality randomness can be
| adjusted to length to suit the quantity of data (negligible
| probability of a collision) which in most cases is very short.
|
| It's easy to increase the length as you get more data.
|
| They are visually very different for each item of data.
|
| They're evenly spread which means they hash/index well.
|
| You can tune a subset of characters if you want to decrease
| ambiguity eg. when exchanged by voice (no zero vs. letter O,
| upper/lower case etc.)
|
| And a final bonus, when working with user input only a a short
| prefix is needed to uniquely identify an item (in contrast, it
| seems like UUIDs deliberately share a common prefix)
|
| I'm very happy to concede I must be missing something here, and
| would be interested to know. But the above approach has served
| me well in a range of uses.
|
| I can see how UUIDs work, and perhaps "looks like a UUID" is a
| useful feature. But reading the URL above and a bit of
| Wikipedia doesn't give me much to go on as to _why_ any of this
| is happening, and why the hyphens aim to retain meaning to what
| is ostensibly a 'unique' number.
| jxcole wrote:
| We used almost this exact scheme for app id indices and the
| curious problem we had to design against was inadvertent
| profanity. At some point we decided to just never use vowels
| to avoid ever having a complaint about 12f*ck if in the URL
| jasonwatkinspdx wrote:
| Another approach is to use something like EFF's dice words
| lists. One of the smaller lists in particular is
| interesting as it's 6^4 words, filtered for profanity, and
| where all words have both a unique 3 letter prefix and an
| edit distance of 3. That makes them robust for the use case
| of someone reading out the phrase to someone typing or
| such.
|
| Never using vowells is a smart idea I wish I'd used in the
| past. Previously when I've needed something like this I've
| used other dictionary lists vs EFF's, and those were not
| curated sufficiently to avoid some really unfortunate
| combinations.
| manigandham wrote:
| Use integer IDs and a library like Hashids for friendly
| alphanumeric representations: https://hashids.org/
|
| This particular implementation is available in dozens of
| languages.
| jasonwatkinspdx wrote:
| Using purely random ids in your database destroys locality.
| They mention this in the introduction:
|
| > Non-time-ordered UUID versions such as UUIDv4 have poor
| database index locality. Meaning new values created in
| succession are not close to each other in the index and thus
| require inserts to be performed at random locations. The
| negative performance effects of which on common structures
| used for this (B-tree and its variants) can be dramatic.
|
| The V7 ids work similarly to what you like, as they're just a
| unix timestamp and 74 bits of pseudorandom data (they present
| several different schemes you could use to generate this
| randomness, but the basic birthday bound says we'd need to be
| above 100 billion id's generated in a single millisecond to
| worry about collisons. Obviously most systems are nowhere
| near that territory.
|
| So using these id's gives you the practical advantages of
| random uinique ids, but with the performance of autoincrement
| ids.
| zbuf wrote:
| Thanks for drawing my attention to that, it's the useful
| answer I was looking for; my use cases haven't been bound
| by write performance in this manner. However, I'd still be
| considering carefully before making use of these UUID
| schemes.
| jasonwatkinspdx wrote:
| If you want something like this, and need a "I just want
| it to work, require no central coordination, and to have
| vanishingly small probability of collision" then just use
| the same concepts but wider than the 128 bit footprint
| limit of this scheme. This limit makes sense for the RFC
| in the post, as there backwards compatibility is an
| explicit goal. But if you used say a 64 bit nanosecond
| counter (to preserve best case precision on a single
| machine) along with 128+ bits of random data and you'll
| need to worry more about gamma ray bursts than
| collisions.
| jandrewrogers wrote:
| 100 billion UUIDs per millisecond is the 50% collision
| probability threshold. Achieving an acceptable collision
| probability for most applications would limit the UUID
| generation rate to more like thousands of UUIDs per
| millisecond.
|
| Even if one was not generating millions of UUIDs per second
| on average, the risk of spiky temporal distributions when
| generating UUIDs would still need to be considered.
| jasonwatkinspdx wrote:
| I'm aware of what the bound I quoted is, and am just too
| lazy to type the refinement into Wolfram for a discussion
| like this. But just knowing the value off the top of my
| head for 64 bits is 200k for 1e-9 probability of
| collision, I'm pretty happy with 72 bits.
|
| And although this standard obviously wants to stick
| within the existing UUID footprint, if you were say doing
| some IoT software that would run on billions of nodes
| simultaneously, just add another 32/64/whatever bits of
| random data and deal with the minor annoyance of longer
| ids and lack of UUID RFC compatibility. But even then,
| you can truncate and reformat these these to v4 UUIDs
| trivially without meaningfully impacting the collision
| resistance, for unsorted external ids in systems that
| need the compatability.
|
| The actual big risk with this sort of scheme is vm
| initialization. You need to be sure the CSPRNG you're
| using is initialized, which can be slightly tricky in
| cloud environments with configuration/control layer stuff
| that's racy. Mess this up and two nodes hydrated from the
| same snapshot may overlap in sequence as they start
| generating ids, and of course the time component cannot
| be trusted to save you in this instance.
| manigandham wrote:
| > _" They're evenly spread which means they hash/index
| well."_
|
| What do you mean by this? Why would you hash it further? Hash
| distribution is primarily down to the hashing algorithm, not
| the input data.
|
| Also indexes are better with somewhat ordered and smaller
| data. A 64-bit int sequential counter is much faster and half
| the size, and compatible everywhere without the annoyances of
| a UUID.
| jwilk wrote:
| Old-school formatting of the same document:
|
| https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...
| [deleted]
| Zamicol wrote:
| Why isn't there an option with a strong cryptographic hash like
| SHA-256?
| staticassertion wrote:
| It would be slow and doesn't really serve a purpose since you
| either have uuids that are totally random or uuids that need to
| preserve their structure.
| jwilk wrote:
| Or UUIDs derived from other identifiers:
|
| https://datatracker.ietf.org/doc/html/rfc4122#section-4.3
| kstenerud wrote:
| It's a shame they didn't finally drop big endian. Little endian
| won out and our newer protocols should reflect that for
| efficiency's sake.
| wyager wrote:
| I guarantee you that byte order swapping is not the limiting
| factor on the efficiency of any system involving UUIDs on the
| wire.
| amluto wrote:
| No, they should have tried to find a way to drop _little_
| endian. A cycle or two when generating UUIDs is irrelevant, and
| almost all UUID implementations, and most of the RFC 4122 text,
| are big-endian.
| zuzun wrote:
| It's necessary to keep them sortable at byte level.
| finnh wrote:
| But network byte order is big endian, so I can see arguments
| both ways (as it were)
| chrismorgan wrote:
| Big-endianness is mandatory for opaque bytewise sorting (or
| lexicographic ordering in string form), which is a very
| desirable property of UUIDv6 and UUIDv7.
| ripe wrote:
| Side note: I love the HTML format of these IETF RFCs, as in TFA.
| Over the decades, I was used to seeing the old text format (which
| I like), but this one is particularly easy on the eye, especially
| on my Android phone.
| [deleted]
| hoten wrote:
| Only mobile issue I see is there is a really long link that
| overflows the intended document width, which introduces an
| horizontal scrollbar. That makes scrolling a bit finicky on
| mobile.
| sureglymop wrote:
| I love the design and would like to copy it for my blog!
| Especially the code blocks with the nice 'figure' description
| below them.
| bearjaws wrote:
| UUIDv6 coming in just about 10 years late. Guess its something.
| [deleted]
___________________________________________________________________
(page generated 2022-06-12 23:00 UTC)