[HN Gopher] New UUID Formats - IETF Draft
___________________________________________________________________
New UUID Formats - IETF Draft
Author : anuragsoni
Score : 363 points
Date : 2021-08-06 15:09 UTC (7 hours ago)
(HTM) web link (datatracker.ietf.org)
(TXT) w3m dump (datatracker.ietf.org)
| atonse wrote:
| TLDR they present 3 new versions that each have their own trade-
| offs, but are also taking into account being able to use these as
| DB primary keys, which is really great.
|
| Another thing I was quite curious about (looks like you can use
| them in existing DB columns etc):
|
| > The UUID length of 16 octets (128 bits) remains unchanged. The
| textual representation of a UUID consisting of 36 hexadecimal and
| dash characters in the format 8-4-4-4-12 remains unchanged for
| human readability. In addition the position of both the Version
| and Variant bits remain unchanged in the layout.
| lytedev wrote:
| Doesn't the textual representation not matter for using them in
| DB columns? The DB just uses the actual bytes, no?
| dnautics wrote:
| depends on the DB. And it depends on your framework. Last I
| checked Django from about 5 years ago? still shipped uuids as
| text fields to the database.
| Macha wrote:
| DB supports from "None" (SQLite) to "We'll give you some
| functions to work on text/ints that are supposed to be
| UUIDs" (MySQL) to "It's an actual native data type"
| (Postgres) among the 3 most common databases used by Django
| users, so it's hard to blame Django for going with a lowest
| common denominator approach.
| dnautics wrote:
| I have since switched to Elixir/Ecto, and it just "does
| the right thing" in all of the above databases. Also, I
| think SQLite is capabale of supporting uuid with one of
| the plugins that you can add into the amalgamation with a
| compile switch, in either case ideally the framework
| would support it transparently.
| Macha wrote:
| SQL is a textual format. Your DB driver may expect a UUID to
| be provided them in the standard string format. If your
| language lacks a standard UUID type, the drivers may expect
| you to provide UUIDs as strings even when using parameter
| binding.
|
| e.g. MySQL accepts only "no dashes" and the standard textual
| format:
| https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-
| functi...
|
| Postgres is a bit more flexible, but still expects hyphens to
| be after multiples of four characters if present:
| https://www.postgresql.org/docs/9.1/datatype-uuid.html
| lytedev wrote:
| This is a good point. Thank you!
| gregoriol wrote:
| The important part is the number of bits used for storage;
| the textual version is only for humans
| weird-eye-issue wrote:
| Application level code validates UUIDs based on a specific
| format not to mention what the specific database might expect
| depending on the data type
| lytedev wrote:
| This sounds like a mistake? When/why should the application
| validate the format of a UUID?
| cratermoon wrote:
| Why shouldn't they?
| athenot wrote:
| For human readability, I've always prefered encoding UUIDs in
| base 64 and swapping a couple characters. If I'm going to have
| some ugly identifier in a URL, I might as well make it more
| concise.
|
| I think youtube does something similar with their video IDs.
| jjice wrote:
| I'm a bit confused as to what the use case for UUIDs are compare
| to incrementing integer IDs. I assume there are some big upsides
| (aside from the primary key issue which seems to have quite a few
| solutions at this point), but I'm just not aware of them.
| tomschlick wrote:
| Because they are unique on generation, you dont need to talk to
| the database before generating one. This is huge for
| distributed systems, or systems that have offline components
| (think mobile apps that sync). The ids can be generated client
| side and passed up to be stored in the database without worry.
| tschellenbach wrote:
| incrementing ids requires a centralized server of some sort to
| hand out IDs when you get to a certain scale. you don't have
| that issue with UUIDs. for small apps just incrementing ids
| works fine though
|
| this approach also allows you to generate ids client side.
| which if you're working with a local db on ios/android etc this
| is very convenient
| dangerlibrary wrote:
| From the "Introduction" section of the linked document:
|
| "The motivation for using UUIDs as database keys stems
| primarily from the fact that applications are increasingly
| distributed in nature. Simplistic "auto increment" schemes with
| integers in sequence do not work well in a distributed system
| since the effort required to synchronize such numbers across a
| network can easily become a burden. The fact that UUIDs can be
| used to create unique and reasonably short values in
| distributed systems without requiring synchronization makes
| them a good candidate for use as a database key in such
| environments. "
| lolinder wrote:
| One big one is distributed computing: for example, an app that
| can create To Do items offline and periodically syncs with a
| server.
|
| If you have your To Do items identified by an incrementing
| integer, you have to come up with some _other_ unique
| identifier to use for To Do items that have yet to be synced
| with the server 's database (since you can't know which ID to
| use next).
|
| On the other hand, if you use UUIDs, your app can just create a
| new UUID for the To Do item, which will continue to be its ID
| for subsequent updates.
| marcus_holmes wrote:
| The huge one for me is that if I make a mistake in code or
| admin and accidentally run "update users set permission =
| 'root' where id=$1" using an account id instead of a user id,
| it does nothing. If I'd used integers I'd have given root
| permission to a random user.
| medstrom wrote:
| So, it's like creating a special data type for each "counter-
| style" identifier. For example suppose a StudentsId which is
| not an integer, it's a StudentsInteger, so it cannot be mixed
| up with UsersId. Admittedly that seems more clunky than just
| using uuids.
| marcus_holmes wrote:
| Yeah you could do that if you _really_ want to use
| integers. But I 'm not sure I'd trust the database layer to
| not just silently convert the type since they're both based
| on integers.
| spyspy wrote:
| One non-performance reason is that incrementing IDs inherently
| leak information about your application and/or your company. If
| you create an account somewhere and you see that your user ID
| is 452 you now know how many users the company has. In the
| company's eyes they'd probably prefer to keep that obscured.
| rjsw wrote:
| Also known as the German tank problem [1].
|
| [1] https://en.wikipedia.org/wiki/German_tank_problem
| stavros wrote:
| Not exactly, as seeing a fresh ID lets you pretty
| accurately tell how many items there are.
| thamer wrote:
| One of the most important is that you can't iterate over all
| the UUIDs. How many data leaks have come from people
| incrementing an ID in a URL?
|
| This is so common, I've experienced this myself as a student
| with a summer job: the company I was working for used an online
| service for their payroll, I accessed my pay slip one day and
| changed ?employee=123 to 124 in the URL out of curiosity and it
| loaded the data for a different person in a different company.
| They fixed it incredibly quickly once I told them about it :)
| alex_duf wrote:
| Incrementing an ID requires complete knowledge in a single
| place, which at scale becomes a challenge. This would typically
| be done in the database, but doesn't bode well with distributed
| systems as multiple servers might thing "5" is following "4"
| leading to multiple rows identified by the same ID => this is a
| collision.
|
| So UUIDs are great because multiple servers can generate
| plethora of them without having to worry about collisions.
|
| This is all explained in the first few pages of the document,
| although not in simple terms.
|
| Also I haven't spotted anything about UUIDv6 trying to avoid
| collisions, but I haven't read very far.
| bjt wrote:
| The last 48 bits should do it.
|
| " node: 48-bit pseudo-random number used as a spatially
| unique identifier Occupies bits 80 through 127 (octets
| 10-15)"
| remus wrote:
| As well as the other use cases mentioned, they're harder to
| guess than incrementing integer IDs so are a useful defense in
| depth to help stop people guessing URLs for resources.
| bob1029 wrote:
| v7 sounds compelling for keying entities in our product. We use
| v4 right now.
|
| What I am trying to determine is if 62 bits of entropy combined
| with a timestamp gives me better or worse collision resistance as
| with 122 bits in a purely-random format. Most of our keys only
| live within the scope of a single computer, but ~5% of them might
| be generated/shared between other systems.
|
| Being able to order our keys by time would be really nice, and I
| like that they would compress/index better.
|
| Maybe I could do a hybrid between V4 and V7 keys depending on if
| the type would be shared with external parties. There are many
| types that only ever scope to a single box.
|
| I probably couldn't play with the idea of adding a "machine id"
| because the space of all possible machines is difficult to
| anticipate/control right now.
|
| I think I just talked myself into sticking with v4.
| NyxWulf wrote:
| v7 allows to you trade time precision for randomness. Adding
| time precision allows the surface area for conflicts to
| decrease, so the loss of random bits to higher time precision
| is worth it if you are generating a lot of ids. Unless it is a
| security application, I would use v7 for most things.
| chociej wrote:
| Trying to wrap my head around how this works when
| distributed. If v7 is used with a high time precision, it's
| important that each node in the distributed system have a
| very tightly synchronized time, right?
|
| If so, with v7, I'd feel the need to find the right balance
| of time precision, time synchronization, node identification,
| and randomness. Just some thoughts out loud.
| jffry wrote:
| Probability wise, nearly doubling your entropy will
| significantly reduce the risk of a collision in your generated
| IDs, but there's lots of other considerations for you to make.
| But for the raw odds perspective, from
| https://en.wikipedia.org/wiki/Birthday_attack we can get a
| rough idea of collision chance. You can plug this into Wolfram
| Alpha [1][2] to play around with it.
|
| For a one-in-a-million chance of having at least one collision,
| you'd need to generate 3.04e6 random 62-bit values or 3.26e15
| random 122-bit values. For a one-in-a-thousand chance, those
| numbers would be 9.6e7 or 1.03e17 respectively.
|
| In other words: If you're sustained allocating v7 IDs (with no
| sub-second portion of the timestamp) at a rate of 3.04 million
| per second, there's a 1-in-a-million chance each second of
| having at least one collision. Taking 1 - 0.999999^86400, that
| gives an 8.3% chance of at least one collision after a day,
| 45.4% after a week, and 92.5% after a month.
|
| If you instead were allocating v4 IDs at the same rate for a
| month, you'd have allocated (3.04e6)x86400x30 IDs, and you'd
| have a probability of 5.84e-12 of at least one collision among
| all of those IDs.
|
| [1] Collision probability given number of draws:
| https://www.wolframalpha.com/input/?i=n%3D3.04e6%2C+H%3D2%5E...
|
| [2] Min number of draws to get a given collision probability:
| https://www.wolframalpha.com/input/?i=p%3D0.000001%2C+H%3D2%...
|
| edit: If you are storing milliseconds in subsec_a and
| generating IDs at the same rate, you'd have 3040 ids per
| millisecond, which have a probability 1.00164e-12 of a
| collision in any given millisecond. Taking
| 1-(1-1.00164e-12)^(86400*1000) gives you a probability of
| 0.0087% of a collision in a day, 0.061% in a week, or 0.26% in
| a month. So as long as your event generation is relatively
| constant across that second, you will indeed have a much lower
| collision chance if you are using millisecond subdivisions. If
| you have even higher precision time or are using a sequence
| counter, that can be pushed down even further.
| [deleted]
| LukeShu wrote:
| As a quick reference for readers:
|
| The UUID specs use the terms "variant" and "version" a little
| funny; "variant" is essentially the revision (so all modern UUIDs
| have variant=0b10 to specify RFC 4122 UUIDs), and "version" is a
| 4-bit number identifying the sub-type within that variant:
| UUIDv1 time-based UUIDv2 legacy DCE security thing, not
| wideley used UUIDv3 name-based with md5 hashing
| UUIDv4 randomness-based UUIDv5 name-based with sha1
| hashing
|
| This draft registers a few new sub-types:
| UUIDv6 sortable time-based, Gregorian calendar UUIDv7
| sortable time-based, Unix time UUIDv8 sortable time-
| based, custom time
| [deleted]
| paulddraper wrote:
| tl;dr
|
| Use UUIDv4 and get on with your day.
| ygra wrote:
| Well, not if you need database primary keys with sensible
| index locality. Which is what this draft is about.
| phkahler wrote:
| Aren't indexes sorted? I suppose using time as the MSBs
| means new batches are always appended (after sorting just
| the new ones).
| mianos wrote:
| Yes they are. But, consider the common query, for records
| based on an time index on the last day. This index will
| point to a bunch of primary keys and those, being random,
| will be spread across a much wider range of blocks. If
| the primary key was just the time, those items would be
| on the same or close blocks.
| OliverJones wrote:
| If the epoch in this proposal were changed to something closer to
| the present day (2000-01-01 or the UNIX epoch) the format could
| easily recover some bits from the time fields to put in the PRNG-
| generated "node" fields. I wonder why they chose the Gregorian
| epoch?
| pjscott wrote:
| The Gregorian epoch was chosen for UUID v6 in order to make it
| as similar to v1 as possible, for ease of updating existing
| software. There are two more types of UUID proposed in this
| draft RFC: v7 which uses the Unix epoch and allows more time
| precision, and v8 which can use any monotonic timestamp source.
| gopalv wrote:
| > The machineID MUST NOT be an IEEE 802 MAC address.
|
| > MAC addresses pose inherent security risks and MUST not be used
| for node generation.
|
| Interesting concern in the distributed generation pathway.
|
| I've used MAC addresses in the past for absolutely unique
| identifiers, but this is calling out that as a security risk,
| because the time + arp data might be known to predict a future
| UUID from a machine?
| WorldMaker wrote:
| UUIDv1 IDs with MAC addresses have been used for tracking
| people down. There are especially a lot of classic examples
| from older Word documents where v1 UUIDs were often embedded in
| many places inside the documents. Groups used (and sometimes
| still use) the MAC addresses in such Word documents to track
| down specific authors (for FBI investigations on the more
| [extra-]legal side and for "doxxing" and such on the far less
| legal side).
|
| Whether you consider this specific example threat a privacy
| risk versus a security risk here depends on your personal
| threat model, of course. But general consensus at this point is
| to label it a security risk.
| jackpirate wrote:
| That makes sense, but that seems like it's a problem with any
| machine-unique id, not just the MAC address. Is there some
| specific reason MAC addresses are a worse choice than (e.g.)
| CPU serial number?
| WorldMaker wrote:
| This RFC has reason to call out MAC addresses directly as a
| security risk because they were standardized way back in
| UUIDv1. (To generate a standards compliant v1 UUID you are
| still required to use a MAC address, which is why everyone
| suggests you use UUIDv4 today, and exactly why UUIDv6
| exists in the RFC, it's basically exactly UUIDv1 without
| the MAC address.)
|
| The RFC leaves finding a machine-unique ID that isn't a
| security risk as an exercise to the user. (Mentioned
| directly in the v7 description where some bytes are
| optionally allocated for "flake"-like machine IDs.)
| ComputerGuru wrote:
| See ulid for similar prior art. We generate Ulids in code and
| then store them in uuid columns in Postgres to realize the
| compact (binary) size benefits.
|
| https://github.com/ulid/spec
| WorldMaker wrote:
| I put in the work to store ULIDs in SQL Server's interesting
| UUIDv1-based sort order for uniqueidentifier columns. It seems
| to be providing the clustered index benefits I was hoping for
| so far (far less clustered index churn behavior than v4 Random
| UUIDs).
|
| An interesting digression to bring up with respect to this RFC
| because it is interesting to note that none of the proposed v6,
| v7, v8 UUIDs match the SQL Server sort order behavior and would
| still have a lot of clustered index churn unless new sort
| behaviors were allowed to be opted in. That might be something
| that this RFC would help to address by also making sort orders
| a bit more standard, though it does not directly make
| recommendations on that front.
|
| For what it is worth, it is the SQL Server sort order that is
| weird: it sorts the last group of the UUID string form first,
| which in v1 UUIDs corresponded to the MAC Address. Clustering
| by machine probably made sense by whoever decided on that sort
| order, at the time, but even with v1 UUIDs they probably would
| have got better results sorting by timestamp than by machine.
|
| (The ULID timestamp fits exactly into that "MAC address" group
| and it's just a matter of swapping the front 6 bytes to the
| back of the UUID when storing to uniqueidentifier.)
|
| Fun reference on UUID sort orders:
| https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
| ComputerGuru wrote:
| It's not just a question of whether you want to cluster by
| MAC address, it's more of a fundamental disagreement as to
| the endianness of the UUID groups themselves, which is
| probably the biggest difference between a GUID and a UUID in
| the wild. Microsoft has used GUIDs to brand OS intervals for
| a long, long time - long enough that there's a chance they
| came before UUID was a thing, but I haven't dug into the
| history of it.
|
| Unfortunately converting between them without any db
| extensions is a PITA in the absence of native support. I've
| written about it at length here (SQLite in the example, but
| the fundamentals apply regardless):
| https://neosmart.net/blog/2018/converting-a-binary-blob-
| guid...
| WorldMaker wrote:
| That was the question specific to SQL Server storage sort
| order for clustered indexing and most relevant to ULID to
| uniqueidentifier "conversion"/storage, but yes there are
| endianness issues in the sort orders _as well_ , which is
| why I made sure to include the full Raymond Chen blog post
| on many of the most common sort orders for GUIDs because it
| is quite the rabbit hole. (The "update" footnote link at
| the bottom of the article I posted to the Java order is
| especially wild if you fall into the rabbit hole.)
|
| (ETA: Also to answer your indirect question, Microsoft was
| definitely an early adopter of GUID/UUID. Though the
| inventor was Apollo Computer as a part of their complex
| network design:
| https://en.wikipedia.org/wiki/Apollo_Computer)
| ComputerGuru wrote:
| Thanks for clarifying - perhaps another reader will find
| that link for in-db conversion useful. And thanks for the
| interesting information and link to the Apollo Computer;
| I have some fodder for bedtime reading now!
| WorldMaker wrote:
| No problem. Yeah, for what it is worth (getting further
| aside), I gave up on supporting the ULID conversion
| directly in-db for that same reason: Base-32 math (which
| ULID uses for canonical string format) is tough enough in
| SQL before you add in the endian issues. The endian
| issues aren't as big of a deal in my storage sort order
| situation because the MAC address section that I use for
| the timestamp data most critical to the sort order
| doesn't suffer from the endian issue and the remainder of
| a ULID is entirely random and I'm not too concerned if
| the random section doesn't sort identically to the ULID
| sort order. (It would be great if it did, but it's not as
| critical as the more dominant timestamp order.) That
| said, .NET's GUID ToByteArray() handles the endian order
| for me and I'm somewhat confident it's correct so long as
| it is my .NET code doing ULID-to-uniqueidentifier
| conversions and not the scripts I tried to use directly
| in-db.
| mgkimsal wrote:
| What's a bit interesting to me is having read all the
| negativity/criticism around ULID, and now seeing these
| proposals by IETF...
| ComputerGuru wrote:
| Most ulid criticism surrounds the ill-advised (and optional!)
| part of the spec that tries to achieve sub-ms ordering at the
| cost of requiring a singleton or cross-thread atomics to try
| and generate random values for the random portion of the blob
| from a range that would end up providing additional sort
| order bits. The reference implementation in JS doesn't even
| offer this (although that's likely because JS doesn't even
| support multiple execution threads within the same context).
|
| Apart from that little misadventure (last I checked, there
| was a request for comments on a proposal to drop that from
| the Ulid spec altogether) there's really nothing else that's
| not same and straightforward in the spec, and little else to
| complain about.
| roberto wrote:
| They list ULID and 15 other implementations in the document.
| ComputerGuru wrote:
| Ah, so they do. Thanks.
| Waterluvian wrote:
| I find it peculiar that the Introduction section on these drafts
| is always just boilerplate.
| cratermoon wrote:
| I prefer ksuid https://github.com/segmentio/ksuid
| jrochkind1 wrote:
| The IETF draft mentions ksuid in the "prior art consulted"
| section.
|
| Do you mean you prefer ksuid to this new draft? Your comment
| would be more helpful if you said what about it you prefer.
| cratermoon wrote:
| I prefer ksuids because
|
| they have 128 bits of pseudorandom data
|
| Both text and binary representations are lexicographically
| sortable
|
| They don't have the awkward dash ('-') to mess with url
| encoding
| leo_bloom wrote:
| > They don't have the awkward dash ('-') to mess with url
| encoding
|
| Can you explain what you mean by that? I have experienced
| issues with " " as "+" vs. "%20", but dashes... ?
| cratermoon wrote:
| Not in URLs per se, but in HTML URL Encoded Multipart
| Forms, i.e. enctype="multipart/form-data"
|
| ETA: also, when you put a UUID as text into a lot of
| search engines, it treats the dashes as token delimiters.
| jrochkind1 wrote:
| Which of those things are not true of these new UUID
| formats proposed?
|
| Let's see if I can figure it out (correct me if I'm wrong).
|
| I think maybe these new UUIDs proposed ARE lexigraphicaly
| sortable (in timestamp order, i think is the implication)
| in text and binary, these newly proposed UUIDs above? I got
| the impression that was one of the main things they
| provide, compared to older UUID formats, one of the main
| reasons for their proposal? Am I wrong?
|
| I know the new UUIDs look the same as the ones we're used
| to, where they often have dashes (although there's
| certainly no requirement you display them with dashes, you
| can strip the dashes out before putting them in a URL if
| you want, although yeah it's an extra step).
|
| [If we're talking about URL-encoding we're talking about
| putting them in URLs? ksuid's 160 bits insead of UUID's 128
| , making a longer URL, seems like a downside?]
|
| Not sure how many bits of psuedorandom/entropy they have...
| Looks like... UUIDv6 has 48 bits of psuedorandom in
| addition to timestamp, to reduce timestamp collisions.
| UUIDv7 is variable and can let the application decide (am I
| understanding right?), but supports up to 62 bits of
| pseudorandom data. [I guess you prefer a longer ID with
| more bytes of psuedorandom... just to further minimize
| possibility of timestamp collission, or other reasons? I am
| not sure how often existing UUID timestamp collisions
| happen in practice, anyone know? Anyone ever had one?]
| cratermoon wrote:
| > Which of those things are not true of these new UUID
| formats proposed?
|
| Well, the entire UUID is 128, so obviously not 128 bits
| of pseudorandom data.
|
| > What about 128 bits of pseudorandom data is preferable
| to you?
|
| Same as with any hash: collision resistance. That's why
| we don't use SHA-1 for anything secure any more. The dash
| problem arises not in the URL but the multipart/form-data
| encoding, as well as the following:
|
| "when UUIDs are indexed by a search engine, where the
| dashes will likely be interpreted as token delimiters.
| The base62 encoding avoids this pitfall and retains the
| lexicographic ordering properties of the binary
| encoding."
|
| https://segment.com/blog/a-brief-history-of-the-uuid/
| jrochkind1 wrote:
| OK, thanks for explaining!
|
| I have never had or thought of an application where a
| search engine interpreting the dashes as token delimiters
| was a problem. But if you have, and it was a problem, ok,
| that explains your preference I guess!
|
| Curious if anyone reading has run into collisions with
| existing v1 UUIDs, or knows anyone who has, how often.
|
| Pretty sure these new UUIDs are lexigraphically sortable
| (in timestamp order) in their hex representation (with or
| without dashes) as well as binary. Lots of people agree
| this is important, that in fact seems to be one of the
| main motivations of these new UUID formats in the draft
| we're discussing, I think?
| cratermoon wrote:
| Yeah, they've solved one issue with the UUID format by
| giving them k-sortability, but the other issues are still
| there. Ultimately, if you need some kind of UUID-
| compatible representation (eg for Vertica UUID columns),
| then these are good approaches. If you need to solve the
| same problems but don't have a reason to be locked into
| UUIDs, pick something else.
| tfehring wrote:
| Other than the additional pseudorandom bits (which also have
| drawbacks), what's the advantage of ksuid over UUIDv6 as
| defined in the draft?
| cratermoon wrote:
| Both text and binary representations are lexicographically
| sortable
|
| They don't have the awkward dash ('-') to mess with url
| encoding
| shortstuffsushi wrote:
| Could you explain your concern with dash in url? To my
| knowledge, that shouldn't be in conflict with anything else
| like a slash, space, or plus, what sort of issues do you
| see?
| cratermoon wrote:
| No in the URL specifically, but in forms, with the
| enctype="multipart/form-data"
| shortstuffsushi wrote:
| Oh, are you referring to some sort of conflicts with the
| "boundary" separators? I would be interested to hear more
| about this, I haven't personally had encoding issues with
| dashes, or thought to try to encode them (or w/e) to
| avoid any issues
| sudhirj wrote:
| For those interested in time based UUIDs, I've written libraries
| in Ruby and Go to move quickly between them:
|
| https://github.com/sudhirj/uulid.go
| https://github.com/sudhirj/shortuuid.rb
| https://github.com/sudhirj/shortuuid.go
| stavros wrote:
| Those look suspiciously similar to the Python one!
| sudhirj wrote:
| Probably is quite similar, the converting numbers between
| bases is a textbook algo, and the UUID format is a standard.
| Which python one?
| stavros wrote:
| This one: https://pypi.org/project/shortuuid/
| sudhirj wrote:
| Oh, nice. Won't do one with Python, then :D
|
| But yeah, I've found the ability to move IDs between
| alphabets quite useful.
| gfody wrote:
| if you were about to use a uuid as a primary key in a database,
| wouldn't it always be better to instead use a composite key with
| explicit columns for sequence, timestamp, node id, etc.? if you
| really need to accept client side generated values and there's no
| opportunity to issue them a unique node id before hand, then
| explicitly taking their mac address and region seems better than
| stealthily relying on those things being embedded in a uuid -
| also aren't mac addresses considered PII?
| nabla9 wrote:
| The background section gives reasons derived from looking at
| existing implementations.
|
| ---
|
| Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many
| widely distributed database applications and large application
| vendors have sought to solve the problem of creating a better
| time- based, sortable unique identifier for use as a database
| key. This has lead to numerous implementations over the past 10+
| years solving the same problem in slightly different ways.
|
| - Timestamps MUST be k-sortable. That is, values within or close
| to the same timestamp are ordered properly by sorting algorithms.
|
| - Timestamps SHOULD be big-endian with the most-significant bits
| of the time embedded as-is without reordering.
|
| - Timestamps SHOULD utilize millisecond precision and Unix Epoch
| as timestamp source. Although, there is some variation to this
| among implementations depending on the application requirements.
|
| - The ID format SHOULD be Lexicographically sortable while in the
| textual representation.
|
| - IDs MUST ensure proper embedded sequencing to facilitate
| sorting when multiple UUIDs are created during a given timestamp.
|
| - IDs MUST NOT require unique network identifiers as part of
| achieving uniqueness.
|
| - Distributed nodes MUST be able to create collision resistant
| Unique IDs without a consulting a centralized resource.
| RcouF1uZ4gsC wrote:
| > 48-bit pseudo-random number used as a spatially unique
| identifier Occupies bits 80 through 127 (octets 10-15)
|
| Is 48 bits really enough to be a spatially unique identifier.
| Roughly 16 million entities would have a 50% chance of collision.
|
| If you have a spatially unique identifier collision, it seems it
| might be possible for two independent entities to generate the
| same time stamp and counter codes resulting in an overall UUID
| collision.
| ltbarcly3 wrote:
| While possible, with a high precision clock it should be quite
| rare. In any case, UUID's aren't meant to be universally unique
| to the point you will never see the same one for any reason
| across every use case in the galaxy, although many versions do
| allow you to neglect to code to handle collisions for almost
| every practical use case.
|
| If you need 16 million entries to get a 50% chance of
| collision, 100ns time resolution means you have 10^7 timestamps
| per second, so to actually experience a 50% chance of collision
| requires an instantaneous hash rate of over 10^14 per second,
| which I don't think you are likely to ever see in practice
| before these UUID versions are long obsolete.
| riffic wrote:
| Also see RFC 4122:
|
| https://datatracker.ietf.org/doc/html/rfc4122
| infinityplus1 wrote:
| Does anyone have any opinion on how Firebase push keys compare to
| UUIDs? Firebase push keys are said to be unique and are sortable.
| Here's a link to the push key generator:
| https://gist.github.com/mikelehen/3596a30bd69384624c11
| rootusrootus wrote:
| "UUIDv8 SHOULD only be utilized if an implementation cannot
| utilize UUIDv1, UUIDv6, or UUIDv8." I assume that is a typo.
| LukeShu wrote:
| Indeed, I believe that it should read "... or UUIDv7."
| UUIDv1 non-sortable time-based UUIDv6 sortable time-
| based, Gregorian calendar UUIDv7 sortable time-based,
| Unix time UUIDv8 sortable time-based, custom time
|
| That is: The custom-time version should only be used if you
| cannot use one of the standard time versions.
| Wevah wrote:
| There's another typo in the v7 section where it's erroneously
| referred to as v8.
| [deleted]
| submeta wrote:
| OT: Nicely written and formated ascii document. I wonder what
| tools they used to create the headars / page numbers / refs.
| krinchan wrote:
| There are a lot of tools that let you write a source document
| and compile it into something that fits the RFC style guide.
|
| XML and markdown are the most common source document formats.
|
| https://www.rfc-editor.org/pubprocess/tools/
| pawal wrote:
| There are a number of different tools for writing internet-
| drafts. See here: https://tools.ietf.org/tools/
| sswaner wrote:
| I had the same thought: Interesting topic, but I am very
| distracted by how much I love the format of the document.
| Croftengea wrote:
| To put it simply, the new standard aims to address UUID usage as
| primary keys in distributed systems.
|
| But UUIDv7 is described as: Unix timestamp + fractions of second
| + increasing sequence + random number. Now imagine two processes
| start simultaneously and start generating UUIDs right away. IMHO
| chances of the two processes generating exactly the same UUID
| sequence are pretty high unless the implementation is smart
| enough to feed something like process id as a seed to random
| function.
| renonce wrote:
| Well implemented processes usually seed their random generators
| from a random device, such as /dev/urandom in Linux. Such
| devices hand out different values for each request.
| jozvolskyef wrote:
| What are the advantages of sortable UUIDs with embedded
| timestamps over random 128 bits and a created_at column?
| junon wrote:
| FWIW the linked RFC's "Background" section is really easy to
| understand even for people who aren't well versed in dist-sys
| concepts. I suggest giving it a read - it's quite well written.
| bilinguliar wrote:
| This allows you to keep recent records in DB cache, for
| example. Contrary, random keys would constantly invalidate the
| cache unless all your data fits in memory.
| ape4 wrote:
| Isn't that a possible security issue. MACs used to be
| predictable but then became random.
| lazide wrote:
| I'm pretty sure you mean 'stable' not predictable here
| re:MACs?
|
| The issue with using Mac addresses is it essentially leaked
| tracking information on the machine being used to generate
| the uuid - you could associate every ID it ever generated
| back to that same machine, correlate it to things like open
| WiFi AP logs, whatever. Which is pretty scary sometimes.
|
| Many OS's now randomly generate a new MAC on every
| disconnect/connect for this reason, to at least give some
| privacy.
| jozvolskyef wrote:
| I presume you are referring to UUIDv4/1, which has 122 random
| bits. It is a common misconception that this is insecure. In
| reality, the UUIDv4/1 is just a way to encode 122 random
| bits, which is more than enough for most needs. The property
| of being practically unguessable comes from the random
| generator, not from the encoding.
|
| Secondly, whether or not being predictable is a problem
| depends on the use case. The keys used on this forum are
| predictable and it is not an issue.
| BenoitEssiambre wrote:
| On top of reasons given by others, if the data likely to be
| queried together tends to get stored together because of this,
| things will be much more cache efficient and queries
| potentially much faster since for db loads, cache efficiency
| (not just ram but CPU cache), alleviates some of the most
| important performance bottlenecks.
| masklinn wrote:
| Random primary keys play hell on database indexes.
| cratermoon wrote:
| https://en.wikipedia.org/wiki/Partial_sorting
| mrighele wrote:
| The linked RFC talks about this (it refers to UUIDv4, which is
| not much different from 128 random bits):
|
| " First, most of the existing UUID versions such as UUIDv4 have
| poor database index locality. Meaning new values created in
| succession are not close to each other in the index and thus
| require inserts to be performed at random locations. The
| negative performance effects of which on common structures used
| for this (B-tree and its variants) can be dramatic. As such
| newly inserted values SHOULD be time-ordered to address this."
| bob1029 wrote:
| Note that this totally-random key behavior is actually highly
| desirable when using dynamic programming techniques such as
| with the Splay Tree. It makes it statistically impossible for
| the tree to fall into a degenerate state.
| pjscott wrote:
| Splay trees aren't a dynamic programming thing (and aren't
| used in any database storage backends I know of). They are,
| however, conjectured to be no more than a constant factor
| slower than a dynamically optimal binary search tree.
|
| https://en.wikipedia.org/wiki/Optimal_binary_search_tree
| bob1029 wrote:
| > Splay trees aren't a dynamic programming thing
|
| My mistake - I meant to write "dynamic optimality".
|
| As for the "not used in any database storage backends", I
| have personally started experimenting with them due to
| higher-order effects they can provide relative to their
| optimization technique.
|
| The fact that the most recently accessed node is always
| at the root also means that incremental tree updates are
| well-bounded in terms of # of nodes that would need to be
| rewritten to disk. This allows for much more efficient
| usage of storage-friendly things like append-only log
| structures.
| pjscott wrote:
| Today I learned! That sounds fascinating; is there
| anything written down about it somewhere?
| bob1029 wrote:
| I actually don't think there is any specific reference I
| have ever encountered. It was something that just popped
| into my head one day when I was trying to figure out a
| way to write small batches of key-value data into a log.
|
| Once you put 2+2 together on it, its like a lot of
| concepts seem to snap into place all at once. You also
| get incredible locality of access - each batch can
| sometimes fit within a single I/O block depending on
| system loading and data patterns, so traversing the tree
| to _any_ node modified in the previous flush to disk
| would involve a single I /O.
|
| Edit: If you search "splay tree append only log" on
| google, you will find one of my previous HN comments on
| the first page of results:
| https://news.ycombinator.com/item?id=28010167 This tells
| me there isnt much precedent for this idea.
| LukeShu wrote:
| _> UUIDv4, which is not much different from 128 random bits_
|
| To quantify how different: UUIDv4 has 6 fixed bits and 122
| random bits.
| parhamn wrote:
| it depends what "created_at" means, if it means "persisted_at"
| then they're also a bit different because these sorts of UUIDs
| are generated before being persisted (often long before for
| offline-and sync systems).
| dheera wrote:
| Searching records and returning the results time-sorted in O(n)
| time instead of O(n log n) time.
| gumby wrote:
| That is addressed in the introduction -- random access inserts
| (i.e. specularity) can be quite inefficient depending on your
| backing store.
| causasui wrote:
| I use KSUIDs a ton in DynamoDB; it saves me from having to
| create and fill another index to sort items by creation time.
|
| The only case where I don't use them is if I don't want to
| create time of the item to be known to the user.
| jandrewrogers wrote:
| There are two issues not mentioned in the other comments with a
| random 128-bit UUID in real systems, both related to data
| scale:
|
| - They defeat data compression, which is an integral part of
| many database engines these days for both performance and
| scalability reasons.
|
| - For some very large systems, there is a UUID collision tail
| risk that while small is large enough to be plausible.
| neur4lnet wrote:
| Its compression a legitimate concern for primary keys?
| bob1029 wrote:
| If you are getting good compression on your primary keys,
| this may be cause for alarm.
|
| These should be high-entropy items.
| jandrewrogers wrote:
| The sole constraint on a primary key is that it is
| _unique_. Even in distributed systems, deterministically
| unique keys are almost always much more compressible than
| probabilistically unique keys, which is a double win.
|
| Should you desire a high-entropy key, e.g. for hash
| tables or security, this can be trivially derived from
| the low-entropy key while still guaranteeing
| deterministic uniqueness.
|
| Compressible primary keys are superior in every way.
| jandrewrogers wrote:
| Definitely, I've run into this problem many times. UUIDs
| are pretty bulky data types in database terms, usually
| larger than most other types in a typical table on average
| and possibly larger than rest of the row after including
| the index overhead, which creates cache pressure. Logical
| columns in a table are commonly stored physically as a
| vector, of UUIDs in this case, expressly for the purpose of
| compressing the representation. This saves both disk cache
| and CPU cache. For queries, the benefit of scanning
| compressed vectors is that it reduces the average number of
| page faults per query, which is one of the major
| bottlenecks in databases.
|
| Also, some data models (think graphs) tend to be not much
| more than giant collections of primary keys. Using the
| smallest primary key data type that will satisfy the
| requirements of the data model is a standard performance
| optimization in databases. It is not uncommon for the UUID
| that the user sees to be derived from a stored primary key
| that is a 32-bit or 64-bit integer.
| NyxWulf wrote:
| Uuids are bulky if you store them in text form, but in
| binary form they are only 128 bits.
|
| The main feature of a uuid is it allows distributed
| generation. 32-bit or 64-bit integers are almost always
| sequential numbers. The sequential nature allows
| efficient page filling and index creation, but the
| contention involved in creating a sequence grows rapidly
| with scale.
|
| So while a 128-bit uuid is larger than a 64 bit integer,
| this version allows for the bulk of the benefits of
| sequential integers while reducing the biggest drawback
| of contention at the point of creation.
| jandrewrogers wrote:
| I was assuming binary format. 128-bits is a pretty heavy
| data type in many data models with measurable impact on
| performance versus using something smaller.
|
| You also do not need 128-bits to decentralize unique key
| generation, even though it is quite reasonable if you
| design your keys well. Many do it with 64-bit unique keys
| in massive scale systems.
|
| A subtle point that you may be overlooking is that while
| large-scale distributed databases, including all the ones
| I work on, export globally unique 128-bit keys, in most
| systems I've worked with they are internally represented
| and stored as 64-bits or less even if the key space is
| much larger than 64 bits. There are many different
| techniques for doing key space elision and inference that
| are used inside distributed databases to save space. The
| 128-bit value is only materialized when sending it over
| the wire to some external client system. But you don't
| need to store it.
|
| Literally storing a primary key in a distributed system
| as a 128-bit value is all downside with few benefits. For
| small systems the performance and scaling cost may not
| matter that much but in very large systems it matters a
| lot. It can -- literally! -- cost you millions of dollars
| per year.
| krinchan wrote:
| Twitter snowflakes are unsigned 64-bit ids designed to be
| created in a distributed fashion.
| cogman10 wrote:
| And, to be clear, those benefits are the placement of new
| records in a DB.
|
| If a UUID is completely random it means it can be
| inserted anywhere which can require the DB to reshuffle
| records and pages in order to make room for the new
| record.
|
| Having a sequential element to the UUID makes it a lot
| easier to have an index where each record is inserted at
| the end. Which, like you said, makes page usage more
| efficient and decreases the amount of work a DB has to do
| on insertion.
|
| All this is a compounded problem if you have a DB with
| frequent writes, a lot of indexes, or both.
| manigandham wrote:
| You can generate any size number in a distributed
| fashion. The only difference is that 128-bits gives you
| enough scale that it's practically impossible to have
| collisions when randomly generating.
|
| Unless you need to be completely disconnected, a little
| coordination can drastically improve things. In past
| companies, I used a simple counter with 64-bit integers
| and each distributed process would increment a billion-
| number range to use for IDs. Fast, efficient, compatible
| with everything, naturally ordered, and guaranteed to
| never have a collision.
| [deleted]
| munhitsu wrote:
| CRDT
| orf wrote:
| Imagine you're not using a storage platform that supports that
| access pattern. Think S3, DynamoDB or even a directory of
| files. Being able to encode your sort key in your primary key
| is a pretty nice property.
| datavirtue wrote:
| Hmmm...Now I'm wondering why I never bothered to prepend a
| timestamp to a GUID.
| orf wrote:
| Because then it wouldn't be a UUID in a format that other
| systems might expect or work more efficiently with.
| weird-eye-issue wrote:
| Not having a created_at column and the index bloat that goes
| with it?
| notamy wrote:
| It works better for databases that can only sort on the primary
| key (iirc Cassandra is one of these)
| scrollaway wrote:
| One data point instead of two. Which may or may not have
| advantages depending on your situation.
|
| At any rate it's definitely simpler, and you don't have to
| forego the created_at column either.
| lytedev wrote:
| My understanding is that if the UUID is not sortable, inserting
| it into an index can be a performance hit. So if you have a
| write-heavy table with non-sorted primary keys, you will have a
| lower ceiling (higher overhead).
| srcreigh wrote:
| Know about primary vs secondary indexes?
|
| Generally 1 index contains the data, all the other indexes
| have pointers to that index in their leaf nodes. Usually the
| PK index contains the data
|
| So loading 1000 consecutive rows in a created_at secondary
| index takes two steps: 1) created_at index B-Tree traversal
| (maybe 10 pages from disk, max?) 2) Potentially loading 1000
| randomly sorted pages from disk, dereferencing pointers
| (EXPENSIVE!)
|
| Whereas, if your primary key is sorted by time, loading 1000
| rows is 1) slightly larger primary B-Tree traversal 2) no 2,
| you're done :)
| NyxWulf wrote:
| The lack of sorting is what this rfc is fixing.
| ignoramous wrote:
| > _What are the advantages of sortable UUIDs_
|
| I guess for use-cases like Instagram's: "Generated IDs should
| be sortable by time (so a list of photo IDs, for example, could
| be sorted without fetching more information about the
| photos)... If you use a timestamp as the first component of the
| ID, the IDs remain time-sortable... [but require] more storage
| space (96 bits or higher) to make reasonable uniqueness
| guarantees" [0]
|
| [0] https://archive.is/xiMGZ
| zimpenfish wrote:
| cf also Twitter's Snowflake[1] which is [timestamp, worker
| number, sequence number] to give the same "roughly sortable
| by time" property.
|
| [1] https://blog.twitter.com/engineering/en_us/a/2010/announc
| ing...
| halfmatthalfcat wrote:
| Using the uuid as a PK do to cursor based pagination for one.
| [deleted]
| pjscott wrote:
| A somewhat oversimplified summary of the new UUID formats:
|
| UUID6: a timestamp with a weird epoch and 100 ns precision like
| in UUID1, but in a big-endian order that sorts naturally by time,
| plus some random bits instead of a predictable MAC address.
|
| UUID7: like UUID6, but uses normal Unix timestamps and allows
| more timestamp precision.
|
| UUID8: like UUID7, but relaxes requirements on where the
| timestamp is coming from. Want to use a custom epoch or NTP
| timestamps or something? UUID8 allows it for the sake of
| flexibility and future-proofing, but the downside is that there's
| no standard way to parse the time from one of these -- the time
| source could be anything monotonic.
| paulddraper wrote:
| 8 kinds of UUIDs, because we already had 4 too many.
| pjscott wrote:
| Which of the 5 existing kinds of UUID do you consider the One
| True UUID? No matter which you pick, I guarantee that there's
| _something_ badly wrong with it for at least one common use-
| case that can be improved by switching to either v4 (IMHO the
| only one of the existing types worth using) or to one of the
| new proposed UUID types.
| paulddraper wrote:
| UUID v4
|
| Are you concerned about the 0.0000001% chance of collision
| after generating a 100 trillion UUIDs?
|
| Or are you trying to include metadata in your identifier?
| (Not the worst thing, but it's also not super useful info.)
| pjscott wrote:
| I'm not worried about collisions and I agree that being
| able to put metadata in the UUID is a big meh of a
| feature. The problem with v4 is what happens when people
| try using them as database keys: the random sort order
| can really hurt performance. You might argue that the fix
| for this is to simply _not_ use UUIDs as database keys...
| but so many people are already doing this, and will
| continue to do it, that they should probably be given
| better standard options.
| mdtusz wrote:
| This is a fairly significant issue for mysql, but does
| the same issue exist when using postgres with a UUID pk
| type?
|
| At my last job, this was managed by using two ID's on
| each row - a serial one that was basically only used for
| database optimization purposes, and the "real" UUIDv4 ID.
| It felt gross then, and it feels gross now, but it seemed
| to do the trick for our needs.
| johncolanduoni wrote:
| It doesn't, since Postgres doesn't actively use the
| primary key for clustering.
| paulddraper wrote:
| > random sort order can really hurt performance
|
| I've long thought it was a problem that PostgreSQL
| doesn't have a real clustering key like MySQL or SQL
| Server.
|
| But PostgreSQL users have told me it doesn't really
| matter.
| dtech wrote:
| It's much less important in postgres as in some other
| databases, but it still hurts.
|
| Mainly because indices have to be updated and searched
| all over the place with UUIDs. The locality of the data
| on disk itself is fine because its inserted sequential.
| anarazel wrote:
| > But PostgreSQL users have told me it doesn't really
| matter.
|
| Those users were very wrong, unless they explicitly only
| talked about the case where the uuids are not indexed.
| phkahler wrote:
| >> but the downside is that there's no standard way to parse
| the time from one of these
|
| IMHO that should not be a concern. The goal of UUID is to
| create a _Unique_ Identifier, not to record the time something
| was created. Maybe we should all have quantum random number
| generators, but with that I don 't think 128 bits is enough for
| a UUID. Might be enough for specific applications though.
|
| First rule of database design: Your unique IDs should not have
| a real-world meaning.
| iratewizard wrote:
| > First rule of database design: Your unique IDs should not
| have a real-world meaning.
|
| I see a lot of comments conflating a unique ID with a unique
| column. It's OK for you to have a unique contraint on
| something like an email column. But you wouldn't use the
| email column as the primary key.
| mulmen wrote:
| > First rule of database design: Your unique IDs should not
| have a real-world meaning.
|
| Well, this is an opinion. I would definitely push back on
| making it a rule. Especially the _first_ one.
|
| Natural keys exist, make perfectly good unique identifiers,
| and have inherent meaning.
|
| When using _synthetic_ keys it is difficult (read: requires
| making trade-offs) to guarantee they were created in order
| and /or that they increase monotonically. But if you make
| those trade-offs (or just align expectations) you can still
| associate real world meaning to them.
|
| These new UUIDs standardize more of those trade-offs.
| koolba wrote:
| In theory, theory and practice are the same. In practice
| they're not.
|
| Sortable unique IDs have many performance benefits in real
| world systems.
| wilg wrote:
| If you don't think that's a concern then UUID8 is for you!
| hinkley wrote:
| > UUID6: a timestamp with a weird epoch and 100 ns precision
| like in UUID1, but in a big-endian order that sorts naturally
| by time
|
| So even the IETF hasn't read Things Every Developer Should Know
| About Time.
| [deleted]
| zvrba wrote:
| > with a weird epoch and 100 ns precision
|
| Exactly as used by windows to store file times. And the epoch
| is the start of the modern calendar
| https://en.wikipedia.org/wiki/Gregorian_calendar
|
| Talking about dates before this epoch makes little sense (any
| date before the Gregorian epoch will not resemble the same
| stellar / planet constellation as date after the Gregorian
| epoch).
| jefftk wrote:
| _> Talking about dates before this epoch makes little sense
| (any date before the Gregorian epoch will not resemble the
| same stellar / planet constellation as date after the
| Gregorian epoch)._
|
| That is not the only sense in which we care about dates. For
| example, we might want to talk about which of two events came
| first, and by how much. Historians have lots of uses for
| dates before the introduction of the Gregorian calendar.
| SBArbeit wrote:
| Yes, but we won't need 100ns precision for those historical
| dates.
| contravariant wrote:
| Surely they'd be using some kind of timestamp standard not
| a UUID?
| ygra wrote:
| Windows' FILETIME uses a different epoch, though. It counts
| since 1601 (the then-most-recent 400-year leap-year cycle
| when Windows NT was designed).
| pjscott wrote:
| A precision of 100 ns is not bad, just inflexible; I didn't
| mean to imply otherwise. The epoch _is_ weird, though. Here
| 's a thought experiment: imagine that a hundred programmers
| are each asked to pick an epoch for a timestamp, and will be
| paid $1000 for each other programmer who chooses the same
| epoch, but they can't talk with each other to coordinate.
| Which would you pick? I would pick the Unix epoch, because I
| think others would do the same. Anything else is, in a
| certain important sense, weird.
| zamadatix wrote:
| Yet that thought process wouldn't allow for the Unix epoch
| in the first place.
| FactolSarin wrote:
| That doesn't make the epoch weird, just different. The Unix
| Epoch is itself weird. I personally like it mostly for
| being a neat little historical artifact, but it's not
| interesting in any sense. What epoch would your
| hypothetical programmers settle on if they all had their
| minds wiped of any existing epochs.
|
| I bet you'd see a lot of Jan 1 1900, 2000, or year 1. Very
| few would pick 1970.
|
| Other fun non-Jan 1 ideas might be December 26, 1791
| (Charles Babbage's birthday) or February 5, 1944, (the date
| Colossus came online)
| ericb wrote:
| This is a great example of a Schelling Point!
|
| I'd also argue that you're right, and that as a principle,
| good UI/UX and software architecture should default to
| following natural Schelling Points.
| MrManatee wrote:
| I wouldn't say it makes "little sense". Just like we can talk
| about the year 776 BC even though no one at the time called
| it that, we can extend the Gregorian calendar backwards to
| dates when it wasn't used anywhere. The Wikipedia article on
| the proleptic Gregorian calendar lists some use cases. [1]
|
| And in any case, 15 October 1582 isn't some hard cutoff point
| where we can stop worrying about calendar conversions. Only
| four countries adopted the Gregorian calendar on that day,
| and even in Europe there are several countries that only
| switched in the 20th century. If a piece of software needs to
| support historical dates that go anywhere near 1582, it needs
| to be aware of the existence of different calendars.
|
| [1]
| https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
| hn_throwaway_99 wrote:
| To clarify, are those "plus some random bits" parts intended to
| be randomly generated for each individual UUID that is created,
| or randomly generated once per machine or startup?
|
| I ask because a common newbie bug is to use a UUID as a secure
| token, but currently only v4 UUIDs with a cryptographically
| secure PRNG can be used this way. Separately, even if not used
| as a token, using UUIDs with randomness for object IDs can help
| mitigate IDOR vulnerabilities if there are other bugs in code
| that aren't adequately checking object permissions.
| pjscott wrote:
| They're to be generated for each individual UUID. (However,
| this RFC recommends using the entirely random UUID4 for
| anything that's meant to be used in a security-related
| context, presumably with a CSPRNG.)
| hn_throwaway_99 wrote:
| Thanks. To be honest then, that last bit, "this RFC
| recommends using the entirely random UUID4 for anything
| that's meant to be used in a security-related context,
| presumably with a CSPRNG." makes me think this RFC could
| cause some problems.
|
| v6 UUIDs only have 48 bits of randomness, which means that
| some newbies will think that's "good enough" for security,
| when in reality it's not.
|
| I still like these new UUID types because I would use them
| as DB primary keys now (benefit of being time sorted but
| also globally unique and offers some protection against
| IDOR bugs), but important to know where not to use them.
| pjscott wrote:
| Yeah, I'm a bit torn on this security-wise. On the one
| hand, their only real security-relevant ambition here is
| to avoid leaking MAC address info, a classic UUID foot-
| gun, and they do achieve this. Probably a much better set
| of defaults! On the other hand, yeah, the mere presence
| of some random bits could lull people into a false sense
| of safety, especially as the RFC doesn't say anything
| about where those bits should come from except that it
| should have enough entropy to make collisions minimal. On
| the third hand, anybody who thinks that 48 random bits of
| unspecified provenance are enough for a secure token was
| probably always going to mess up _somehow_ , so I'm not
| sure how much difference this makes on the margin.
| aidos wrote:
| They're perfect for client generation of db ids. I was
| looking to do exactly this format, but decided not to do
| anything that was weirdly non-standard.
|
| Uuid6 is the one I'll be switching to now.
| staticassertion wrote:
| I feel like understanding the guarantees is really
| simple. A value can be unique while still being
| guessable, and that's the idea with this latest revision.
| If you want something to be unique and unguessable,
| that's another uuid (or just use random bits, there's no
| benefit to a uuid here, generally).
| dspillett wrote:
| _> only v4 UUIDs with a cryptographically secure PRNG can be
| used this way_
|
| Only they _should_ be used that way. Unfortunately other
| options (other UUID types, v4 with bad RNG) are sometimes
| used that way. At least other UUID variants are more faf to
| hack around than a simple increasing integer, particularly if
| decent request and access validation is in place, though if
| the wrong type of key is being used perhaps hoping for good
| validation elsewhere is a bit much.
| tialaramex wrote:
| After some staring, I can't see why (beyond the obvious, I
| understand HN rules) this is here.
|
| This is clearly an individual draft. OK. And it has previously
| expired without action (last year) but a newer draft was
| submitted this year.
|
| But it doesn't seem to have been adopted by any working group,
| and although the word "dispatch" appears in the title it was not
| discussed at IETF 111's DISPATCH or GENDISPATCH or SECDISPATCH
| last week. If this was in fact dispatched somewhere - perhaps
| before it expired last time - there's no indication where it went
| or what its status is now.
|
| It is the nature of these things that _if_ everybody chooses to
| do exactly this, even if it was only documented in a by-then
| expired draft, or on the back of an envelope, then that 's how it
| is -- the IETF has no enforcement arm. However if you support
| this proposal, or even if you think it'd be a good idea with
| minor tweaks you should find out where (and if) it's being
| developed and get on board with that.
| villasv wrote:
| Maybe the community response here would provide guidance on
| whether this is worth pursuing?
|
| I for one have been waiting for an IETF-standard sortable UUIDs
| for a while, so I'm happy to upvote this even if it is "blog
| post wishlist"-stage of standardization.
| politician wrote:
| I'm really excited to see k-sortable unique identifiers (flakes)
| be submitted as an IETF draft. This will help keep the UUID data
| type relevant.
|
| However, I'd like to see the draft include a mention about the
| practice of embedding machine and data type identifiers into the
| format which helps in distributed applications.
| pjscott wrote:
| Section 7, "Distributed UUID Generation", mentions that you MAY
| hold some of the most significant random bits constant per-
| machine to act as a machine identifier.
| cratermoon wrote:
| the practice of embedding machine and data type identifiers
| into the format is a security risk
| politician wrote:
| It might be, but whether it is depends on your threat model.
|
| If you need to obscure both when in time and where in space,
| then UUIDv4 is a better option.
| emodendroket wrote:
| How do you figure? There's no guarantee guids aren't supposed
| to be predictable, right?
| lazide wrote:
| For one? It allows you to figure out which machine is
| generating UUIDs, how fast, and all UUIDs it has ever
| generated. Which 1) is info leakage that is useful to many
| people in many ways, 2) can be a security risk as it tells
| you who to attack. 3) at a minimum is a privacy risk.
| jhealy wrote:
| The author seems to be developing the draft on GitHub and there's
| a few edits since the v01 version linked here
|
| https://github.com/uuid6/uuid6-ietf-draft
| kokizzu3 wrote:
| meanwhile.. i did make a shorter one :3
| https://github.com/kokizzu/lexid
| darkhorse13 wrote:
| A bit off-topic, but does anyone know how to build webpages that
| look like this? Like is there any way to do it other than doing
| everything manually on a text editor?
| wrs wrote:
| Not sure exactly what you mean by "look like this", but several
| tools for generating RFCs in the standard text format are here:
| https://tools.ietf.org/
| LukeShu wrote:
| Most new RFCs are authored in an XML format that gets rendered
| to the officially-submitted plaintext (well, since 2016 (RFC
| 7990) if you use v3 of the XML schema, that may be submitted
| directly as the canonical version, otherwise the rendered
| plaintext is the canonical version.)
|
| So the authors wrote this XML
| https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
|
| That XML got turned in to this plaintext
| https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
| by the xml2rfc tool https://xml2rfc.tools.ietf.org/
|
| And then that plaintext got turned in to the linked HTML by the
| datatracker.ietf.org server software, which can be found in SVN
| https://svn.ietf.org/svn/tools/ietfdb/
| Croftengea wrote:
| Maybe just wrap your text into <pre> tags? :)
| mrgleeco wrote:
| OT but related: a reverse-sorted UUID format seems especially
| useful for IoT and event data where typically we want to read
| newest to oldest lexicographically (eg. rowscans starting at
| now). Is there such a standard or OSS that does this?
| geostyx wrote:
| This looks awesome. Can't wait till I can add this to uuid.rocks!
| fhrow4484 wrote:
| > implementations MAY dedicate a portion of the node's most
| significant random bits to a pseudo-random machineID which helps
| identify UUIDs created by a given node. This works to add an
| extra layer of collision avoidance.
|
| > This machine ID MUST be placed in the UUID proceeding [sic] the
| timestamp and sequence counter bits. This position is selected to
| ensure that the sorting by timestamp and clock sequence is still
| possible.
|
| This guarantees uniqueness at a global level, as long as each
| machine doesn't run out of sequence counters within a given
| timestamp.
|
| But why must that machine ID must preceding the timestamp &
| sequence counter? Why not have it after? (or does "proceeding"
| has a meaning I'm not aware of? I read it as a typo for
| "preceding", but I'd assume it should be succeeding, especially
| given what the next sentence says)
|
| My intuition is range requests based on timestamp would work
| better if the machine ID is after, not before... If before, it
| seems it would violate the key requirement in abstract of
| "sortable using the monotonic creation time".
|
| (It's already violated since each machine in distributed system
| doesn't have the same clock so "creation time" is all relative.
| But for purposes of analytics, such as querying the "last 24h",
| having the timestamp be at the beginning seems preferable, since
| range queries can be done easily)
| pjscott wrote:
| Since there's no valid way for part of the node to be put in
| front of the timestamp and sequence counter bits -- that would
| contradict the format specifications in section 4 -- they
| probably meant to write "succeeding" or "right after". (There
| are, alas, still some typos in this draft.)
| NyxWulf wrote:
| I found that confusing as well. If the machine id is before the
| unixts, the primary advantage of using this is lost for me.
| Scanning an index for a 24 hour period is only fast if you can
| easily find the start and stop, and they have locality. Since
| that is the primary problem addressed with this rfc, I hope it
| is just a poorly worded section, rather than a design flaw.
| greggyb wrote:
| > This machine ID MUST be placed in the UUID proceeding the
| timestamp and sequence counter bits.
|
| I read this as a strange word choice, but interpret as follows:
| "This machine ID MUST proceed from the timestamp and sequence
| counter." X proceeding from Y implies that X comes after Y.
|
| If we examine the context, it is absolutely unambiguous
| (emphasis mine):
|
| > This machine ID MUST be placed in the UUID proceeding [sic]
| the timestamp and sequence counter bits. This position is
| selected _to ensure that the sorting by timestamp and clock
| sequence is still possible_.
|
| The proceeding sentence makes the intent clear, that a naive
| sort should preserve time ordering. The only way for this to
| work is for the timestamp and sequence counter bits to precede
| the optional machine ID.
| Wevah wrote:
| I think there was some proceeding/preceding confusion in GP.
| I misread it at first, too.
___________________________________________________________________
(page generated 2021-08-06 23:00 UTC)