       [HN Gopher] New UUID Formats - IETF Draft
       ___________________________________________________________________
        
       Author : anuragsoni
       Score  : 363 points
       Date   : 2021-08-06 15:09 UTC (7 hours ago)
        
 (HTM) web link (datatracker.ietf.org)
 (TXT) w3m dump (datatracker.ietf.org)
        
       | atonse wrote:
       | TLDR they present 3 new versions that each have their own trade-
       | offs, but are also taking into account being able to use these as
       | DB primary keys, which is really great.
       | 
       | Another thing I was quite curious about (looks like you can use
       | them in existing DB columns etc):
       | 
       | > The UUID length of 16 octets (128 bits) remains unchanged. The
       | textual representation of a UUID consisting of 36 hexadecimal and
       | dash characters in the format 8-4-4-4-12 remains unchanged for
       | human readability. In addition the position of both the Version
       | and Variant bits remain unchanged in the layout.
        
         | lytedev wrote:
         | Doesn't the textual representation not matter for using them in
         | DB columns? The DB just uses the actual bytes, no?
        
           | dnautics wrote:
           | depends on the DB. And it depends on your framework. Last I
           | checked Django from about 5 years ago? still shipped uuids as
           | text fields to the database.
        
             | Macha wrote:
             | DB support ranges from "None" (SQLite) to "We'll give you some
             | functions to work on text/ints that are supposed to be
             | UUIDs" (MySQL) to "It's an actual native data type"
             | (Postgres) among the 3 most common databases used by Django
             | users, so it's hard to blame Django for going with a lowest
             | common denominator approach.
        
               | dnautics wrote:
               | I have since switched to Elixir/Ecto, and it just "does
               | the right thing" in all of the above databases. Also, I
               | think SQLite is capable of supporting uuid with one of
               | the plugins that you can add into the amalgamation with a
               | compile switch, in either case ideally the framework
               | would support it transparently.
        
           | Macha wrote:
           | SQL is a textual format. Your DB driver may expect a UUID to
           | be provided in the standard string format. If your
           | language lacks a standard UUID type, the drivers may expect
           | you to provide UUIDs as strings even when using parameter
           | binding.
           | 
           | e.g. MySQL accepts only "no dashes" and the standard textual
           | format:
           | https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-
           | functi...
           | 
           | Postgres is a bit more flexible, but still expects hyphens to
           | be after multiples of four characters if present:
           | https://www.postgresql.org/docs/9.1/datatype-uuid.html
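
To make the text-versus-bytes distinction in this subthread concrete,
here is a minimal sketch using Python's standard uuid module. The
sample UUID value is an arbitrary illustration, not taken from the
draft. It shows the dashed and dash-less text forms parsing to the
same 16-byte value that a native UUID column would hold.

```python
import uuid

# The same 128-bit value in the two textual forms mentioned above.
canonical = "123e4567-e89b-12d3-a456-426614174000"  # 8-4-4-4-12, 36 chars
no_dashes = canonical.replace("-", "")               # 32 hex chars

u1 = uuid.UUID(canonical)
u2 = uuid.UUID(no_dashes)

# Both spellings parse to the same value; only the text differs.
same_value = (u1 == u2)

# The raw 16 octets are what a binary/native UUID column stores.
raw = u1.bytes
print(len(canonical), len(no_dashes), len(raw))  # 36 32 16
```

Whether a given driver wants the string or the bytes is still up to
the driver, as noted above.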
        
             | lytedev wrote:
             | This is a good point. Thank you!
        
           | gregoriol wrote:
           | The important part is the number of bits used for storage;
           | the textual version is only for humans
        
           | weird-eye-issue wrote:
           | Application level code validates UUIDs based on a specific
           | format not to mention what the specific database might expect
           | depending on the data type
        
             | lytedev wrote:
             | This sounds like a mistake? When/why should the application
             | validate the format of a UUID?
        
               | cratermoon wrote:
               | Why shouldn't they?
        
         | athenot wrote:
         | For human readability, I've always preferred encoding UUIDs in
         | base 64 and swapping a couple characters. If I'm going to have
         | some ugly identifier in a URL, I might as well make it more
         | concise.
         | 
         | I think youtube does something similar with their video IDs.
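
A rough sketch of the base 64 idea, assuming the "swapped characters"
are the URL-safe alphabet ("-" and "_" in place of "+" and "/"). The
helper names uuid_to_short and short_to_uuid are made up for this
example.

```python
import base64
import uuid

def uuid_to_short(u: uuid.UUID) -> str:
    # 16 bytes encode to 24 base64 chars including "==" padding;
    # dropping the padding leaves 22 characters.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

def short_to_uuid(s: str) -> uuid.UUID:
    # Restore the two padding characters before decoding.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

u = uuid.uuid4()
short = uuid_to_short(u)
print(len(str(u)), len(short))  # 36 22
```

The short form round-trips to the same UUID, so it can be used in
URLs while the database keeps the ordinary 128-bit value.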
        
       | jjice wrote:
       | I'm a bit confused as to what the use cases for UUIDs are,
       | compared to incrementing integer IDs. I assume there are some big
       | upsides (aside from the primary key issue, which seems to have
       | quite a few solutions at this point), but I'm just not aware of
       | them.
        
         | tomschlick wrote:
         | Because they are unique on generation, you don't need to talk to
         | the database before generating one. This is huge for
         | distributed systems, or systems that have offline components
         | (think mobile apps that sync). The ids can be generated client
         | side and passed up to be stored in the database without worry.
        
         | tschellenbach wrote:
         | incrementing ids requires a centralized server of some sort to
         | hand out IDs when you get to a certain scale. you don't have
         | that issue with UUIDs. for small apps just incrementing ids
         | works fine though
         | 
         | this approach also allows you to generate ids client side.
         | which if you're working with a local db on ios/android etc this
         | is very convenient
        
         | dangerlibrary wrote:
         | From the "Introduction" section of the linked document:
         | 
         | "The motivation for using UUIDs as database keys stems
         | primarily from the fact that applications are increasingly
         | distributed in nature. Simplistic "auto increment" schemes with
         | integers in sequence do not work well in a distributed system
         | since the effort required to synchronize such numbers across a
         | network can easily become a burden. The fact that UUIDs can be
         | used to create unique and reasonably short values in
         | distributed systems without requiring synchronization makes
         | them a good candidate for use as a database key in such
         | environments. "
        
         | lolinder wrote:
         | One big one is distributed computing: for example, an app that
         | can create To Do items offline and periodically syncs with a
         | server.
         | 
         | If you have your To Do items identified by an incrementing
         | integer, you have to come up with some _other_ unique
         | identifier to use for To Do items that have yet to be synced
         | with the server's database (since you can't know which ID to
         | use next).
         | 
         | On the other hand, if you use UUIDs, your app can just create a
         | new UUID for the To Do item, which will continue to be its ID
         | for subsequent updates.
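
A tiny sketch of the offline To Do case described above. The pending
list and create_todo function are hypothetical stand-ins for an app's
local store; the point is only that the ID is assigned on the client
without asking a server.

```python
import uuid

# Hypothetical local queue of items that have not been synced yet.
pending = []

def create_todo(title: str) -> dict:
    # The ID is generated locally and stays the item's ID after sync.
    item = {"id": str(uuid.uuid4()), "title": title}
    pending.append(item)
    return item

a = create_todo("buy milk")
b = create_todo("file taxes")
```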
        
         | marcus_holmes wrote:
         | The huge one for me is that if I make a mistake in code or
         | admin and accidentally run "update users set permission =
         | 'root' where id=$1" using an account id instead of a user id,
         | it does nothing. If I'd used integers I'd have given root
         | permission to a random user.
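
The mix-up described above can be sketched with an in-memory SQLite
database. The table layout is invented for the example, and the IDs
are stored as text since SQLite has no native UUID type.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, permission TEXT)")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY)")

user_id = str(uuid.uuid4())
account_id = str(uuid.uuid4())
db.execute("INSERT INTO users VALUES (?, 'user')", (user_id,))
db.execute("INSERT INTO accounts VALUES (?)", (account_id,))

# The mistake: an account id is passed where a user id was expected.
cur = db.execute(
    "UPDATE users SET permission = 'root' WHERE id = ?", (account_id,)
)
rows_changed = cur.rowcount  # no user has that id, so nothing matches

permission = db.execute(
    "SELECT permission FROM users WHERE id = ?", (user_id,)
).fetchone()[0]
```

With small integer keys, the same statement would very likely have
matched some unrelated user row.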
        
           | medstrom wrote:
           | So, it's like creating a special data type for each "counter-
           | style" identifier. For example suppose a StudentsId which is
           | not an integer, it's a StudentsInteger, so it cannot be mixed
           | up with UsersId. Admittedly that seems more clunky than just
           | using uuids.
        
             | marcus_holmes wrote:
             | Yeah you could do that if you _really_ want to use
             | integers. But I'm not sure I'd trust the database layer to
             | not just silently convert the type since they're both based
             | on integers.
        
         | spyspy wrote:
         | One non-performance reason is that incrementing IDs inherently
         | leak information about your application and/or your company. If
         | you create an account somewhere and you see that your user ID
         | is 452 you now know how many users the company has. In the
         | company's eyes they'd probably prefer to keep that obscured.
        
           | rjsw wrote:
           | Also known as the German tank problem [1].
           | 
           | [1] https://en.wikipedia.org/wiki/German_tank_problem
        
             | stavros wrote:
             | Not exactly, as seeing a fresh ID lets you pretty
             | accurately tell how many items there are.
        
         | thamer wrote:
         | One of the most important is that you can't iterate over all
         | the UUIDs. How many data leaks have come from people
         | incrementing an ID in a URL?
         | 
         | This is so common, I've experienced this myself as a student
         | with a summer job: the company I was working for used an online
         | service for their payroll, I accessed my pay slip one day and
         | changed ?employee=123 to 124 in the URL out of curiosity and it
         | loaded the data for a different person in a different company.
         | They fixed it incredibly quickly once I told them about it :)
        
         | alex_duf wrote:
         | Incrementing an ID requires complete knowledge in a single
         | place, which at scale becomes a challenge. This would typically
         | be done in the database, but doesn't work well with distributed
         | systems, as multiple servers might think "5" follows "4",
         | leading to multiple rows identified by the same ID, i.e. a
         | collision.
         | 
         | So UUIDs are great because multiple servers can generate a
         | plethora of them without having to worry about collisions.
         | 
         | This is all explained in the first few pages of the document,
         | although not in simple terms.
         | 
         | Also I haven't spotted anything about UUIDv6 trying to avoid
         | collisions, but I haven't read very far.
        
           | bjt wrote:
           | The last 48 bits should do it.
           | 
           | " node: 48-bit pseudo-random number used as a spatially
           | unique identifier Occupies bits 80 through 127 (octets
           | 10-15)"
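
For readers who want to see where that field sits, a small sketch:
bits 80 through 127 (counting from the most significant bit) are the
last six octets, i.e. the lowest 48 bits of the 128-bit integer. The
sample UUID is an arbitrary illustrative value.

```python
import uuid

u = uuid.UUID("1ec9414c-232a-6b00-b3c8-9e6bdeced846")

# Lowest 48 bits of the integer == octets 10-15 == last text group.
node = u.int & ((1 << 48) - 1)
node_octets = u.bytes[10:16]

print(f"{node:012x}")  # 9e6bdeced846
```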
        
         | remus wrote:
         | As well as the other use cases mentioned, they're harder to
         | guess than incrementing integer IDs so are a useful defense in
         | depth to help stop people guessing URLs for resources.
        
       | bob1029 wrote:
       | v7 sounds compelling for keying entities in our product. We use
       | v4 right now.
       | 
       | What I am trying to determine is if 62 bits of entropy combined
       | with a timestamp gives me better or worse collision resistance
       | than 122 bits in a purely-random format. Most of our keys only
       | live within the scope of a single computer, but ~5% of them might
       | be generated/shared between other systems.
       | 
       | Being able to order our keys by time would be really nice, and I
       | like that they would compress/index better.
       | 
       | Maybe I could do a hybrid between V4 and V7 keys depending on if
       | the type would be shared with external parties. There are many
       | types that only ever scope to a single box.
       | 
       | I probably couldn't play with the idea of adding a "machine id"
       | because the space of all possible machines is difficult to
       | anticipate/control right now.
       | 
       | I think I just talked myself into sticking with v4.
        
         | NyxWulf wrote:
         | v7 allows you to trade time precision for randomness. Adding
         | time precision allows the surface area for conflicts to
         | decrease, so the loss of random bits to higher time precision
         | is worth it if you are generating a lot of ids. Unless it is a
         | security application, I would use v7 for most things.
        
           | chociej wrote:
           | Trying to wrap my head around how this works when
           | distributed. If v7 is used with a high time precision, it's
           | important that each node in the distributed system have a
           | very tightly synchronized time, right?
           | 
           | If so, with v7, I'd feel the need to find the right balance
           | of time precision, time synchronization, node identification,
           | and randomness. Just some thoughts out loud.
        
         | jffry wrote:
         | Probability wise, nearly doubling your entropy will
         | significantly reduce the risk of a collision in your generated
         | IDs, but there's lots of other considerations for you to make.
         | But for the raw odds perspective, from
         | https://en.wikipedia.org/wiki/Birthday_attack we can get a
         | rough idea of collision chance. You can plug this into Wolfram
         | Alpha [1][2] to play around with it.
         | 
         | For a one-in-a-million chance of having at least one collision,
         | you'd need to generate 3.04e6 random 62-bit values or 3.26e15
         | random 122-bit values. For a one-in-a-thousand chance, those
         | numbers would be 9.6e7 or 1.03e17 respectively.
         | 
         | In other words: if you're allocating v7 IDs (with no sub-second
         | portion of the timestamp) at a sustained rate of 3.04 million
         | per second, there's a 1-in-a-million chance each second of
         | having at least one collision. Taking 1 - 0.999999^86400, that
         | gives an 8.3% chance of at least one collision after a day,
         | 45.4% after a week, and 92.5% after a month.
         | 
         | If you instead were allocating v4 IDs at the same rate for a
         | month, you'd have allocated (3.04e6)x86400x30 IDs, and you'd
         | have a probability of 5.84e-12 of at least one collision among
         | all of those IDs.
         | 
         | [1] Collision probability given number of draws:
         | https://www.wolframalpha.com/input/?i=n%3D3.04e6%2C+H%3D2%5E...
         | 
         | [2] Min number of draws to get a given collision probability:
         | https://www.wolframalpha.com/input/?i=p%3D0.000001%2C+H%3D2%...
         | 
         | edit: If you are storing milliseconds in subsec_a and
         | generating IDs at the same rate, you'd have 3040 ids per
         | millisecond, which have a probability 1.00164e-12 of a
         | collision in any given millisecond. Taking
         | 1-(1-1.00164e-12)^(86400*1000) gives you a probability of
         | 0.0087% of a collision in a day, 0.061% in a week, or 0.26% in
         | a month. So as long as your event generation is relatively
         | constant across that second, you will indeed have a much lower
         | collision chance if you are using millisecond subdivisions. If
         | you have even higher precision time or are using a sequence
         | counter, that can be pushed down even further.
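
The arithmetic above can be reproduced without Wolfram Alpha. This is
a sketch using the usual birthday approximation p = 1 - exp(-n(n-1)/2H)
with H = 2^bits; the function names are made up for the example, and
the results are approximations, not exact probabilities.

```python
import math

def collision_probability(n: float, bits: int) -> float:
    """Approximate chance of at least one collision among n random values."""
    space = 2.0 ** bits
    # -expm1(-x) is 1 - exp(-x), computed accurately for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * space))

def draws_for_probability(p: float, bits: int) -> float:
    """Approximate number of draws needed to reach collision probability p."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * -math.log1p(-p))

def over_periods(p: float, periods: int) -> float:
    """Chance of at least one collision across independent periods."""
    return -math.expm1(periods * math.log1p(-p))

# Draws needed for a one-in-a-million chance, 62 vs 122 random bits.
n62 = draws_for_probability(1e-6, 62)
n122 = draws_for_probability(1e-6, 122)

# 62 random bits, 3.04 million IDs per one-second bucket, over a day.
per_second = collision_probability(3.04e6, 62)
per_day_v7 = over_periods(per_second, 86400)

# 122 random bits, the same rate, all IDs of a 30-day month in one pool.
per_month_v4 = collision_probability(3.04e6 * 86400 * 30, 122)

# 62 random bits with millisecond buckets: 3040 IDs per bucket.
per_ms = collision_probability(3040, 62)
per_day_ms = over_periods(per_ms, 86400 * 1000)

print(n62, n122, per_day_v7, per_month_v4, per_day_ms)
```

The printed values should land close to the figures quoted above;
small differences come from rounding the per-second chance to exactly
one in a million.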
        
       | LukeShu wrote:
       | As a quick reference for readers:
       | 
       | The UUID specs use the terms "variant" and "version" a little
       | funny; "variant" is essentially the revision (so all modern UUIDs
       | have variant=0b10 to specify RFC 4122 UUIDs), and "version" is a
       | 4-bit number identifying the sub-type within that variant:
       |     UUIDv1 time-based
       |     UUIDv2 legacy DCE security thing, not widely used
       |     UUIDv3 name-based with md5 hashing
       |     UUIDv4 randomness-based
       |     UUIDv5 name-based with sha1 hashing
       | 
       | This draft registers a few new sub-types:
       |     UUIDv6 sortable time-based, Gregorian calendar
       |     UUIDv7 sortable time-based, Unix time
       |     UUIDv8 sortable time-based, custom time
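
As a sketch of where those two fields live: the version is the top
four bits of the third text group and the variant is the top two bits
of the fourth. The helper name is invented for this example, and the
v6-looking sample value is only illustrative.

```python
import uuid

def version_and_variant(u: uuid.UUID) -> tuple:
    version = (u.int >> 76) & 0xF   # first hex digit of the third group
    variant = (u.int >> 62) & 0b11  # top two bits of the fourth group
    return version, variant

v4 = uuid.uuid4()
sample = uuid.UUID("1ec9414c-232a-6b00-b3c8-9e6bdeced846")

print(version_and_variant(v4))      # (4, 2)
print(version_and_variant(sample))  # (6, 2)
```

A variant of 2 is the 0b10 pattern mentioned above.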
        
         | paulddraper wrote:
         | tl;dr
         | 
         | Use UUIDv4 and get on with your day.
        
           | ygra wrote:
           | Well, not if you need database primary keys with sensible
           | index locality. Which is what this draft is about.
        
             | phkahler wrote:
             | Aren't indexes sorted? I suppose using time as the MSBs
             | means new batches are always appended (after sorting just
             | the new ones).
        
               | mianos wrote:
               | Yes they are. But, consider the common query, for records
               | based on a time index on the last day. This index will
               | point to a bunch of primary keys and those, being random,
               | will be spread across a much wider range of blocks. If
               | the primary key was just the time, those items would be
               | on the same or close blocks.
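
A toy model of the locality point, with a sorted Python list standing
in for an index. This is only an illustration under simplifying
assumptions: real B-tree pages behave differently, and the
time-prefixed keys below use a counter in place of a timestamp rather
than any layout from the draft.

```python
import bisect
import uuid

def appended_fraction(keys):
    """Fraction of inserts that land at the very end of a sorted index."""
    index, at_end = [], 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        at_end += pos == len(index)
        index.insert(pos, key)
    return at_end / len(keys)

n = 2000
# Fully random 16-byte keys, as with UUIDv4.
random_keys = [uuid.uuid4().bytes for _ in range(n)]
# Hypothetical time-prefixed keys: a 6-byte counter standing in for a
# timestamp, followed by 10 random bytes.
time_keys = [i.to_bytes(6, "big") + uuid.uuid4().bytes[:10] for i in range(n)]

random_fraction = appended_fraction(random_keys)
time_fraction = appended_fraction(time_keys)
print(random_fraction, time_fraction)
```

Time-prefixed keys always append, while random keys almost always
land somewhere in the middle of the existing entries.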
        
       | OliverJones wrote:
       | If the epoch in this proposal were changed to something closer to
       | the present day (2000-01-01 or the UNIX epoch) the format could
       | easily recover some bits from the time fields to put in the PRNG-
       | generated "node" fields. I wonder why they chose the Gregorian
       | epoch?
        
         | pjscott wrote:
         | The Gregorian epoch was chosen for UUID v6 in order to make it
         | as similar to v1 as possible, for ease of updating existing
         | software. There are two more types of UUID proposed in this
         | draft RFC: v7 which uses the Unix epoch and allows more time
         | precision, and v8 which can use any monotonic timestamp source.
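
To illustrate how close v6 is to v1, here is a sketch that rearranges
a v1 UUID's 60-bit Gregorian timestamp so the most significant bits
come first, which is my reading of the v6 layout. The function name
v1_to_v6 and the sample UUID are assumptions for the example, and the
offset constant is the count of 100 ns intervals between 1582-10-15
and 1970-01-01.

```python
import uuid

# 100 ns intervals between the Gregorian epoch and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000

def v1_to_v6(u: uuid.UUID) -> uuid.UUID:
    ts = u.time                      # 60-bit timestamp from the v1 fields
    time_high_mid = ts >> 12         # most significant 48 bits
    time_low = ts & 0xFFF            # least significant 12 bits
    low64 = u.int & ((1 << 64) - 1)  # variant, clock sequence and node, unchanged
    value = (time_high_mid << 80) | (0x6 << 76) | (time_low << 64) | low64
    return uuid.UUID(int=value)

v1 = uuid.UUID("c232ab00-9414-11ec-b3c8-9e6bdeced846")
v6 = v1_to_v6(v1)
unix_seconds = (v1.time - GREGORIAN_OFFSET) // 10_000_000

print(v6)  # 1ec9414c-232a-6b00-b3c8-9e6bdeced846
```

Because the timestamp bits now run from most to least significant,
comparing v6 values as bytes or as hex text orders them by time.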
        
       | gopalv wrote:
       | > The machineID MUST NOT be an IEEE 802 MAC address.
       | 
       | > MAC addresses pose inherent security risks and MUST not be used
       | for node generation.
       | 
       | Interesting concern in the distributed generation pathway.
       | 
       | I've used MAC addresses in the past for absolutely unique
       | identifiers, but this is calling out that as a security risk,
       | because the time + arp data might be known to predict a future
       | UUID from a machine?
        
         | WorldMaker wrote:
         | UUIDv1 IDs with MAC addresses have been used for tracking
         | people down. There are especially a lot of classic examples
         | from older Word documents where v1 UUIDs were often embedded in
         | many places inside the documents. Groups used (and sometimes
         | still use) the MAC addresses in such Word documents to track
         | down specific authors (for FBI investigations on the more
         | [extra-]legal side and for "doxxing" and such on the far less
         | legal side).
         | 
         | Whether you consider this specific example threat a privacy
         | risk versus a security risk here depends on your personal
         | threat model, of course. But general consensus at this point is
         | to label it a security risk.
        
           | jackpirate wrote:
           | That makes sense, but that seems like it's a problem with any
           | machine-unique id, not just the MAC address. Is there some
           | specific reason MAC addresses are a worse choice than (e.g.)
           | CPU serial number?
        
             | WorldMaker wrote:
             | This RFC has reason to call out MAC addresses directly as a
             | security risk because they were standardized way back in
             | UUIDv1. (To generate a standards compliant v1 UUID you are
             | still required to use a MAC address, which is why everyone
             | suggests you use UUIDv4 today, and exactly why UUIDv6
             | exists in the RFC, it's basically exactly UUIDv1 without
             | the MAC address.)
             | 
             | The RFC leaves finding a machine-unique ID that isn't a
             | security risk as an exercise to the user. (Mentioned
             | directly in the v7 description where some bytes are
             | optionally allocated for "flake"-like machine IDs.)
        
       | ComputerGuru wrote:
       | See ulid for similar prior art. We generate Ulids in code and
       | then store them in uuid columns in Postgres to realize the
       | compact (binary) size benefits.
       | 
       | https://github.com/ulid/spec
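
A minimal sketch of the "ULID in a uuid column" approach, assuming
the ULID layout of a 48-bit millisecond timestamp followed by 80
random bits and the Crockford base32 text form. The function names
are made up for the example; it is not any particular ULID library.

```python
import os
import time
import uuid

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid_bytes(now_ms=None) -> bytes:
    # 6 bytes of Unix time in milliseconds, then 10 random bytes.
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    return now_ms.to_bytes(6, "big") + os.urandom(10)

def ulid_text(raw: bytes) -> str:
    # 128 bits rendered as 26 base32 characters, 5 bits per character.
    value = int.from_bytes(raw, "big")
    return "".join(CROCKFORD[(value >> shift) & 31] for shift in range(125, -1, -5))

raw = new_ulid_bytes()
as_uuid = uuid.UUID(bytes=raw)  # same 16 bytes, viewed as a UUID
print(ulid_text(raw), as_uuid)
```

The value handed to the database is just the 16 bytes; the base32
text form is only needed where humans or URLs see the ID.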
        
         | WorldMaker wrote:
         | I put in the work to store ULIDs in SQL Server's interesting
         | UUIDv1-based sort order for uniqueidentifier columns. It seems
         | to be providing the clustered index benefits I was hoping for
         | so far (far less clustered index churn behavior than v4 Random
         | UUIDs).
         | 
         | An interesting digression to bring up with respect to this RFC
         | because it is interesting to note that none of the proposed v6,
         | v7, v8 UUIDs match the SQL Server sort order behavior and would
         | still have a lot of clustered index churn unless new sort
         | behaviors were allowed to be opted in. That might be something
         | that this RFC would help to address by also making sort orders
         | a bit more standard, though it does not directly make
         | recommendations on that front.
         | 
         | For what it is worth, it is the SQL Server sort order that is
         | weird: it sorts the last group of the UUID string form first,
         | which in v1 UUIDs corresponded to the MAC Address. Clustering
         | by machine probably made sense to whoever decided on that sort
         | order, at the time, but even with v1 UUIDs they probably would
         | have got better results sorting by timestamp than by machine.
         | 
         | (The ULID timestamp fits exactly into that "MAC address" group
         | and it's just a matter of swapping the front 6 bytes to the
         | back of the UUID when storing to uniqueidentifier.)
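
The byte swap described above might look like the following sketch.
It only shows the rotation of the 6 timestamp bytes into the last
group; it deliberately ignores the mixed-endian byte order of .NET
Guid values, and the function names are invented for the example.

```python
def ulid_to_stored_order(raw: bytes) -> bytes:
    # Move the leading 6 timestamp bytes to the end, where the
    # v1 "MAC address" group sits.
    return raw[6:] + raw[:6]

def stored_order_to_ulid(stored: bytes) -> bytes:
    # Inverse rotation: bring the trailing 6 bytes back to the front.
    return stored[10:] + stored[:10]

ulid_bytes = bytes(range(16))  # placeholder 16-byte value
stored = ulid_to_stored_order(ulid_bytes)
restored = stored_order_to_ulid(stored)
```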
         | 
         | Fun reference on UUID sort orders:
         | https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
        
           | ComputerGuru wrote:
           | It's not just a question of whether you want to cluster by
           | MAC address, it's more of a fundamental disagreement as to
           | the endianness of the UUID groups themselves, which is
           | probably the biggest difference between a GUID and a UUID in
           | the wild. Microsoft has used GUIDs to brand OS internals for
           | a long, long time - long enough that there's a chance they
           | came before UUID was a thing, but I haven't dug into the
           | history of it.
           | 
           | Unfortunately converting between them without any db
           | extensions is a PITA in the absence of native support. I've
           | written about it at length here (SQLite in the example, but
           | the fundamentals apply regardless):
           | https://neosmart.net/blog/2018/converting-a-binary-blob-
           | guid...
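
The endianness difference is easy to see with Python's uuid module,
which exposes both byte orders. This is just an illustration with an
arbitrary sample value, not the conversion from the linked post.

```python
import uuid

u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

big_endian = u.bytes       # network order, as in RFC 4122
mixed_endian = u.bytes_le  # first three groups little-endian (GUID-style)

print(big_endian.hex())    # 00112233445566778899aabbccddeeff
print(mixed_endian.hex())  # 33221100554477668899aabbccddeeff

# Reading the GUID-style bytes back needs the matching constructor argument.
round_trip = uuid.UUID(bytes_le=mixed_endian)
```

Only the first three groups are byte-swapped; the last two are
identical in both layouts.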
        
             | WorldMaker wrote:
             | That was the question specific to SQL Server storage sort
             | order for clustered indexing and most relevant to ULID to
             | uniqueidentifier "conversion"/storage, but yes there are
             | endianness issues in the sort orders _as well_, which is
             | why I made sure to include the full Raymond Chen blog post
             | on many of the most common sort orders for GUIDs because it
             | is quite the rabbit hole. (The "update" footnote link at
             | the bottom of the article I posted to the Java order is
             | especially wild if you fall into the rabbit hole.)
             | 
             | (ETA: Also to answer your indirect question, Microsoft was
             | definitely an early adopter of GUID/UUID. Though the
             | inventor was Apollo Computer as a part of their complex
             | network design:
             | https://en.wikipedia.org/wiki/Apollo_Computer)
        
               | ComputerGuru wrote:
               | Thanks for clarifying - perhaps another reader will find
               | that link for in-db conversion useful. And thanks for the
               | interesting information and link to the Apollo Computer;
               | I have some fodder for bedtime reading now!
        
               | WorldMaker wrote:
               | No problem. Yeah, for what it is worth (getting further
               | aside), I gave up on supporting the ULID conversion
               | directly in-db for that same reason: Base-32 math (which
               | ULID uses for canonical string format) is tough enough in
               | SQL before you add in the endian issues. The endian
               | issues aren't as big of a deal in my storage sort order
               | situation because the MAC address section that I use for
               | the timestamp data most critical to the sort order
               | doesn't suffer from the endian issue and the remainder of
               | a ULID is entirely random and I'm not too concerned if
               | the random section doesn't sort identically to the ULID
               | sort order. (It would be great if it did, but it's not as
               | critical as the more dominant timestamp order.) That
               | said, .NET's GUID ToByteArray() handles the endian order
               | for me and I'm somewhat confident it's correct so long as
               | it is my .NET code doing ULID-to-uniqueidentifier
               | conversions and not the scripts I tried to use directly
               | in-db.
        
         | mgkimsal wrote:
         | What's a bit interesting to me is having read all the
         | negativity/criticism around ULID, and now seeing these
         | proposals by IETF...
        
           | ComputerGuru wrote:
           | Most ulid criticism surrounds the ill-advised (and optional!)
           | part of the spec that tries to achieve sub-ms ordering at the
           | cost of requiring a singleton or cross-thread atomics to try
           | and generate random values for the random portion of the blob
           | from a range that would end up providing additional sort
           | order bits. The reference implementation in JS doesn't even
           | offer this (although that's likely because JS doesn't even
           | support multiple execution threads within the same context).
           | 
           | Apart from that little misadventure (last I checked, there
           | was a request for comments on a proposal to drop that from
           | the Ulid spec altogether) there's really nothing else that's
           | not sane and straightforward in the spec, and little else to
           | complain about.
        
         | roberto wrote:
         | They list ULID and 15 other implementations in the document.
        
           | ComputerGuru wrote:
           | Ah, so they do. Thanks.
        
       | Waterluvian wrote:
       | I find it peculiar that the Introduction section on these drafts
       | is always just boilerplate.
        
       | cratermoon wrote:
       | I prefer ksuid https://github.com/segmentio/ksuid
        
         | jrochkind1 wrote:
         | The IETF draft mentions ksuid in the "prior art consulted"
         | section.
         | 
         | Do you mean you prefer ksuid to this new draft? Your comment
         | would be more helpful if you said what about it you prefer.
        
           | cratermoon wrote:
           | I prefer ksuids because
           | 
           | they have 128 bits of pseudorandom data
           | 
           | Both text and binary representations are lexicographically
           | sortable
           | 
           | They don't have the awkward dash ('-') to mess with url
           | encoding
        
             | leo_bloom wrote:
             | > They don't have the awkward dash ('-') to mess with url
             | encoding
             | 
             | Can you explain what you mean by that? I have experienced
             | issues with " " as "+" vs. "%20", but dashes... ?
        
               | cratermoon wrote:
               | Not in URLs per se, but in HTML URL Encoded Multipart
               | Forms, i.e. enctype="multipart/form-data"
               | 
               | ETA: also, when you put a UUID as text into a lot of
               | search engines, it treats the dashes as token delimiters.
        
             | jrochkind1 wrote:
             | Which of those things are not true of these new UUID
             | formats proposed?
             | 
             | Let's see if I can figure it out (correct me if I'm wrong).
             | 
             | I think these newly proposed UUIDs ARE lexicographically
             | sortable (in timestamp order, I think is the implication)
             | in both text and binary? I got the impression that was one
             | of the main things they provide compared to older UUID
             | formats, and one of the main reasons for their proposal.
             | Am I wrong?
             | 
             | I know the new UUIDs look the same as the ones we're used
             | to, where they often have dashes (although there's
             | certainly no requirement you display them with dashes, you
             | can strip the dashes out before putting them in a URL if
             | you want, although yeah it's an extra step).
             | 
             | [If we're talking about URL-encoding we're talking about
             | putting them in URLs? ksuid's 160 bits instead of UUID's
             | 128, making a longer URL, seems like a downside?]
             | 
             | Not sure how many bits of pseudorandom/entropy they have...
             | Looks like... UUIDv6 has 48 bits of pseudorandom in
             | addition to timestamp, to reduce timestamp collisions.
             | UUIDv7 is variable and can let the application decide (am I
             | understanding right?), but supports up to 62 bits of
             | pseudorandom data. [I guess you prefer a longer ID with
             | more bytes of pseudorandom... just to further minimize
             | possibility of timestamp collision, or other reasons? I am
             | not sure how often existing UUID timestamp collisions
             | happen in practice, anyone know? Anyone ever had one?]
        
               | cratermoon wrote:
               | > Which of those things are not true of these new UUID
               | formats proposed?
               | 
               | Well, the entire UUID is 128 bits, so obviously not 128
               | bits of pseudorandom data.
               | 
               | > What about 128 bits of pseudorandom data is preferable
               | to you?
               | 
               | Same as with any hash: collision resistance. That's why
               | we don't use SHA-1 for anything secure any more. The dash
               | problem arises not in the URL but the multipart/form-data
               | encoding, as well as the following:
               | 
               | "when UUIDs are indexed by a search engine, where the
               | dashes will likely be interpreted as token delimiters.
               | The base62 encoding avoids this pitfall and retains the
               | lexicographic ordering properties of the binary
               | encoding."
               | 
               | https://segment.com/blog/a-brief-history-of-the-uuid/
        
               | jrochkind1 wrote:
               | OK, thanks for explaining!
               | 
               | I have never had or thought of an application where a
               | search engine interpreting the dashes as token delimiters
               | was a problem. But if you have, and it was a problem, ok,
               | that explains your preference I guess!
               | 
               | Curious if anyone reading has run into collisions with
               | existing v1 UUIDs, or knows anyone who has, how often.
               | 
               | Pretty sure these new UUIDs are lexicographically sortable
               | (in timestamp order) in their hex representation (with or
               | without dashes) as well as binary. Lots of people agree
               | this is important; in fact, that seems to be one of the
               | main motivations of these new UUID formats in the draft
               | we're discussing, I think?
        
               | cratermoon wrote:
               | Yeah, they've solved one issue with the UUID format by
               | giving them k-sortability, but the other issues are still
               | there. Ultimately, if you need some kind of
               | UUID-compatible representation (e.g. for Vertica UUID
               | columns), then these are good approaches. If you need to
               | solve the same problems but don't have a reason to be
               | locked into UUIDs, pick something else.
        
         | tfehring wrote:
         | Other than the additional pseudorandom bits (which also have
         | drawbacks), what's the advantage of ksuid over UUIDv6 as
         | defined in the draft?
        
           | cratermoon wrote:
           | Both text and binary representations are lexicographically
           | sortable
           | 
           | They don't have the awkward dash ('-') to mess with url
           | encoding
        
             | shortstuffsushi wrote:
             | Could you explain your concern with a dash in a URL? To my
             | knowledge, it shouldn't conflict with anything else the way
             | a slash, space, or plus might. What sort of issues do you
             | see?
        
               | cratermoon wrote:
               | Not in the URL specifically, but in forms, with the
               | enctype="multipart/form-data"
        
               | shortstuffsushi wrote:
               | Oh, are you referring to some sort of conflicts with the
               | "boundary" separators? I would be interested to hear more
               | about this, I haven't personally had encoding issues with
               | dashes, or thought to try to encode them (or w/e) to
               | avoid any issues
        
       | sudhirj wrote:
       | For those interested in time based UUIDs, I've written libraries
       | in Ruby and Go to move quickly between them:
       | 
       | https://github.com/sudhirj/uulid.go
       | https://github.com/sudhirj/shortuuid.rb
       | https://github.com/sudhirj/shortuuid.go
        
         | stavros wrote:
         | Those look suspiciously similar to the Python one!
        
           | sudhirj wrote:
           | Probably is quite similar; converting numbers between bases
           | is a textbook algo, and the UUID format is a standard.
           | Which Python one?
        
             | stavros wrote:
             | This one: https://pypi.org/project/shortuuid/
        
               | sudhirj wrote:
               | Oh, nice. Won't do one with Python, then :D
               | 
               | But yeah, I've found the ability to move IDs between
               | alphabets quite useful.
        
       | gfody wrote:
       | if you were about to use a uuid as a primary key in a database,
       | wouldn't it always be better to instead use a composite key with
       | explicit columns for sequence, timestamp, node id, etc.? if you
       | really need to accept client side generated values and there's no
       | opportunity to issue them a unique node id before hand, then
       | explicitly taking their mac address and region seems better than
       | stealthily relying on those things being embedded in a uuid -
       | also aren't mac addresses considered PII?
        
       | nabla9 wrote:
       | The background section gives reasons derived from looking at
       | existing implementations.
       | 
       | ---
       | 
       | Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many
       | widely distributed database applications and large application
       | vendors have sought to solve the problem of creating a better
       | time-based, sortable unique identifier for use as a database
       | key. This has lead to numerous implementations over the past 10+
       | years solving the same problem in slightly different ways.
       | 
       | - Timestamps MUST be k-sortable. That is, values within or close
       | to the same timestamp are ordered properly by sorting algorithms.
       | 
       | - Timestamps SHOULD be big-endian with the most-significant bits
       | of the time embedded as-is without reordering.
       | 
       | - Timestamps SHOULD utilize millisecond precision and Unix Epoch
       | as timestamp source. Although, there is some variation to this
       | among implementations depending on the application requirements.
       | 
       | - The ID format SHOULD be Lexicographically sortable while in the
       | textual representation.
       | 
       | - IDs MUST ensure proper embedded sequencing to facilitate
       | sorting when multiple UUIDs are created during a given timestamp.
       | 
       | - IDs MUST NOT require unique network identifiers as part of
       | achieving uniqueness.
       | 
       | - Distributed nodes MUST be able to create collision resistant
       | Unique IDs without a consulting a centralized resource.
        
       | RcouF1uZ4gsC wrote:
       | > 48-bit pseudo-random number used as a spatially unique
       | identifier Occupies bits 80 through 127 (octets 10-15)
       | 
       | Is 48 bits really enough to be a spatially unique identifier?
       | Roughly 16 million entities would have a 50% chance of collision.
       | 
       | If you have a spatially unique identifier collision, it seems it
       | might be possible for two independent entities to generate the
       | same time stamp and counter codes resulting in an overall UUID
       | collision.
        
         | ltbarcly3 wrote:
         | While possible, with a high precision clock it should be quite
         | rare. In any case, UUIDs aren't meant to be universally unique
         | to the point you will never see the same one for any reason
         | across every use case in the galaxy, although many versions do
         | allow you to neglect to code to handle collisions for almost
         | every practical use case.
         | 
         | If you need 16 million entries to get a 50% chance of
         | collision, 100ns time resolution means you have 10^7 timestamps
         | per second, so to actually experience a 50% chance of collision
         | requires an instantaneous hash rate of over 10^14 per second,
         | which I don't think you are likely to ever see in practice
         | before these UUID versions are long obsolete.
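
The arithmetic in this exchange can be checked with the standard birthday approximation. This is a rough sketch: it assumes the 48-bit field is uniformly random, and the function name is made up for illustration.

```python
import math


def ids_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate count of random `bits`-bit values giving collision probability p."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


# 50% collision point for a 48-bit random field: about 19.8 million values,
# the same order of magnitude as the "roughly 16 million" (2**24) quoted above.
n_half = ids_for_collision_probability(48)

# Following the comment's reasoning: with 100 ns resolution there are 10**7
# distinct timestamps per second, so that many IDs would be needed per tick.
ticks_per_second = 10 ** 7
rate_needed = n_half * ticks_per_second
```

Under these assumptions the required generation rate comes out above 10^14 IDs per second, which agrees with the estimate in the comment.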
        
       | riffic wrote:
       | Also see RFC 4122:
       | 
       | https://datatracker.ietf.org/doc/html/rfc4122
        
       | infinityplus1 wrote:
       | Does anyone have any opinion on how Firebase push keys compare to
       | UUIDs? Firebase push keys are said to be unique and are sortable.
       | Here's a link to the push key generator:
       | https://gist.github.com/mikelehen/3596a30bd69384624c11
        
       | rootusrootus wrote:
       | "UUIDv8 SHOULD only be utilized if an implementation cannot
       | utilize UUIDv1, UUIDv6, or UUIDv8." I assume that is a typo.
        
         | LukeShu wrote:
         | Indeed, I believe that it should read "... or UUIDv7."
         | 
         | UUIDv1 non-sortable time-based
         | UUIDv6 sortable time-based, Gregorian calendar
         | UUIDv7 sortable time-based, Unix time
         | UUIDv8 sortable time-based, custom time
         | 
         | That is: The custom-time version should only be used if you
         | cannot use one of the standard time versions.
        
           | Wevah wrote:
           | There's another typo in the v7 section where it's erroneously
           | referred to as v8.
        
         | [deleted]
        
       | submeta wrote:
       | OT: Nicely written and formatted ASCII document. I wonder what
       | tools they used to create the headers / page numbers / refs.
        
         | krinchan wrote:
         | There are a lot of tools that let you write a source document
         | and compile it into something that fits the RFC style guide.
         | 
         | XML and markdown are the most common source document formats.
         | 
         | https://www.rfc-editor.org/pubprocess/tools/
        
         | pawal wrote:
         | There are a number of different tools for writing internet-
         | drafts. See here: https://tools.ietf.org/tools/
        
         | sswaner wrote:
         | I had the same thought: Interesting topic, but I am very
         | distracted by how much I love the format of the document.
        
       | Croftengea wrote:
       | To put it simply, the new standard aims to address UUID usage as
       | primary keys in distributed systems.
       | 
       | But UUIDv7 is described as: Unix timestamp + fractions of second
       | + increasing sequence + random number. Now imagine two processes
       | start simultaneously and start generating UUIDs right away. IMHO
       | chances of the two processes generating exactly the same UUID
       | sequence are pretty high unless the implementation is smart
       | enough to feed something like the process ID as a seed to the
       | random function.
        
         | renonce wrote:
         | Well-implemented processes usually seed their random generators
         | from a random device, such as /dev/urandom in Linux. Such
         | devices hand out different values for each request.
        
       | jozvolskyef wrote:
       | What are the advantages of sortable UUIDs with embedded
       | timestamps over random 128 bits and a created_at column?
        
         | junon wrote:
         | FWIW the linked RFC's "Background" section is really easy to
         | understand even for people who aren't well versed in dist-sys
         | concepts. I suggest giving it a read - it's quite well written.
        
         | bilinguliar wrote:
         | This allows you to keep recent records in DB cache, for
         | example. In contrast, random keys would constantly invalidate the
         | cache unless all your data fits in memory.
        
         | ape4 wrote:
         | Isn't that a possible security issue? MACs used to be
         | predictable but then became random.
        
           | lazide wrote:
           | I'm pretty sure you mean 'stable' not predictable here
           | re:MACs?
           | 
           | The issue with using MAC addresses is it essentially leaked
           | tracking information on the machine being used to generate
           | the uuid - you could associate every ID it ever generated
           | back to that same machine, correlate it to things like open
           | WiFi AP logs, whatever. Which is pretty scary sometimes.
           | 
           | Many OSes now randomly generate a new MAC on every
           | disconnect/connect for this reason, to at least give some
           | privacy.
        
           | jozvolskyef wrote:
           | I presume you are referring to UUIDv4/1, which has 122 random
           | bits. It is a common misconception that this is insecure. In
           | reality, the UUIDv4/1 is just a way to encode 122 random
           | bits, which is more than enough for most needs. The property
           | of being practically unguessable comes from the random
           | generator, not from the encoding.
           | 
           | Secondly, whether or not being predictable is a problem
           | depends on the use case. The keys used on this forum are
           | predictable and it is not an issue.
        
         | BenoitEssiambre wrote:
         | On top of reasons given by others, if the data likely to be
         | queried together tends to get stored together because of this,
         | things will be much more cache efficient and queries
         | potentially much faster, since for DB loads, cache efficiency
         | (not just RAM but CPU cache) alleviates some of the most
         | important performance bottlenecks.
        
         | masklinn wrote:
         | Random primary keys play hell on database indexes.
        
         | cratermoon wrote:
         | https://en.wikipedia.org/wiki/Partial_sorting
        
         | mrighele wrote:
         | The linked RFC talks about this (it refers to UUIDv4, which is
         | not much different from 128 random bits):
         | 
         | " First, most of the existing UUID versions such as UUIDv4 have
         | poor database index locality. Meaning new values created in
         | succession are not close to each other in the index and thus
         | require inserts to be performed at random locations. The
         | negative performance effects of which on common structures used
         | for this (B-tree and its variants) can be dramatic. As such
         | newly inserted values SHOULD be time-ordered to address this."
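
The locality effect described in the quoted passage can be sketched with a toy model. A sorted Python list stands in for the index here, which is an assumption for illustration only (a real B-tree behaves differently in detail), and the helper name is hypothetical.

```python
import bisect
import random


def tail_insert_fraction(keys):
    """Fraction of inserts that land at the very end of a sorted index."""
    index = []
    at_tail = 0
    for k in keys:
        pos = bisect.bisect_right(index, k)
        if pos == len(index):
            at_tail += 1
        index.insert(pos, k)
    return at_tail / len(keys)


rng = random.Random(42)
n = 5000

# Fully random 128-bit keys, similar in spirit to UUIDv4.
random_keys = [rng.getrandbits(128) for _ in range(n)]

# Time-ordered keys: an increasing counter in the high bits, random low bits.
ordered_keys = [(t << 64) | rng.getrandbits(64) for t in range(n)]

random_frac = tail_insert_fraction(random_keys)
ordered_frac = tail_insert_fraction(ordered_keys)
```

In this model every time-ordered key is appended at the end, while random keys almost never are, so their inserts are scattered across the whole index.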
        
           | bob1029 wrote:
           | Note that this totally-random key behavior is actually highly
           | desirable when using dynamic programming techniques such as
           | with the Splay Tree. It makes it statistically impossible for
           | the tree to fall into a degenerate state.
        
             | pjscott wrote:
             | Splay trees aren't a dynamic programming thing (and aren't
             | used in any database storage backends I know of). They are,
             | however, conjectured to be no more than a constant factor
             | slower than a dynamically optimal binary search tree.
             | 
             | https://en.wikipedia.org/wiki/Optimal_binary_search_tree
        
               | bob1029 wrote:
               | > Splay trees aren't a dynamic programming thing
               | 
               | My mistake - I meant to write "dynamic optimality".
               | 
               | As for the "not used in any database storage backends", I
               | have personally started experimenting with them due to
               | higher-order effects they can provide relative to their
               | optimization technique.
               | 
               | The fact that the most recently accessed node is always
               | at the root also means that incremental tree updates are
               | well-bounded in terms of # of nodes that would need to be
               | rewritten to disk. This allows for much more efficient
               | usage of storage-friendly things like append-only log
               | structures.
        
               | pjscott wrote:
               | Today I learned! That sounds fascinating; is there
               | anything written down about it somewhere?
        
               | bob1029 wrote:
               | I actually don't think there is any specific reference I
               | have ever encountered. It was something that just popped
               | into my head one day when I was trying to figure out a
               | way to write small batches of key-value data into a log.
               | 
               | Once you put 2+2 together on it, it's like a lot of
               | concepts seem to snap into place all at once. You also
               | get incredible locality of access - each batch can
               | sometimes fit within a single I/O block depending on
               | system loading and data patterns, so traversing the tree
               | to _any_ node modified in the previous flush to disk
               | would involve a single I/O.
               | 
               | Edit: If you search "splay tree append only log" on
               | Google, you will find one of my previous HN comments on
               | the first page of results:
               | https://news.ycombinator.com/item?id=28010167 This tells
               | me there isn't much precedent for this idea.
        
           | LukeShu wrote:
           | _> UUIDv4, which is not much different from 128 random bits_
           | 
           | To quantify how different: UUIDv4 has 6 fixed bits and 122
           | random bits.
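
The 6 fixed bits can be inspected with Python's standard `uuid` module. This is a small sketch; the bit offsets follow the RFC 4122 layout, with the version in bits 76-79 and the variant in bits 62-63 of the 128-bit integer.

```python
import uuid

u = uuid.uuid4()

version_bits = (u.int >> 76) & 0xF  # 4 bits, always 0b0100 for v4
variant_bits = (u.int >> 62) & 0x3  # 2 bits, always 0b10 for the RFC 4122 variant

fixed = 4 + 2
random_bits = 128 - fixed  # the remaining 122 bits are random
```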
        
         | parhamn wrote:
         | It depends what "created_at" means. If it means "persisted_at"
         | then they're also a bit different, because these sorts of UUIDs
         | are generated before being persisted (often long before, for
         | offline-and-sync systems).
        
         | dheera wrote:
         | Searching records and returning the results time-sorted in O(n)
         | time instead of O(n log n) time.
        
         | gumby wrote:
         | That is addressed in the introduction -- random access inserts
         | (i.e. specularity) can be quite inefficient depending on your
         | backing store.
        
         | causasui wrote:
         | I use KSUIDs a ton in DynamoDB; it saves me from having to
         | create and fill another index to sort items by creation time.
         | 
         | The only case where I don't use them is if I don't want the
         | creation time of the item to be known to the user.
        
         | jandrewrogers wrote:
         | There are two issues not mentioned in the other comments with a
         | random 128-bit UUID in real systems, both related to data
         | scale:
         | 
         | - They defeat data compression, which is an integral part of
         | many database engines these days for both performance and
         | scalability reasons.
         | 
         | - For some very large systems, there is a UUID collision tail
         | risk that while small is large enough to be plausible.
        
           | neur4lnet wrote:
           | Is compression a legitimate concern for primary keys?
        
             | bob1029 wrote:
             | If you are getting good compression on your primary keys,
             | this may be cause for alarm.
             | 
             | These should be high-entropy items.
        
               | jandrewrogers wrote:
               | The sole constraint on a primary key is that it is
               | _unique_. Even in distributed systems, deterministically
               | unique keys are almost always much more compressible than
               | probabilistically unique keys, which is a double win.
               | 
               | Should you desire a high-entropy key, e.g. for hash
               | tables or security, this can be trivially derived from
               | the low-entropy key while still guaranteeing
               | deterministic uniqueness.
               | 
               | Compressible primary keys are superior in every way.
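
One possible reading of "trivially derived" is a bijective mixing function over 64-bit keys. The sketch below is an assumption for illustration, not something the commenter specified: the constant and the mixing steps are arbitrary choices, and the result is not cryptographically secure.

```python
MASK64 = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15          # any odd constant is invertible mod 2**64
MULT_INV = pow(MULT, -1, 1 << 64)  # modular inverse (Python 3.8+)


def scramble(key: int) -> int:
    """Map a 64-bit key to a scattered 64-bit value, one-to-one."""
    x = (key * MULT) & MASK64
    x ^= x >> 32                   # self-inverse for a shift of half the width
    return (x * MULT) & MASK64


def unscramble(value: int) -> int:
    """Exact inverse of scramble()."""
    x = (value * MULT_INV) & MASK64
    x ^= x >> 32
    return (x * MULT_INV) & MASK64


sequential = range(1, 10001)
scrambled = [scramble(k) for k in sequential]
```

Because every step is invertible, distinct stored keys always give distinct derived values, so uniqueness stays deterministic while the stored key remains small and compressible.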
        
             | jandrewrogers wrote:
             | Definitely, I've run into this problem many times. UUIDs
             | are pretty bulky data types in database terms, usually
             | larger than most other types in a typical table on average
             | and possibly larger than the rest of the row after including
             | the index overhead, which creates cache pressure. Logical
             | columns in a table are commonly stored physically as a
             | vector, of UUIDs in this case, expressly for the purpose of
             | compressing the representation. This saves both disk cache
             | and CPU cache. For queries, the benefit of scanning
             | compressed vectors is that it reduces the average number of
             | page faults per query, which is one of the major
             | bottlenecks in databases.
             | 
             | Also, some data models (think graphs) tend to be not much
             | more than giant collections of primary keys. Using the
             | smallest primary key data type that will satisfy the
             | requirements of the data model is a standard performance
             | optimization in databases. It is not uncommon for the UUID
             | that the user sees to be derived from a stored primary key
             | that is a 32-bit or 64-bit integer.
        
               | NyxWulf wrote:
               | Uuids are bulky if you store them in text form, but in
               | binary form they are only 128 bits.
               | 
               | The main feature of a uuid is it allows distributed
               | generation. 32-bit or 64-bit integers are almost always
               | sequential numbers. The sequential nature allows
               | efficient page filling and index creation, but the
               | contention involved in creating a sequence grows rapidly
               | with scale.
               | 
               | So while a 128-bit uuid is larger than a 64 bit integer,
               | this version allows for the bulk of the benefits of
               | sequential integers while reducing the biggest drawback
               | of contention at the point of creation.
        
               | jandrewrogers wrote:
               | I was assuming binary format. 128 bits is a pretty heavy
               | data type in many data models with measurable impact on
               | performance versus using something smaller.
               | 
               | You also do not need 128 bits to decentralize unique key
               | generation, even though it is quite reasonable if you
               | design your keys well. Many do it with 64-bit unique keys
               | in massive scale systems.
               | 
               | A subtle point that you may be overlooking is that while
               | large-scale distributed databases, including all the ones
               | I work on, export globally unique 128-bit keys, in most
               | systems I've worked with they are internally represented
               | and stored as 64 bits or less even if the key space is
               | much larger than 64 bits. There are many different
               | techniques for doing key space elision and inference that
               | are used inside distributed databases to save space. The
               | 128-bit value is only materialized when sending it over
               | the wire to some external client system. But you don't
               | need to store it.
               | 
               | Literally storing a primary key in a distributed system
               | as a 128-bit value is all downside with few benefits. For
               | small systems the performance and scaling cost may not
               | matter that much but in very large systems it matters a
               | lot. It can -- literally! -- cost you millions of dollars
               | per year.
        
               | krinchan wrote:
               | Twitter snowflakes are unsigned 64-bit ids designed to be
               | created in a distributed fashion.
        
               | cogman10 wrote:
               | And, to be clear, those benefits are the placement of new
               | records in a DB.
               | 
               | If a UUID is completely random it means it can be
               | inserted anywhere which can require the DB to reshuffle
               | records and pages in order to make room for the new
               | record.
               | 
               | Having a sequential element to the UUID makes it a lot
               | easier to have an index where each record is inserted at
               | the end. Which, like you said, makes page usage more
               | efficient and decreases the amount of work a DB has to do
               | on insertion.
               | 
               | All this is a compounded problem if you have a DB with
               | frequent writes, a lot of indexes, or both.
        
               | manigandham wrote:
               | You can generate any size number in a distributed
               | fashion. The only difference is that 128-bits gives you
               | enough scale that it's practically impossible to have
               | collisions when randomly generating.
               | 
               | Unless you need to be completely disconnected, a little
               | coordination can drastically improve things. In past
               | companies, I used a simple counter with 64-bit integers
               | and each distributed process would increment a billion-
               | number range to use for IDs. Fast, efficient, compatible
               | with everything, naturally ordered, and guaranteed to
               | never have a collision.
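
The block-reservation scheme described above might look roughly like this. It is a minimal in-process sketch: `RangeAllocator` stands in for whatever shared counter the commenter used, and both class names are hypothetical.

```python
import threading


class RangeAllocator:
    """Central counter that hands out disjoint ID blocks (stand-in for a shared store)."""

    def __init__(self, block_size: int = 1_000_000_000):
        self.block_size = block_size
        self._next_block = 0
        self._lock = threading.Lock()

    def reserve(self) -> range:
        with self._lock:
            start = self._next_block * self.block_size
            self._next_block += 1
        return range(start, start + self.block_size)


class Worker:
    """Draws IDs locally; contacts the allocator only when its block runs out."""

    def __init__(self, allocator: RangeAllocator):
        self.allocator = allocator
        self._ids = iter(allocator.reserve())

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self.allocator.reserve())
            return next(self._ids)


allocator = RangeAllocator(block_size=3)  # tiny block so the refill path is visible
a, b = Worker(allocator), Worker(allocator)
ids_a = [a.next_id() for _ in range(5)]
ids_b = [b.next_id() for _ in range(5)]
```

With a realistic block size the coordination cost is one round trip per billion IDs. IDs are ordered within a worker but only roughly ordered across workers.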
        
         | [deleted]
        
         | munhitsu wrote:
         | CRDT
        
         | orf wrote:
         | Imagine you're not using a storage platform that supports that
         | access pattern. Think S3, DynamoDB or even a directory of
         | files. Being able to encode your sort key in your primary key
         | is a pretty nice property.
        
           | datavirtue wrote:
           | Hmmm...Now I'm wondering why I never bothered to prepend a
           | timestamp to a GUID.
        
             | orf wrote:
             | Because then it wouldn't be a UUID in a format that other
             | systems might expect or work more efficiently with.
        
         | weird-eye-issue wrote:
         | Not having a created_at column and the index bloat that goes
         | with it?
        
         | notamy wrote:
         | It works better for databases that can only sort on the primary
         | key (iirc Cassandra is one of these)
        
         | scrollaway wrote:
         | One data point instead of two. Which may or may not have
         | advantages depending on your situation.
         | 
         | At any rate it's definitely simpler, and you don't have to
         | forego the created_at column either.
        
         | lytedev wrote:
         | My understanding is that if the UUID is not sortable, inserting
         | it into an index can be a performance hit. So if you have a
         | write-heavy table with non-sorted primary keys, you will have a
         | lower ceiling (higher overhead).
        
           | srcreigh wrote:
           | Know about primary vs secondary indexes?
           | 
           | Generally 1 index contains the data; all the other indexes
           | have pointers to that index in their leaf nodes. Usually the
           | PK index contains the data.
           | 
           | So loading 1000 consecutive rows in a created_at secondary
           | index takes two steps:
           | 
           | 1) created_at index B-Tree traversal (maybe 10 pages from
           | disk, max?)
           | 
           | 2) Potentially loading 1000 randomly sorted pages from disk,
           | dereferencing pointers (EXPENSIVE!)
           | 
           | Whereas, if your primary key is sorted by time, loading 1000
           | rows is:
           | 
           | 1) slightly larger primary B-Tree traversal
           | 
           | 2) no 2, you're done :)
        
           | NyxWulf wrote:
           | The lack of sorting is what this rfc is fixing.
        
         | ignoramous wrote:
         | > _What are the advantages of sortable UUIDs_
         | 
         | I guess for use-cases like Instagram's: "Generated IDs should
         | be sortable by time (so a list of photo IDs, for example, could
         | be sorted without fetching more information about the
         | photos)... If you use a timestamp as the first component of the
         | ID, the IDs remain time-sortable... [but require] more storage
         | space (96 bits or higher) to make reasonable uniqueness
         | guarantees" [0]
         | 
         | [0] https://archive.is/xiMGZ
        
           | zimpenfish wrote:
           | cf also Twitter's Snowflake[1] which is [timestamp, worker
           | number, sequence number] to give the same "roughly sortable
           | by time" property.
           | 
           | [1] https://blog.twitter.com/engineering/en_us/a/2010/announc
           | ing...
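
The [timestamp, worker number, sequence number] layout can be sketched as bit packing. The field widths below (41, 10 and 12 bits) are the commonly cited Snowflake split and are treated as an assumption here, as is the idea that the timestamp counts milliseconds from some custom epoch.

```python
TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 10, 12


def pack(timestamp_ms: int, worker: int, sequence: int) -> int:
    """Pack the three fields into one integer, timestamp in the high bits."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= worker < (1 << WORKER_BITS):
        raise ValueError("worker out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (
        (timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
        | (worker << SEQUENCE_BITS)
        | sequence
    )


def unpack(value: int):
    """Recover (timestamp_ms, worker, sequence) from a packed ID."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    worker = (value >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = value >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker, sequence


earlier = pack(1000, 1023, 4095)  # largest worker and sequence in an earlier ms
later = pack(1001, 0, 0)          # smallest worker and sequence one ms later
```

Because the timestamp occupies the most significant bits, any ID from a later millisecond compares greater than any ID from an earlier one, whichever worker produced it. Within the same millisecond the order depends on worker number, hence "roughly sortable".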
        
         | halfmatthalfcat wrote:
         | Using the UUID as a PK to do cursor-based pagination, for one.
        
       | [deleted]
        
       | pjscott wrote:
       | A somewhat oversimplified summary of the new UUID formats:
       | 
       | UUID6: a timestamp with a weird epoch and 100 ns precision like
       | in UUID1, but in a big-endian order that sorts naturally by time,
       | plus some random bits instead of a predictable MAC address.
       | 
       | UUID7: like UUID6, but uses normal Unix timestamps and allows
       | more timestamp precision.
       | 
       | UUID8: like UUID7, but relaxes requirements on where the
       | timestamp is coming from. Want to use a custom epoch or NTP
       | timestamps or something? UUID8 allows it for the sake of
       | flexibility and future-proofing, but the downside is that there's
       | no standard way to parse the time from one of these -- the time
       | source could be anything monotonic.
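
The general idea behind these formats, a timestamp in the high bits followed by random bits, can be sketched as follows. This is a simplified, hypothetical layout (a 48-bit Unix millisecond timestamp plus random bits, with the version nibble set to 7 for illustration). It is not the exact field layout of UUIDv6, v7 or v8 in the draft.

```python
import os
import time
import uuid


def time_prefixed_uuid(unix_ms=None) -> uuid.UUID:
    """Hypothetical layout: 48-bit Unix ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Overwrite the version and variant bits in their standard positions.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


a = time_prefixed_uuid(1000)
b = time_prefixed_uuid(2000)
```

With this layout, IDs created in different milliseconds sort by creation time both as integers and in the usual 8-4-4-4-12 hex form. Ordering within the same millisecond is left to the random bits, which is where the draft's sequence counters would come in.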
        
         | paulddraper wrote:
         | 8 kinds of UUIDs, because we already had 4 too many.
        
           | pjscott wrote:
           | Which of the 5 existing kinds of UUID do you consider the One
           | True UUID? No matter which you pick, I guarantee that there's
           | _something_ badly wrong with it for at least one common use-
           | case that can be improved by switching to either v4 (IMHO the
           | only one of the existing types worth using) or to one of the
           | new proposed UUID types.
        
             | paulddraper wrote:
             | UUID v4
             | 
             | Are you concerned about the 0.0000001% chance of collision
             | after generating 100 trillion UUIDs?
             | 
             | Or are you trying to include metadata in your identifier?
             | (Not the worst thing, but it's also not super useful info.)
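
The quoted collision figure can be sanity-checked with the birthday approximation. This is a sketch that assumes all 122 random bits of a v4 UUID are uniformly distributed; the function name is made up for illustration.

```python
import math


def collision_probability(n: float, bits: int) -> float:
    """Birthday approximation: P(collision) ~ 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2.0 ** bits
    return -math.expm1(-(n * n) / (2.0 * space))


p = collision_probability(1e14, 122)  # 100 trillion v4 UUIDs
percent = p * 100
```

Under that assumption the probability is a little under 1e-9, or roughly 0.0000001%, in line with the figure in the comment.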
        
               | pjscott wrote:
               | I'm not worried about collisions and I agree that being
               | able to put metadata in the UUID is a big meh of a
               | feature. The problem with v4 is what happens when people
               | try using them as database keys: the random sort order
               | can really hurt performance. You might argue that the fix
               | for this is to simply _not_ use UUIDs as database keys...
               | but so many people are already doing this, and will
               | continue to do it, that they should probably be given
               | better standard options.
        
               | mdtusz wrote:
               | This is a fairly significant issue for mysql, but does
               | the same issue exist when using postgres with a UUID pk
               | type?
               | 
               | At my last job, this was managed by using two IDs on
               | each row - a serial one that was basically only used for
               | database optimization purposes, and the "real" UUIDv4 ID.
               | It felt gross then, and it feels gross now, but it seemed
               | to do the trick for our needs.
        
               | johncolanduoni wrote:
               | It doesn't, since Postgres doesn't actively use the
               | primary key for clustering.
        
               | paulddraper wrote:
               | > random sort order can really hurt performance
               | 
               | I've long thought it was a problem that PostgreSQL
               | doesn't have a real clustering key like MySQL or SQL
               | Server.
               | 
               | But PostgreSQL users have told me it doesn't really
               | matter.
        
               | dtech wrote:
               | It's much less important in postgres than in some other
               | databases, but it still hurts.
               | 
               | Mainly because indices have to be updated and searched
               | all over the place with UUIDs. The locality of the data
               | on disk itself is fine because it's inserted sequentially.
        
               | anarazel wrote:
               | > But PostgreSQL users have told me it doesn't really
               | matter.
               | 
               | Those users were very wrong, unless they explicitly only
               | talked about the case where the uuids are not indexed.
        
         | phkahler wrote:
         | >> but the downside is that there's no standard way to parse
         | the time from one of these
         | 
         | IMHO that should not be a concern. The goal of UUID is to
         | create a _Unique_ Identifier, not to record the time something
         | was created. Maybe we should all have quantum random number
         | generators, but with that I don't think 128 bits is enough for
         | a UUID. Might be enough for specific applications though.
         | 
         | First rule of database design: Your unique IDs should not have
         | a real-world meaning.
        
           | iratewizard wrote:
           | > First rule of database design: Your unique IDs should not
           | have a real-world meaning.
           | 
           | I see a lot of comments conflating a unique ID with a unique
           | column. It's OK for you to have a unique constraint on
           | something like an email column. But you wouldn't use the
           | email column as the primary key.
        
           | mulmen wrote:
           | > First rule of database design: Your unique IDs should not
           | have a real-world meaning.
           | 
           | Well, this is an opinion. I would definitely push back on
           | making it a rule. Especially the _first_ one.
           | 
           | Natural keys exist, make perfectly good unique identifiers,
           | and have inherent meaning.
           | 
           | When using _synthetic_ keys it is difficult (read: requires
           | making trade-offs) to guarantee they were created in order
           | and/or that they increase monotonically. But if you make
           | those trade-offs (or just align expectations) you can still
           | associate real world meaning to them.
           | 
           | These new UUIDs standardize more of those trade-offs.
        
           | koolba wrote:
           | In theory, theory and practice are the same. In practice
           | they're not.
           | 
           | Sortable unique IDs have many performance benefits in real
           | world systems.
        
           | wilg wrote:
           | If you don't think that's a concern then UUID8 is for you!
        
         | hinkley wrote:
         | > UUID6: a timestamp with a weird epoch and 100 ns precision
         | like in UUID1, but in a big-endian order that sorts naturally
         | by time
         | 
         | So even the IETF hasn't read Things Every Developer Should Know
         | About Time.
        
         | [deleted]
        
         | zvrba wrote:
         | > with a weird epoch and 100 ns precision
         | 
         | Exactly as used by Windows to store file times. And the epoch
         | is the start of the modern calendar
         | https://en.wikipedia.org/wiki/Gregorian_calendar
         | 
         | Talking about dates before this epoch makes little sense (any
         | date before the Gregorian epoch will not resemble the same
         | stellar / planet constellation as date after the Gregorian
         | epoch).
        
           | jefftk wrote:
           | _> Talking about dates before this epoch makes little sense
           | (any date before the Gregorian epoch will not resemble the
           | same stellar / planet constellation as date after the
           | Gregorian epoch)._
           | 
           | That is not the only sense in which we care about dates. For
           | example, we might want to talk about which of two events came
           | first, and by how much. Historians have lots of uses for
           | dates before the introduction of the Gregorian calendar.
        
             | SBArbeit wrote:
             | Yes, but we won't need 100ns precision for those historical
             | dates.
        
             | contravariant wrote:
             | Surely they'd be using some kind of timestamp standard not
             | a UUID?
        
           | ygra wrote:
           | Windows' FILETIME uses a different epoch, though. It counts
           | since 1601 (the then-most-recent 400-year leap-year cycle
           | when Windows NT was designed).
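
The three epochs mentioned in this subthread can be compared directly. A minimal sketch using only the Python standard library; the constant and function names are made up for illustration, and `datetime` here uses the proleptic Gregorian calendar.

```python
import uuid
from datetime import datetime, timedelta, timezone

GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)  # UUIDv1/v6 timestamps
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)     # Windows FILETIME
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

TICKS_PER_SECOND = 10_000_000  # one tick = 100 ns

def ticks_between(start: datetime, end: datetime) -> int:
    """Number of 100 ns ticks from start to end, using integer arithmetic."""
    delta = end - start
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10

UUID_TO_UNIX = ticks_between(GREGORIAN_EPOCH, UNIX_EPOCH)
FILETIME_TO_UNIX = ticks_between(FILETIME_EPOCH, UNIX_EPOCH)

def uuid1_datetime(u: uuid.UUID) -> datetime:
    """Decode a v1 UUID timestamp; u.time counts 100 ns ticks since 1582-10-15."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

print(hex(UUID_TO_UNIX), FILETIME_TO_UNIX)
print(uuid1_datetime(uuid.uuid1()))
```

Both formats count 100 ns ticks; only the starting point differs, so converting between them is a fixed offset.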
        
           | pjscott wrote:
           | A precision of 100 ns is not bad, just inflexible; I didn't
           | mean to imply otherwise. The epoch _is_ weird, though. Here's
           | a thought experiment: imagine that a hundred programmers
           | are each asked to pick an epoch for a timestamp, and will be
           | paid $1000 for each other programmer who chooses the same
           | epoch, but they can't talk with each other to coordinate.
           | Which would you pick? I would pick the Unix epoch, because I
           | think others would do the same. Anything else is, in a
           | certain important sense, weird.
        
             | zamadatix wrote:
             | Yet that thought process wouldn't allow for the Unix epoch
             | in the first place.
        
             | FactolSarin wrote:
             | That doesn't make the epoch weird, just different. The Unix
             | Epoch is itself weird. I personally like it mostly for
             | being a neat little historical artifact, but it's not
             | interesting in any sense. What epoch would your
             | hypothetical programmers settle on if they all had their
             | minds wiped of any existing epochs?
             | 
             | I bet you'd see a lot of Jan 1 1900, 2000, or year 1. Very
             | few would pick 1970.
             | 
             | Other fun non-Jan 1 ideas might be December 26, 1791
             | (Charles Babbage's birthday) or February 5, 1944 (the date
             | Colossus came online).
        
             | ericb wrote:
             | This is a great example of a Schelling Point!
             | 
             | I'd also argue that you're right, and that as a principle,
             | good UI/UX and software architecture should default to
             | following natural Schelling Points.
        
           | MrManatee wrote:
           | I wouldn't say it makes "little sense". Just like we can talk
           | about the year 776 BC even though no one at the time called
           | it that, we can extend the Gregorian calendar backwards to
           | dates when it wasn't used anywhere. The Wikipedia article on
           | the proleptic Gregorian calendar lists some use cases. [1]
           | 
           | And in any case, 15 October 1582 isn't some hard cutoff point
           | where we can stop worrying about calendar conversions. Only
           | four countries adopted the Gregorian calendar on that day,
           | and even in Europe there are several countries that only
           | switched in the 20th century. If a piece of software needs to
           | support historical dates that go anywhere near 1582, it needs
           | to be aware of the existence of different calendars.
           | 
           | [1]
           | https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
        
         | hn_throwaway_99 wrote:
         | To clarify, are those "plus some random bits" parts intended to
         | be randomly generated for each individual UUID that is created,
         | or randomly generated once per machine or startup?
         | 
         | I ask because a common newbie bug is to use a UUID as a secure
         | token, but currently only v4 UUIDs with a cryptographically
         | secure PRNG can be used this way. Separately, even if not used
         | as a token, using UUIDs with randomness for object IDs can help
         | mitigate IDOR vulnerabilities if there are other bugs in code
         | that aren't adequately checking object permissions.
        
           | pjscott wrote:
           | They're to be generated for each individual UUID. (However,
           | this RFC recommends using the entirely random UUID4 for
           | anything that's meant to be used in a security-related
           | context, presumably with a CSPRNG.)
        
             | hn_throwaway_99 wrote:
             | Thanks. To be honest then, that last bit, "this RFC
             | recommends using the entirely random UUID4 for anything
             | that's meant to be used in a security-related context,
             | presumably with a CSPRNG." makes me think this RFC could
             | cause some problems.
             | 
             | v6 UUIDs only have 48 bits of randomness, which means that
             | some newbies will think that's "good enough" for security,
             | when in reality it's not.
             | 
             | I still like these new UUID types because I would use them
             | as DB primary keys now (benefit of being time sorted but
             | also globally unique and offers some protection against
             | IDOR bugs), but important to know where not to use them.
        
               | pjscott wrote:
               | Yeah, I'm a bit torn on this security-wise. On the one
               | hand, their only real security-relevant ambition here is
               | to avoid leaking MAC address info, a classic UUID foot-
               | gun, and they do achieve this. Probably a much better set
               | of defaults! On the other hand, yeah, the mere presence
               | of some random bits could lull people into a false sense
               | of safety, especially as the RFC doesn't say anything
               | about where those bits should come from except that it
               | should have enough entropy to make collisions minimal. On
               | the third hand, anybody who thinks that 48 random bits of
               | unspecified provenance are enough for a secure token was
               | probably always going to mess up _somehow_, so I'm not
               | sure how much difference this makes on the margin.
        
               | aidos wrote:
               | They're perfect for client generation of db ids. I was
               | looking to do exactly this format, but decided not to do
               | anything that was weirdly non-standard.
               | 
               | Uuid6 is the one I'll be switching to now.
        
               | staticassertion wrote:
               | I feel like understanding the guarantees is really
               | simple. A value can be unique while still being
               | guessable, and that's the idea with this latest revision.
               | If you want something to be unique and unguessable,
               | that's another uuid (or just use random bits, there's no
               | benefit to a uuid here, generally).
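
For the "unique and unguessable" case, a minimal sketch using only the Python standard library. The 32-byte token length is an assumption for illustration, not something the draft or the thread specifies.

```python
import secrets
import uuid

# Plain random bits from the operating system's CSPRNG, URL-safe encoded.
# 32 bytes (256 bits) is an illustrative choice.
token = secrets.token_urlsafe(32)

# A fully random v4 UUID; in CPython this is backed by os.urandom.
session_id = uuid.uuid4()

print(token)
print(session_id)
```

A time-ordered identifier can still serve as the database key next to a token like this; the two jobs do not have to share one value.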
        
           | dspillett wrote:
           | _> only v4 UUIDs with a cryptographically secure PRNG can be
           | used this way_
           | 
           | Only they _should_ be used that way. Unfortunately other
           | options (other UUID types, v4 with bad RNG) are sometimes
           | used that way. At least other UUID variants are more faff to
           | hack around than a simple increasing integer, particularly if
           | decent request and access validation is in place, though if
           | the wrong type of key is being used perhaps hoping for good
           | validation elsewhere is a bit much.
        
       | tialaramex wrote:
       | After some staring, I can't see why (beyond the obvious, I
       | understand HN rules) this is here.
       | 
       | This is clearly an individual draft. OK. And it has previously
       | expired without action (last year) but a newer draft was
       | submitted this year.
       | 
       | But it doesn't seem to have been adopted by any working group,
       | and although the word "dispatch" appears in the title it was not
       | discussed at IETF 111's DISPATCH or GENDISPATCH or SECDISPATCH
       | last week. If this was in fact dispatched somewhere - perhaps
       | before it expired last time - there's no indication where it went
       | or what its status is now.
       | 
       | It is the nature of these things that _if_ everybody chooses to
       | do exactly this, even if it was only documented in a by-then
       | expired draft, or on the back of an envelope, then that's how it
       | is -- the IETF has no enforcement arm. However if you support
       | this proposal, or even if you think it'd be a good idea with
       | minor tweaks you should find out where (and if) it's being
       | developed and get on board with that.
        
         | villasv wrote:
         | Maybe the community response here would provide guidance on
         | whether this is worth pursuing?
         | 
         | I for one have been waiting for IETF-standard sortable UUIDs
         | for a while, so I'm happy to upvote this even if it is "blog
         | post wishlist"-stage of standardization.
        
       | politician wrote:
       | I'm really excited to see k-sortable unique identifiers (flakes)
       | be submitted as an IETF draft. This will help keep the UUID data
       | type relevant.
       | 
       | However, I'd like to see the draft include a mention about the
       | practice of embedding machine and data type identifiers into the
       | format which helps in distributed applications.
        
         | pjscott wrote:
         | Section 7, "Distributed UUID Generation", mentions that you MAY
         | hold some of the most significant random bits constant per-
         | machine to act as a machine identifier.
        
         | cratermoon wrote:
         | the practice of embedding machine and data type identifiers
         | into the format is a security risk
        
           | politician wrote:
           | It might be, but whether it is depends on your threat model.
           | 
           | If you need to obscure both when in time and where in space,
           | then UUIDv4 is a better option.
        
           | emodendroket wrote:
           | How do you figure? There's no guarantee that guids are
           | unpredictable, right?
        
             | lazide wrote:
             | For one? It allows you to figure out which machine is
             | generating UUIDs, how fast, and all UUIDs it has ever
             | generated. Which 1) is info leakage that is useful to many
             | people in many ways, 2) can be a security risk as it tells
             | you who to attack. 3) at a minimum is a privacy risk.
        
       | jhealy wrote:
       | The author seems to be developing the draft on GitHub and there's
       | a few edits since the v01 version linked here
       | 
       | https://github.com/uuid6/uuid6-ietf-draft
        
       | kokizzu3 wrote:
       | meanwhile.. i did make a shorter one :3
       | https://github.com/kokizzu/lexid
        
       | darkhorse13 wrote:
       | A bit off-topic, but does anyone know how to build webpages that
       | look like this? Like is there any way to do it other than doing
       | everything manually on a text editor?
        
         | wrs wrote:
         | Not sure exactly what you mean by "look like this", but several
         | tools for generating RFCs in the standard text format are here:
         | https://tools.ietf.org/
        
         | LukeShu wrote:
         | Most new RFCs are authored in an XML format that gets rendered
         | to the officially-submitted plaintext (well, since 2016 (RFC
         | 7990) if you use v3 of the XML schema, that may be submitted
         | directly as the canonical version, otherwise the rendered
         | plaintext is the canonical version.)
         | 
         | So the authors wrote this XML
         | https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
         | 
         | That XML got turned into this plaintext
         | https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
         | by the xml2rfc tool https://xml2rfc.tools.ietf.org/
         | 
         | And then that plaintext got turned into the linked HTML by the
         | datatracker.ietf.org server software, which can be found in SVN
         | https://svn.ietf.org/svn/tools/ietfdb/
        
         | Croftengea wrote:
         | Maybe just wrap your text into <pre> tags? :)
        
       | mrgleeco wrote:
       | OT but related: a reverse-sorted UUID format seems especially
       | useful for IoT and event data where typically we want to read
       | newest to oldest lexicographically (eg. rowscans starting at
       | now). Is there such a standard or OSS that does this?
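
No standard for this is named in the thread, but one way to get newest-first ordering is to store the complement of the timestamp in the leading bits. A minimal sketch with a made-up layout (48-bit inverted millisecond timestamp followed by 80 random bits); it is not one of the formats in the draft.

```python
import os
import time
from typing import Optional

TS_BITS = 48    # hypothetical width for a millisecond timestamp
RAND_BITS = 80  # remaining bits filled with randomness

def reverse_sortable_id(now_ms: Optional[int] = None) -> str:
    """Return a 32-hex-digit ID whose string order is newest first."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Larger (newer) timestamps become smaller leading values
    inverted = (1 << TS_BITS) - 1 - now_ms
    rand = int.from_bytes(os.urandom(RAND_BITS // 8), "big")
    return f"{(inverted << RAND_BITS) | rand:032x}"

older = reverse_sortable_id(1_000)
newer = reverse_sortable_id(2_000)
print(sorted([older, newer]) == [newer, older])
```

A forward scan over keys built this way starts at the most recent entry, which is the read pattern described above.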
        
       | geostyx wrote:
       | This looks awesome. Can't wait till I can add this to uuid.rocks!
        
       | fhrow4484 wrote:
       | > implementations MAY dedicate a portion of the node's most
       | significant random bits to a pseudo-random machineID which helps
       | identify UUIDs created by a given node. This works to add an
       | extra layer of collision avoidance.
       | 
       | > This machine ID MUST be placed in the UUID proceeding [sic] the
       | timestamp and sequence counter bits. This position is selected to
       | ensure that the sorting by timestamp and clock sequence is still
       | possible.
       | 
       | This guarantees uniqueness at a global level, as long as each
       | machine doesn't run out of sequence counters within a given
       | timestamp.
       | 
       | But why must that machine ID precede the timestamp &
       | sequence counter? Why not have it after? (or does "proceeding"
       | have a meaning I'm not aware of? I read it as a typo for
       | "preceding", but I'd assume it should be succeeding, especially
       | given what the next sentence says)
       | 
       | My intuition is range requests based on timestamp would work
       | better if the machine ID is after, not before... If before, it
       | seems it would violate the key requirement in the abstract of
       | "sortable using the monotonic creation time".
       | 
       | (It's already violated since each machine in a distributed system
       | doesn't have the same clock so "creation time" is all relative.
       | But for purposes of analytics, such as querying the "last 24h",
       | having the timestamp be at the beginning seems preferable, since
       | range queries can be done easily)
        
         | pjscott wrote:
         | Since there's no valid way for part of the node to be put in
         | front of the timestamp and sequence counter bits -- that would
         | contradict the format specifications in section 4 -- they
         | probably meant to write "succeeding" or "right after". (There
         | are, alas, still some typos in this draft.)
        
         | NyxWulf wrote:
         | I found that confusing as well. If the machine id is before the
         | unixts, the primary advantage of using this is lost for me.
         | Scanning an index for a 24 hour period is only fast if you can
         | easily find the start and stop, and they have locality. Since
         | that is the primary problem addressed with this rfc, I hope it
         | is just a poorly worded section, rather than a design flaw.
        
         | greggyb wrote:
         | > This machine ID MUST be placed in the UUID proceeding the
         | timestamp and sequence counter bits.
         | 
         | I read this as a strange word choice, but interpret as follows:
         | "This machine ID MUST proceed from the timestamp and sequence
         | counter." X proceeding from Y implies that X comes after Y.
         | 
         | If we examine the context, it is absolutely unambiguous
         | (emphasis mine):
         | 
         | > This machine ID MUST be placed in the UUID proceeding [sic]
         | the timestamp and sequence counter bits. This position is
         | selected _to ensure that the sorting by timestamp and clock
         | sequence is still possible_.
         | 
         | The proceeding sentence makes the intent clear, that a naive
         | sort should preserve time ordering. The only way for this to
         | work is for the timestamp and sequence counter bits to precede
         | the optional machine ID.
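
The ordering argument can be illustrated with a toy layout. A minimal sketch, assuming made-up field widths (48-bit timestamp, 12-bit sequence counter, 16-bit machine ID) rather than the draft's actual bit layout.

```python
TS_BITS, SEQ_BITS, MACHINE_BITS = 48, 12, 16  # hypothetical widths

def pack(timestamp: int, seq: int, machine: int, machine_first: bool) -> int:
    """Pack the three fields into one integer in either field order."""
    if machine_first:
        return (machine << (TS_BITS + SEQ_BITS)) | (timestamp << SEQ_BITS) | seq
    return (timestamp << (SEQ_BITS + MACHINE_BITS)) | (seq << MACHINE_BITS) | machine

# (timestamp, sequence, machine ID), listed in creation order
events = [(1000, 0, 7), (1001, 0, 2), (1001, 1, 9), (1002, 0, 1)]

machine_after = sorted(events, key=lambda e: pack(*e, machine_first=False))
machine_before = sorted(events, key=lambda e: pack(*e, machine_first=True))

print(machine_after == events)   # creation order preserved
print(machine_before == events)  # grouped by machine instead
```

With the machine ID after the timestamp and counter, a naive sort keeps creation order, so a time-range scan touches one contiguous run of keys. With the machine ID first, the same time range is split into one run per machine.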
        
           | Wevah wrote:
           | I think there was some proceeding/preceding confusion in GP.
           | I misread it at first, too.
        
       ___________________________________________________________________
       (page generated 2021-08-06 23:00 UTC)