[HN Gopher] Sortable Collision-Free UUIDs
       ___________________________________________________________________
        
       Sortable Collision-Free UUIDs
        
       Author : kpdemetriou
       Score  : 41 points
       Date   : 2021-05-03 20:14 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dralley wrote:
       | What is the difference between this and ULID? The latter is
       | implemented in dozens of languages already, and many ULID
       | libraries support conversion into UUID format.
       | 
       | https://github.com/ulid/spec
       | 
       | "Universally Unique Lexicographically Sortable Identifier"
        
       | punnerud wrote:
       | The Python-code behind this:                 ts = int(time() - 16
       | \* 10 \*\* 8)       return ts.to_bytes(4, "big") + os.urandom(12)
        
       | the_mitsuhiko wrote:
       | 4 byte timestamp is brave.                   int(time() - 16 * 10
       | ** 8)
        
       | rekwah wrote:
       | Looks similar to ULID[0] (I am the author of a popular python
       | implementation[1]).
       | 
       | It appears to have a similar constraint that two ID's generated
       | within the same timestamp (ms, ns) have no strong guarantee of
       | ordering. That might not be a deal breaker depending on your use
       | case but something to consider.
       | 
       | * https://github.com/ulid/spec
       | 
       | * https://github.com/ahawker/ulid
        
       | laurent123456 wrote:
       | I think the issue with a lib like this is that people might use
       | it, thinking they don't need the IDs to be secure... until they
       | need to be. But by then plenty would have been generated, and a
       | lot of code would rely on this sortable property.
       | 
       | In fact, I don't see the point of a library like this, it's
       | trying to encode two pieces of information into one string. Why
       | is that? Why not encode geolocation or IP while they're at it. If
       | some data is important, like the order, then a separate database
       | column can easily be used and that would make for a much more
       | durable and reliable solution.
        
         | koolba wrote:
         | Sortable IDs have pleasant properties on insertion in
         | traditional btree indexes as all the new values are on one edge
         | of the tree. Truly random IDs end up with random I/O on the
         | index tree.
         | 
         | With a database like postgres with full page writes enabled, it
         | can blow out quite quickly at scale.
        
           | WorldMaker wrote:
           | Though something to point out that this project misses is
           | that UUID sorting is its whole own weird can of worms. I just
           | went through this exercise trying to get ULID to GUID
           | conversion/round-tripping that matched MS SQL Server's GUID
           | sort order in hopes for the benefits of those small, pleasant
           | properties.
           | 
           | In short, because SQL Server's sort order is based on
           | GUID/UUID v1, the "group" that matters most significantly for
           | sort order (in SQL Server, but other GUID tools are
           | different) is the tail 6 bytes (not the head bytes as these
           | FUUIDs are using which would allow them to sort in unix sort
           | in GUID form). (As the tail 6 bytes are the "machine ID" in
           | v1 UUIDs.)
           | 
           | For anyone curious how deep (and which endian) the rabbit
           | hole goes in GUID sorting, I kept referring to Raymond Chen's
           | somewhat comprehensive description of GUID sort orders across
           | Microsoft products in my efforts: https://devblogs.microsoft.
           | com/oldnewthing/20190426-00/?p=10...
        
           | sroussey wrote:
           | It's worse with clustered indexes -- all of the data
           | experiences this issue, not just the index.
        
             | munk-a wrote:
             | At least in Postgres you have the advantage/disadvantage
             | that real-time clustered tables aren't a feature on offer -
             | so that re-clustering cost is eaten on a periodic basis
             | whenever you invoke a re-cluster operation on the table in
             | question.
             | 
             | Additionally re-reading things closely I believe this
             | package wouldn't benefit databases at all since the
             | "sortable" form appears to be the common string
             | representation while postgres will internally store UUIDs
             | (at least in a UUID column) as their binary values -
             | leading to indexing still being out of order unless you
             | specify a ::TEXT casting.
        
         | Jiejeing wrote:
         | I agree, to an extent, with your comment. But having
         | sortable/chronological IDs on a single column is a really nice
         | feature to have.
         | 
         | Obviously, if you rely on the IDs to be unguessable/brute-
         | forceable as part of your access control, then it is certainly
         | bad to use such IDs. But I have not yet seen a case where I
         | need both sortable and secure IDs. Being collision-free is what
         | we want in a database context, and if it can hold some more
         | information for the order, all the better.
        
         | rndgermandude wrote:
         | Secure? These fuuids are supposed to replace regular uuids,
         | which aren't specified to be secure either. UUIDs are supposed
         | to be quasi-unique (very low probability of collisions), not
         | secure. They are, in my humble opinion, the wrong tool if you
         | want secure.
         | 
         | >Why not encode geolocation or IP while they're at it.
         | 
         | Version 1 (and 2) of UUIDs encoded the MAC-Address + ns
         | timestamp. So there you go.
         | 
         | Version 3 and 5 just hash a namespace string and a name string
         | with md5 and sha1 respectively (same inputs generate same
         | output)
         | 
         | Version 4, which these days is used by almost everything, just
         | uses whatever pseudo-random numbers the implementation chose to
         | use. If you can figure out what PRNG (pseudo-random number
         | generator) was used (if any) and the internal state of that
         | PRNG, then you may be able to guess/forge UUIDs. The spec does
         | not mandate using a cryptographically-secure PRNG, although a
         | lot of implementations out there today will probably use one.
         | 
         | Anyway, thinking UUIDs would be secure is wishful thinking,
         | unless you did the legwork to make sure you control the
         | generator, and the generator you use is using Version 4 with a
         | cryptographically-secure PRNG, and you can guarantee that that
         | will not change in the future; then you get 122 bits of secure
         | random numbers (+ 6 static bits for version and variant).
         | 
         | By comparison, this implementation of "fuuids" gives you 96
         | bits of secure-enough random numbers or 64 bits in the
         | nanosecond case (it uses os.urandom(12) and os.urandom(8),
         | compared to os.urandom(16) that the standard lib in python
         | uses)
        
           | laurent123456 wrote:
           | Version 4 would still be more secure in the sense that it
           | doesn't leak any extra info and as you note it can be
           | generated using a a cryptographically-secure PRNG.
           | 
           | With sortable IDs, you could potentially know when some item
           | was generated, which may or may not be useful info, but why
           | take the risk? I think it makes sense to minimise the data
           | that the app or service exposes. With a regular uuid you can
           | control whether you expose or not the creation time, while
           | with sortable ids you can't.
        
       | swyx wrote:
       | i have a loosely curated list of "State of the Art of UUIDs"
       | here: https://github.com/sw-yx/uuid-list/
       | 
       | just added FUUIDs, thanks OP
        
       | js2 wrote:
       | I'd advise UUID v6 over this which is at least an RFC 4122
       | extension. As coded, this isn't UUID compatible other than being
       | 128 bits.
       | 
       | http://gh.peabody.io/uuidv6/
       | 
       | Also some recent similar submissions:
       | 
       | Timeflake is a 128-bit, roughly-ordered, URL-safe UUID.
       | 
       | https://news.ycombinator.com/item?id=25870482
       | 
       | https://github.com/anthonynsimon/timeflake
       | 
       | ULIDs:
       | 
       | https://news.ycombinator.com/item?id=18768909
       | 
       | Sonyflake:
       | 
       | https://news.ycombinator.com/item?id=25592325
       | 
       | KSUIDs (can't find any discussion here):
       | 
       | https://github.com/segmentio/ksuid
       | 
       | This comment lists other prior art:
       | 
       | https://github.com/bradleypeabody/gouuidv6/issues/3
        
       | halfmatthalfcat wrote:
       | I started looking into TSID/KSUID/ULID in order to support cursor
       | based pagination schemes in GraphQL against non-integer based
       | unique id fields (such as uuids or unique string ids).
       | 
       | A couple of notable Java libs:
       | 
       | https://github.com/f4b6a3/ulid-creator
       | 
       | https://github.com/f4b6a3/tsid-creator
       | 
       | https://github.com/akhawaja/ksuid
        
       | Lazare wrote:
       | First, for something like this, the details matter a lot. How
       | many bits of randomness is this, how many bits used for the
       | timestamp, what's the format, why does it yield the advantages
       | claimed, and most of all, how does it provide collision
       | resistance? Is it using a MAC address (like UUID v1/v6) or a
       | custom namespace (like UUID v5), or what?
       | 
       | In this case...looking at the code.... It exposes two formats.
       | 
       | Format 1: 4 bytes to store the number of seconds since a custom
       | epoch of... Sep 13 2020 12:26:40, and 12 bytes of randomness.
       | Puzzling choice.
       | 
       | Format 2: 8 bytes to store the number of nanoseconds since the
       | same custom epoch, and 8 bytes of randomness.
       | 
       | And the collision resistance...doesn't exist at all; it's just
       | relying on 12 (or 8) bytes of randomness and a time prefix. Which
       | seems like it'll be totally fine in practice, but if you actually
       | care about collisions (and in my experience, almost everyone who
       | thinks they do really doesn't), this is unlikely to be an optimal
       | choice.
       | 
       | Functionally, I think this is closest to KSUIDs
       | (https://github.com/segmentio/ksuid) which are also the result of
       | combining a timestamp (with a custom epoch) and some randomness,
       | and then concatenating them in a way which is not a valid UUID,
       | will work efficiently as a DB index, and is extremely unlikely to
       | collide. But KSUIDs are much more widely adopted and better
       | documented.
        
       | natch wrote:
       | Cool. I don't see the readme warning about this, so:
       | 
       | Use wisely and with full awareness. Revealing creation sequence
       | can leak information which can be an issue in scenarios where
       | such information might compromise security.
        
       | ForHackernews wrote:
       | Hold on, aren't UUIDv1 already generated based on a timestamp?
       | What's the point of this?
        
       | karmakaze wrote:
       | OT: Github READMEs have got to the point where they don't say
       | what it actually does, how it does it, things you can depend on
       | or not. They are now QUICKSTART.md. With the dependency
       | insecurities going around, why would anyone think this is okay?
        
       | sroussey wrote:
       | Is this a time swapped version of UUID v1? Like what MySQL will
       | do when converting v1 uuid to binary for use in an index (when
       | using the swap flag)?
        
       | lilyball wrote:
       | This really needs an explanation of how the "UUID" is generated.
       | What properties does it have? How secure is it against attackers?
        
       | pkulak wrote:
       | Are they sortable as a string or as binary? Once you deploy
       | something like this, that becomes important because you'll end up
       | storing binary and string representations in different spots,
       | unless you're _really_ careful (and DBs can insert way faster if
       | you're always doing so near the end of the primary key sorting).
       | And unless you use a totally custom string function, you can't
       | have both, since base64 has letters before numbers, and ASCII is
       | vice versa. It can be a real pain.
        
         | munk-a wrote:
         | The examples use both letters and numbers in the ID composition
         | so I would guess not - in all honesty though your question
         | needs some clarification because the string representation of
         | UUIDs is pretty fluid - there's the generally expressed version
         | like `01324332-f66a-054a-76e4-fbdc7f772cd1` which appears to be
         | what this generator favors - but for indexing UUIDs are usually
         | considered to be their binary values so it's highly unlikely
         | that a DB would see anything beyond a coincidental marginal
         | benefit from using this package.
         | 
         | It's hard to tell since the package says "sortable using UNIX
         | sort" so it's not a format agnostic sorter - but it's also not
         | conclusive whether the author meant the common human-readable
         | strings can be sorted using UNIX sort or the binary strings can
         | be sorted using UNIX sort.
        
       | edoceo wrote:
       | also this https://news.ycombinator.com/item?id=27004098
       | 
       | w3c distributed id spec/primer
        
       ___________________________________________________________________
       (page generated 2021-05-03 23:01 UTC)