[HN Gopher] Sortable Collision-Free UUIDs
___________________________________________________________________
Sortable Collision-Free UUIDs
Author : kpdemetriou
Score : 41 points
Date : 2021-05-03 20:14 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dralley wrote:
| What is the difference between this and ULID? The latter is
| implemented in dozens of languages already, and many ULID
| libraries support conversion into UUID format.
|
| https://github.com/ulid/spec
|
| "Universally Unique Lexicographically Sortable Identifier"
| punnerud wrote:
| The Python-code behind this: ts = int(time() - 16
| \* 10 \*\* 8) return ts.to_bytes(4, "big") + os.urandom(12)
| the_mitsuhiko wrote:
| 4 byte timestamp is brave. int(time() - 16 * 10
| ** 8)
| rekwah wrote:
| Looks similar to ULID[0] (I am the author of a popular python
| implementation[1]).
|
| It appears to have a similar constraint that two ID's generated
| within the same timestamp (ms, ns) have no strong guarantee of
| ordering. That might not be a deal breaker depending on your use
| case but something to consider.
|
| * https://github.com/ulid/spec
|
| * https://github.com/ahawker/ulid
| laurent123456 wrote:
| I think the issue with a lib like this is that people might use
| it, thinking they don't need the IDs to be secure... until they
| need to be. But by then plenty would have been generated, and a
| lot of code would rely on this sortable property.
|
| In fact, I don't see the point of a library like this, it's
| trying to encode two pieces of information into one string. Why
| is that? Why not encode geolocation or IP while they're at it. If
| some data is important, like the order, then a separate database
| column can easily be used and that would make for a much more
| durable and reliable solution.
| koolba wrote:
| Sortable IDs have pleasant properties on insertion in
| traditional btree indexes as all the new values are on one edge
| of the tree. Truly random IDs end up with random I/O on the
| index tree.
|
| With a database like postgres with full page writes enabled, it
| can blow out quite quickly at scale.
| WorldMaker wrote:
| Though something to point out that this project misses is
| that UUID sorting is its whole own weird can of worms. I just
| went through this exercise trying to get ULID to GUID
| conversion/round-tripping that matched MS SQL Server's GUID
| sort order in hopes for the benefits of those small, pleasant
| properties.
|
| In short, because SQL Server's sort order is based on
| GUID/UUID v1, the "group" that matters most significantly for
| sort order (in SQL Server, but other GUID tools are
| different) is the tail 6 bytes (not the head bytes as these
| FUUIDs are using which would allow them to sort in unix sort
| in GUID form). (As the tail 6 bytes are the "machine ID" in
| v1 UUIDs.)
|
| For anyone curious how deep (and which endian) the rabbit
| hole goes in GUID sorting, I kept referring to Raymond Chen's
| somewhat comprehensive description of GUID sort orders across
| Microsoft products in my efforts: https://devblogs.microsoft.
| com/oldnewthing/20190426-00/?p=10...
| sroussey wrote:
| It's worse with clustered indexes -- all of the data
| experiences this issue, not just the index.
| munk-a wrote:
| At least in Postgres you have the advantage/disadvantage
| that real-time clustered tables aren't a feature on offer -
| so that re-clustering cost is eaten on a periodic basis
| whenever you invoke a re-cluster operation on the table in
| question.
|
| Additionally re-reading things closely I believe this
| package wouldn't benefit databases at all since the
| "sortable" form appears to be the common string
| representation while postgres will internally store UUIDs
| (at least in a UUID column) as their binary values -
| leading to indexing still being out of order unless you
| specify a ::TEXT casting.
| Jiejeing wrote:
| I agree, to an extent, with your comment. But having
| sortable/chronological IDs on a single column is a really nice
| feature to have.
|
| Obviously, if you rely on the IDs to be unguessable/brute-
| forceable as part of your access control, then it is certainly
| bad to use such IDs. But I have not yet seen a case where I
| need both sortable and secure IDs. Being collision-free is what
| we want in a database context, and if it can hold some more
| information for the order, all the better.
| rndgermandude wrote:
| Secure? These fuuids are supposed to replace regular uuids,
| which aren't specified to be secure either. UUIDs are supposed
| to be quasi-unique (very low probability of collisions), not
| secure. They are, in my humble opinion, the wrong tool if you
| want secure.
|
| >Why not encode geolocation or IP while they're at it.
|
| Version 1 (and 2) of UUIDs encoded the MAC-Address + ns
| timestamp. So there you go.
|
| Version 3 and 5 just hash a namespace string and a name string
| with md5 and sha1 respectively (same inputs generate same
| output)
|
| Version 4, which these days is used by almost everything, just
| uses whatever pseudo-random numbers the implementation chose to
| use. If you can figure out what PRNG (pseudo-random number
| generator) was used (if any) and the internal state of that
| PRNG, then you may be able to guess/forge UUIDs. The spec does
| not mandate using a cryptographically-secure PRNG, although a
| lot of implementations out there today will probably use one.
|
| Anyway, thinking UUIDs would be secure is wishful thinking,
| unless you did the legwork to make sure you control the
| generator, and the generator you use is using Version 4 with a
| cryptographically-secure PRNG, and you can guarantee that that
| will not change in the future; then you get 122 bits of secure
| random numbers (+ 6 static bits for version and variant).
|
| By comparison, this implementation of "fuuids" gives you 96
| bits of secure-enough random numbers or 64 bits in the
| nanosecond case (it uses os.urandom(12) and os.urandom(8),
| compared to os.urandom(16) that the standard lib in python
| uses)
| laurent123456 wrote:
| Version 4 would still be more secure in the sense that it
| doesn't leak any extra info and as you note it can be
| generated using a a cryptographically-secure PRNG.
|
| With sortable IDs, you could potentially know when some item
| was generated, which may or may not be useful info, but why
| take the risk? I think it makes sense to minimise the data
| that the app or service exposes. With a regular uuid you can
| control whether you expose or not the creation time, while
| with sortable ids you can't.
| swyx wrote:
| i have a loosely curated list of "State of the Art of UUIDs"
| here: https://github.com/sw-yx/uuid-list/
|
| just added FUUIDs, thanks OP
| js2 wrote:
| I'd advise UUID v6 over this which is at least an RFC 4122
| extension. As coded, this isn't UUID compatible other than being
| 128 bits.
|
| http://gh.peabody.io/uuidv6/
|
| Also some recent similar submissions:
|
| Timeflake is a 128-bit, roughly-ordered, URL-safe UUID.
|
| https://news.ycombinator.com/item?id=25870482
|
| https://github.com/anthonynsimon/timeflake
|
| ULIDs:
|
| https://news.ycombinator.com/item?id=18768909
|
| Sonyflake:
|
| https://news.ycombinator.com/item?id=25592325
|
| KSUIDs (can't find any discussion here):
|
| https://github.com/segmentio/ksuid
|
| This comment lists other prior art:
|
| https://github.com/bradleypeabody/gouuidv6/issues/3
| halfmatthalfcat wrote:
| I started looking into TSID/KSUID/ULID in order to support cursor
| based pagination schemes in GraphQL against non-integer based
| unique id fields (such as uuids or unique string ids).
|
| A couple of notable Java libs:
|
| https://github.com/f4b6a3/ulid-creator
|
| https://github.com/f4b6a3/tsid-creator
|
| https://github.com/akhawaja/ksuid
| Lazare wrote:
| First, for something like this, the details matter a lot. How
| many bits of randomness is this, how many bits used for the
| timestamp, what's the format, why does it yield the advantages
| claimed, and most of all, how does it provide collision
| resistance? Is it using a MAC address (like UUID v1/v6) or a
| custom namespace (like UUID v5), or what?
|
| In this case...looking at the code.... It exposes two formats.
|
| Format 1: 4 bytes to store the number of seconds since a custom
| epoch of... Sep 13 2020 12:26:40, and 12 bytes of randomness.
| Puzzling choice.
|
| Format 2: 8 bytes to store the number of nanoseconds since the
| same custom epoch, and 8 bytes of randomness.
|
| And the collision resistance...doesn't exist at all; it's just
| relying on 12 (or 8) bytes of randomness and a time prefix. Which
| seems like it'll be totally fine in practice, but if you actually
| care about collisions (and in my experience, almost everyone who
| thinks they do really doesn't), this is unlikely to be an optimal
| choice.
|
| Functionally, I think this is closest to KSUIDs
| (https://github.com/segmentio/ksuid) which are also the result of
| combining a timestamp (with a custom epoch) and some randomness,
| and then concatenating them in a way which is not a valid UUID,
| will work efficiently as a DB index, and is extremely unlikely to
| collide. But KSUIDs are much more widely adopted and better
| documented.
| natch wrote:
| Cool. I don't see the readme warning about this, so:
|
| Use wisely and with full awareness. Revealing creation sequence
| can leak information which can be an issue in scenarios where
| such information might compromise security.
| ForHackernews wrote:
| Hold on, aren't UUIDv1 already generated based on a timestamp?
| What's the point of this?
| karmakaze wrote:
| OT: Github READMEs have got to the point where they don't say
| what it actually does, how it does it, things you can depend on
| or not. They are now QUICKSTART.md. With the dependency
| insecurities going around, why would anyone think this is okay?
| sroussey wrote:
| Is this a time swapped version of UUID v1? Like what MySQL will
| do when converting v1 uuid to binary for use in an index (when
| using the swap flag)?
| lilyball wrote:
| This really needs an explanation of how the "UUID" is generated.
| What properties does it have? How secure is it against attackers?
| pkulak wrote:
| Are they sortable as a string or as binary? Once you deploy
| something like this, that becomes important because you'll end up
| storing binary and string representations in different spots,
| unless you're _really_ careful (and DBs can insert way faster if
| you're always doing so near the end of the primary key sorting).
| And unless you use a totally custom string function, you can't
| have both, since base64 has letters before numbers, and ASCII is
| vice versa. It can be a real pain.
| munk-a wrote:
| The examples use both letters and numbers in the ID composition
| so I would guess not - in all honesty though your question
| needs some clarification because the string representation of
| UUIDs is pretty fluid - there's the generally expressed version
| like `01324332-f66a-054a-76e4-fbdc7f772cd1` which appears to be
| what this generator favors - but for indexing UUIDs are usually
| considered to be their binary values so it's highly unlikely
| that a DB would see anything beyond a coincidental marginal
| benefit from using this package.
|
| It's hard to tell since the package says "sortable using UNIX
| sort" so it's not a format agnostic sorter - but it's also not
| conclusive whether the author meant the common human-readable
| strings can be sorted using UNIX sort or the binary strings can
| be sorted using UNIX sort.
| edoceo wrote:
| also this https://news.ycombinator.com/item?id=27004098
|
| w3c distributed id spec/primer
___________________________________________________________________
(page generated 2021-05-03 23:01 UTC)