[HN Gopher] Type-safe, K-sortable, globally unique identifier in...
       ___________________________________________________________________
        
       Type-safe, K-sortable, globally unique identifier inspired by
       Stripe IDs
        
       Author : dloreto
       Score  : 223 points
       Date   : 2023-06-28 16:28 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | klabb3 wrote:
       | A couple of suggestions:
       | 
       | Lock down the prefix string _now_ before it's too late and
       | document it. I see in Go that it's lowercase ascii, which seems
       | fine except for compound types (like "article-comment"). May be
       | worth looking at allowing a single separator given that many
       | complex projects (and ORMs) can't avoid them.
       | 
       | The Go implementation has no tests. This is very unit-testable.
       | Add tests goddammit!
       | 
       | For Go, I'd align with Google's UUID implementation, with proper
       | parse functions and an internal byte array instead of strings.
       | Strings are for rendering (and in your case, the prefix). Right
       | now, it looks like the parsing is too permissive, and goes into
       | generation mode if the suffix is empty. And the SplitN+index
       | thing will panic if no underscores, no? Anyway, tests will tell.
       | 
       | As for the actual design decisions, I tried to poke holes but I
       | fold! I think this strikes the sweet spot between the different
       | tradeoffs. Well done!
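
A minimal sketch of the stricter parsing suggested above, assuming a lowercase-ASCII prefix and a single underscore separator. `ParseID` and its rules are hypothetical and are not taken from the TypeID Go package; decoding the suffix is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ParseID splits a "prefix_suffix" identifier and validates the prefix.
// Hypothetical helper: the rules here (lowercase ASCII prefix, non-empty
// suffix) are assumptions for illustration, not the TypeID spec.
func ParseID(s string) (string, string, error) {
	// strings.Cut reports whether the separator was found, so a missing
	// underscore becomes an error instead of an out-of-range slice index.
	prefix, suffix, found := strings.Cut(s, "_")
	if !found {
		return "", "", errors.New("missing '_' separator")
	}
	// An empty suffix is rejected rather than treated as "generate a new ID".
	if prefix == "" || suffix == "" {
		return "", "", errors.New("empty prefix or suffix")
	}
	for _, r := range prefix {
		if r < 'a' || r > 'z' {
			return "", "", fmt.Errorf("invalid prefix character %q", r)
		}
	}
	return prefix, suffix, nil
}

func main() {
	for _, in := range []string{"user_01h2xcejqtf2nbrexx3vqjhp41", "user", "user_", "User_abc"} {
		prefix, suffix, err := ParseID(in)
		fmt.Printf("%q -> prefix=%q suffix=%q err=%v\n", in, prefix, suffix, err)
	}
}
```

Each of the failure cases mentioned in the comment (no underscore, empty suffix) maps to one table-driven test row.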
        
         | tomcam wrote:
         | > Add tests goddammit!
         | 
         | Hey, you're pretty smart. How about you add them?
        
           | klabb3 wrote:
           | I'm by no means a test police. I'm in fact opposed to a lot
           | of mindless testing for the sake of it. But there are places
           | where unit tests shine, and this is one of them.
           | 
           | If you mean that criticism is only allowed if you are willing
           | to commit labor, I disagree with that. I always welcome
           | critique myself - it may be something that I've missed. The
           | maintainers always have the last word. As long as there are no
           | hidden expectations, it's all good.
        
         | avgcorrection wrote:
         | > The Go implementation has no tests. This is very unit-
         | testable. Add tests goddammit!
         | 
         | Yep. The readme asks people to provide other implementations.
         | Having a test suite would be good for third-party code.
        
         | dloreto wrote:
         | Thanks for the feedback!
         | 
         | We have tests for the base32 encoding which is the most
         | complicated part of the implementation
         | (https://github.com/jetpack-io/typeid-
         | go/blob/main/base32/bas...) but your point stands. We'll add a
         | more rigorous test suite (particularly as the number of
         | implementations across different languages grows, and we want
         | to make sure all the implementations are compatible with each
         | other)
         | 
         | Re: prefix, is the concern that I haven't defined the allowed
         | character set as part of the spec?
        
       | ajanuary wrote:
       | Neat, I like the type safe prefix idea.
       | 
       | Personally, I rarely find I need ids to be sortable, so I just go
       | with pure randomness.
       | 
       | I also like to split out a part of the random section into a tag
       | that is easier for me to visually scan and match up ids in logs
       | etc.
       | 
       | I call my ID format afids [0]
       | 
       | [0] https://github.com/aJanuary/afid
        
       | AtNightWeCode wrote:
       | An important aspect of identifiers is to not leak any information
       | in the identifier. In some scenarios a prefix might be fine but
       | less important things have been blocked by our dpo department.
        
         | clintonb wrote:
         | Requirements depend on the use case. I don't consider the
         | prefix a "leak" and neither does Stripe.
        
         | jeremyjh wrote:
         | Yeah, this idea makes me uneasy as well for the same reason.
         | The natural conclusion would be to have DB frameworks
         | automatically do this based upon code type names and that
         | definitely feels like a leak, handing the world at large a
         | roadmap to your internal system architecture.
        
       | ukuina wrote:
       | Why is it beneficial to sort random IDs?
        
         | davidjfelix wrote:
         | So you don't have to create an additional index to scan them in
         | a somewhat sensible order. createdAt just happens to be a
         | naturally decent scan order.
        
       | lll-o-lll wrote:
       | K-Sortable is a great concept; having weakly sorted keys solves a
       | bunch of use-cases. I really like the idea of a typed, condensed
       | string representation. However I wonder if an unintended side
       | effect of UUID V7 is going to be a bunch of security problems.
       | 
       | People aren't meant to use uuids as tokens, and they aren't
       | supposed to use PKs from a DB for this either - but they do.
       | Because UUID v4 is basically crypto random, I think we've been
       | getting away with a bunch of security weaknesses that would
       | otherwise be exploited.
       | 
       | With UUID v7 we are back to 32bits of actually random data. It's
       | going to require some good educating to teach devs that uuids are
       | _guessable_.
       | 
       | [edit] Looks like I am off base with the guess-ability of the V7
       | UUID, as the draft recommends CSPRNG for the random bits, and the
       | amount of entropy is at least 74 bits and it is specifically
       | designed to be "unguessable". It does say "UUID v4" for anything
       | security related, but perhaps that is simply in regard to the
       | time stamp?
        
       | pphysch wrote:
       | I implemented something similar recently, but opted to write my
       | own UUIDv7 that uses the last 16 flexible bits for a "type ID".
       | That allows 65K different data models, which should be more than
       | enough. Could even partition that further to store a node ID for
       | a globally distributed setup.
       | 
       | So it's got all of the above perks, but it's also an actual UUID
       | and fits in a Postgres UUID column.
       | 
       | It's very cool to be able to resolve a UUID to a particular
       | database table and record with almost zero performance overhead
       | (cached table lookup + indexed record select).
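
A rough sketch of the idea in this comment, assuming the UUIDv7 draft layout (48-bit millisecond timestamp, version and variant bits, random remainder) with the final 16 bits overwritten by a type ID. `NewTaggedV7` and `TypeOf` are hypothetical names, not the commenter's code.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"time"
)

// NewTaggedV7 builds a UUIDv7-shaped value and stores typeID in the last
// two bytes. Hypothetical sketch: whether this still counts as "v7" is
// exactly what the replies debate.
func NewTaggedV7(typeID uint16, unixMilli uint64) ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		return u, err
	}
	// Bytes 0..5: big-endian 48-bit Unix timestamp in milliseconds.
	for i := 0; i < 6; i++ {
		u[i] = byte(unixMilli >> (8 * (5 - i)))
	}
	u[6] = (u[6] & 0x0F) | 0x70 // version nibble = 7
	u[8] = (u[8] & 0x3F) | 0x80 // variant bits = 10
	// Bytes 14..15: the type ID replaces 16 of the random bits.
	binary.BigEndian.PutUint16(u[14:], typeID)
	return u, nil
}

// TypeOf reads the type ID back out of the last two bytes.
func TypeOf(u [16]byte) uint16 {
	return binary.BigEndian.Uint16(u[14:])
}

func main() {
	u, err := NewTaggedV7(42, uint64(time.Now().UnixMilli()))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x type=%d\n", u[:], TypeOf(u))
}
```

The value is still 16 bytes, so it fits a Postgres `uuid` column; the type lookup is a two-byte read.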
        
         | Daegalus wrote:
         | Ideally, you should version that as UUIDv8. Since that is a
         | more custom implementation, but I guess changing the random
         | bits with type info works fine for UUIDv7, they just aren't
         | random anymore.
        
           | pphysch wrote:
           | Not every bit of UUID is required to be random.
           | 
           | The goals are smallness, uniqueness, monotonicity, resistance
           | to enumeration attacks, etc. Not randomness for randomness
           | sake.
           | 
           | My UUIDv7+ can be consumed as a standard UUIDv7. It is not
           | intended to be v8. A program can treat the last 16 bits as
           | random noise if it wants.
        
             | jeremyjh wrote:
             | Well UUIDv7 can be consumed as a UUIDv4 in the same way,
             | it's just 16 bytes. The point of the standard is to define
             | _how_ the particular bytes are chosen.
        
               | pphysch wrote:
               | The latest standard for v7 does not meaningfully describe
               | how to interpret the last segment.
               | 
               | It says they could be pseudorandom and non-monotonic. Or
               | it could be monotonic and non-random. These are
               | completely disjoint cases! "X or not X" is tautological.
               | And there is no way to determine which (e.g. there
               | _could_ be a flag that indicates this mode, but there is
               | not).
               | 
               | To be clear, the standard should be amended to resolve
               | this ambiguity. Say the last bits MAY be monotonic or MAY
               | be pseudorandom. Or add a flag that indicates which.
               | 
               | As there is currently no standard way to interpret these
               | bits, I feel perfectly justified in using a few of
               | the least significant ones to encode additional
               | information.
        
               | jeremyjh wrote:
               | I think the purpose of the standard is so that different
               | software implementations work the same way, so that once
               | you've picked a standard, you can use it everywhere and
               | know that keys are assigned the same way regardless of
               | which software stack is generating a particular key. It's
               | not so that systems can "interpret" it. Obviously they
               | are your bytes to use however you want if you are rolling
               | your own generator.
        
               | pphysch wrote:
               | > you can use it everywhere and know that keys are
               | assigned the same way regardless of which software stack
               | is generating a particular key.
               | 
               | But even if you follow the standard to a tee, you cannot
               | infer anything about how the last 62 bits were assigned.
               | That is my point!
        
               | dralley wrote:
               | Postgresql doesn't care, it's not going to "interpret"
               | those bits, it is just a 128-bit integer.
        
               | pphysch wrote:
               | And I'm glad for it, because I could implement this
               | without needing an extension or update to PostgreSQL.
        
             | hfkwer wrote:
             | > It is not intended to be v8.
             | 
             | There is already a UUIDv8. It's defined as vendor-specific
             | UUID. https://www.ietf.org/archive/id/draft-peabody-
             | dispatch-new-u...
             | 
             | > Some example situations in which UUIDv8 usage could
             | occur:
             | 
             | > An implementation would like to embed extra information
             | within the UUID other than what is defined in this
             | document.
             | 
             | Isn't that exactly what you are doing?
        
             | Daegalus wrote:
             | I am aware, just saying per spec, it's supposed to be random
             | bit data, that's all I was saying. I am familiar with the spec
             | since I maintain a UUID library that has 6, 7, and a custom
             | 8 implemented.
             | 
             | It can have extra monotonicity data instead, per section
             | 6.2, but ideally it's random. Again, not saying you can't do
             | what you are doing, I just know per the conversations while
             | the draft was gathering feedback, your type of change was
             | intended to be done as UUIDv8.
        
               | pphysch wrote:
               | > per spec, its supposed to be random bit data
               | 
               | > It can have extra monotonicity data instead
               | 
               | Well, which is it? These are incompatible requirements.
               | 
               | If I give you a standard UUIDv7 sample, it is impossible
               | for you to interpret the last 62 bits. You cannot
               | determine how they were generated. If I give you two
               | samples with the same timestamp, you cannot say which was
               | generated first. These bits are de facto uninterpretable,
               | unlike e.g. the 48 MSB, which have clearly defined
               | semantics.
        
               | Daegalus wrote:
               | Well, that might be an ambiguity that needs to be brought
               | up before it's final if it is an issue.
               | 
               | So if we look at https://www.ietf.org/archive/id/draft-
               | ietf-uuidrev-rfc4122bi...
               | 
               | For list item #3 it says "Random data for each new UUIDv7
               | generated for any remaining space." without the word
               | "optional" and the bit layout diagram says `rand_b`
               | 
               | But when you read the description for `rand_b` it says:
               | "The final 62 bits of pseudo-random data to provide
               | uniqueness as per Section 6.8 and/or an optional counter
               | to guarantee additional monotonicity as per Section 6.2."
               | 
               | Reading section 6.2
               | https://www.ietf.org/archive/id/draft-ietf-uuidrev-
               | rfc4122bi..., it all involves incrementing counters, or
               | other monotonic random data.
               | 
               | If you can guarantee that your custom uuidv7 is globally
               | unique for 10000 values per second or more, I don't see
               | why you can't do what you do and treat your custom data
               | as random outside of your implementation.
               | 
               | I think part of this is my mistake, because I assumed you
               | replaced most of the random data with information, but
               | reading it now, I read that you replaced just the last 16
               | bits. Also since most people used random data for
               | UUIDv1's remaining 48bits of `node` then your variation
               | is no worse than UUIDv1 (or 6) while also being
               | compatible with v7.
               | 
               | I think I just got too caught up on the bit layout
               | calling it `random` and misread your information. Sorry
               | for the misunderstanding, and thanks for discussing it.
        
       | stephen wrote:
       | Neat! Love the "type-safe" prefix; we'd called them "tagged ids"
       | in our ORM that auto-prefixes the otherwise-ints-in-the-db with
       | similar per-entity tags:
       | 
       | https://joist-orm.io/docs/advanced/tagged-ids
       | 
       | We'd used `:` as our delimiter, but kinda regretting not using
       | `_` because of the "double-click to copy/paste" aspect...
       | 
       | In theory it'd be really easy to get Joist to take "uuid columns
       | in the db" and turn them into "typeids in the domain model", but
       | probably not something that could be configured/done via userland
       | atm...that'd be a good idea though.
        
         | wongarsu wrote:
         | Reddit does something similar, but optimized for string length:
         | elements have ids like "t3_15bfi0" where t3_ is a prefix for
         | the type (t3 is a post, t1 a comment, t5 a subreddit, etc) and
         | the remaining is a base36 encoding of the autoincrementing
         | primary key.
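
A small sketch of that scheme, assuming the part after the underscore is a plain base36 integer. `FormatTagged` and `ParseTagged` are hypothetical helpers, not Reddit's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FormatTagged renders a type tag plus an integer key in base36.
func FormatTagged(tag string, key int64) string {
	return tag + "_" + strconv.FormatInt(key, 36)
}

// ParseTagged splits the tag from the key and decodes the base36 key.
func ParseTagged(id string) (string, int64, error) {
	tag, rest, found := strings.Cut(id, "_")
	if !found {
		return "", 0, fmt.Errorf("no '_' in %q", id)
	}
	key, err := strconv.ParseInt(rest, 36, 64)
	if err != nil {
		return "", 0, err
	}
	return tag, key, nil
}

func main() {
	tag, key, err := ParseTagged("t3_15bfi0")
	if err != nil {
		panic(err)
	}
	fmt.Println(tag, key, FormatTagged(tag, key))
}
```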
        
       | swyx wrote:
       | for those researching this topic, I keep a list of these
       | UUID/GUID implementations!
       | 
       | https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
        
         | crdrost wrote:
         | Thanks for this!
         | 
         | I have one idea which is perhaps nerdy enough to make the list
         | but I've never fully fleshed it out, it's that one can encode
         | the nonnegative integers {0, 1, 2, ...} into the finite
         | bitstrings {0, 1}* in a way which preserves ordering.
         | 
         | So if we use hexits for the encoding the idea would be that
         | 0=0, 1=1, ... E=14, then
         | 
         |     F00 = 15
         |     F01 = 16
         |     ...
         |     F0F = 30
         |     F100 = 31
         |     F101 = 32
         |     ...
         |     F1FF = 286
         |     F2000 = 287
         | 
         | so the format is F, which is the overflow sigil, followed by a
         | recursive representation of the length of the coming string,
         | followed by a string of hexits that long.
         | 
         | What if you need 16 hexits? That's where the recursion comes
         | in,
         | 
         |     F F00 0123456789ABCDEF
         |      \  \   \
         |       \  \   \----- 16 hexits
         |        \  \
         |         \  \-- the number 15, "there are 15+1 digits to follow"
         |          \      (consisting of overflow, 0+1 digits to follow, and
         |           \      hex 0)
         |            \
         |             \--- overflow sigil
         | 
         | Kind of goofy but would allow a bunch of things like "timestamp
         | * 1024 + 10-bit machine ID" etc without worrying about the size
         | of the numbers involved
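
One way the encoding described above could be written down, as a sketch limited to `uint64` inputs. `Encode` is a hypothetical name, and the offsets are inferred from the examples in the comment (F00 = 15, F100 = 31, F1FF = 286).

```go
package main

import "fmt"

// Encode maps n to an uppercase hex string whose byte-wise (lexicographic)
// order matches numeric order, following the scheme sketched in the comment.
func Encode(n uint64) string {
	if n < 15 {
		return fmt.Sprintf("%X", n) // 0..E encode as a single hexit
	}
	offset := uint64(15) // smallest value that uses `width` payload hexits
	span := uint64(16)   // how many values fit in `width` hexits
	width := 1
	// Stop at 16 hexits: that covers every uint64, and span would overflow.
	for width < 16 && n-offset >= span {
		offset += span
		span *= 16
		width++
	}
	// "F" sigil, then the recursively encoded (width-1), then the payload.
	return "F" + Encode(uint64(width-1)) + fmt.Sprintf("%0*X", width, n-offset)
}

func main() {
	for _, n := range []uint64{0, 14, 15, 30, 31, 286, 287} {
		fmt.Printf("%d -> %s\n", n, Encode(n))
	}
}
```

Because every encoding is self-delimiting, comparing two encoded strings byte by byte gives the same answer as comparing the integers.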
        
       | changoplatanero wrote:
       | what does k-sortable mean?
        
         | AtNightWeCode wrote:
         | That it is a nearly sorted sequence. Typically the first part
         | of the ID is a timestamp. So you may sort the IDs down to the
         | second it was created. But for IDs created at the same time the
         | order is random. This can be used for caching, database
         | performance and so on.
        
         | sixtram wrote:
         | [dead]
        
         | tomnipotent wrote:
         | That if a hundred servers are generating (uuid, timestamp)
         | tuples that are subsequently merged on a single machine, and
         | sorted by uuid, it would have almost the same order as if
         | sorted by timestamp.
         | 
         | This property is useful for RDBMS writes, when the UUID is used
         | as a primary key and this locality ensures that fewer slotted
         | pages need to be modified to write the same amount of data.
        
           | nzgrover wrote:
           | Is that what they mean by "used as the primary key in a
           | database while ensuring good locality"/"database locality"?
           | That read/write access will hit fewer disk pages?
        
             | netcraft wrote:
             | yes, exactly
        
           | nickjj wrote:
           | > it would have almost the same order as if sorted by
           | timestamp.
           | 
           | Is there documentation covering the scenarios on how the
           | order can become out of sync and what the odds are? There's a
           | big difference between "almost" and "always" if we're talking
           | about using this as a database PK.
        
             | michaelt wrote:
             | The key seems to be based on UUIDv7, starting with a
             | timestamp in milliseconds.
             | 
             | So the order can become out of sync if multiple events
             | happen in the same millisecond; or if your servers' clock
             | error is greater than a millisecond (i.e. if you're an NTP
             | user)
             | 
             | More than sufficient for things like ordering tweets. If
             | you're ordering bank account transactions, well, you'd
             | probably be using transactions in an ACID-compliant
             | relational database.
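
To make the "almost sorted" property concrete, here is a simplified sketch: a 16-byte ID with a 6-byte millisecond timestamp followed by random bytes. It is UUIDv7-like only; the version and variant bits are deliberately omitted, and `ID`, `NewID`, `Timestamp` and `Less` are hypothetical names.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
	"sort"
)

// ID is 6 bytes of big-endian Unix milliseconds followed by 10 random bytes.
type ID [16]byte

// NewID builds an ID for the given millisecond timestamp.
func NewID(unixMilli uint64) (ID, error) {
	var id ID
	for i := 0; i < 6; i++ {
		id[i] = byte(unixMilli >> (8 * (5 - i)))
	}
	if _, err := rand.Read(id[6:]); err != nil {
		return id, err
	}
	return id, nil
}

// Timestamp recovers the millisecond timestamp from the first 6 bytes.
func Timestamp(id ID) uint64 {
	var ms uint64
	for _, b := range id[:6] {
		ms = ms<<8 | uint64(b)
	}
	return ms
}

// Less orders IDs byte-wise, which is how a binary index would see them.
func Less(a, b ID) bool { return bytes.Compare(a[:], b[:]) < 0 }

func main() {
	var ids []ID
	for _, ms := range []uint64{1003, 1001, 1002, 1001} {
		id, err := NewID(ms)
		if err != nil {
			panic(err)
		}
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return Less(ids[i], ids[j]) })
	for _, id := range ids {
		fmt.Printf("%d %x\n", Timestamp(id), id[:])
	}
}
```

IDs from different milliseconds always sort by time; the two IDs stamped 1001 land next to each other but in an order decided by their random bytes.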
        
             | tomnipotent wrote:
             | > There's a big difference between "almost" and "always"
             | 
             | Not in the context of an RDBMS, which use b+/b*-tree
             | variants (or LSM sstables). Sequentially generated UUID's
             | will end up near each other when sorted lexicographically,
             | regardless of the fact that the sort order doesn't
             | perfectly match the timestamp order.
        
       | vikeri wrote:
       | Another, less known, useful thing about these IDs is that you can
       | double click on them and the full id will always be selected
        
         | avarun wrote:
         | It's mentioned in the README. "can be selected for copy-pasting
         | by double-clicking"
        
         | Eduard wrote:
         | Also, they are safe to use within filenames and directory names
         | (filesystem paths) without conversion (at least on today's
         | filesystems, which aren't limited to e.g. 8.3 names).
         | 
         | Compare that with otherwise nice ISO 8601 datetime format (e.g.
         | 2023-06-28T21:47:59+00:00): it requires conversion for file
         | systems that don't allow colons and plus signs.
        
         | jrockway wrote:
         | This is a setting in your terminal emulator. For me, plain
         | UUIDs are selected just fine when double clicking.
        
           | mojuba wrote:
           | There's life outside of the terminal. For example you want to
           | double-click on the part of a URL in your browser.
        
       | Xeoncross wrote:
       | Assuming you don't need to use UUIDv7 (or any UUID's) then
       | https://github.com/segmentio/ksuid provides a much bigger
       | keyspace. You could just append a string prefix if you wanted to
       | namespace, but the chance of collisions of a ksuid is many times
       | smaller than a UUID of any version.
       | 
       | ksuid is the best general purpose id generator with sort-able
       | timestamps I've found and has libraries in most languages. UUID
       | v1-7 are wasteful.
        
         | rtheunissen wrote:
         | I moved to ULID because they are always lowercase and therefore
         | case-insensitive.
        
         | tasn wrote:
         | We maintain a couple of popular ksuid libraries[1][2] and use
         | it, so we definitely like ksuid. Though one big issue with
         | ksuid is that being 160bit means that it doesn't fit into
         | native uuid types in databases (e.g. postgres), which means
         | that they come with a performance penalty.
         | 
         | 1: https://github.com/svix/rust-ksuid 2:
         | https://github.com/svix/python-ksuid
        
           | Xeoncross wrote:
           | I'm curious, why do you not store these as binary data or do
           | you and you're saying that the UUID operations are better
           | optimized than sorts on binary data?
        
             | tasn wrote:
             | Exactly what the sibling said, and the same applies to
             | database operations (when they have a uuid type).
        
             | snuxoll wrote:
             | I can compare a 128bit UUID in a single instruction, a
             | 160-bit ksuid is a little weirder to work with at the
             | hardware level.
        
       | iillexial wrote:
       | I didn't get the "type-safe" part. How would it work in Go?
       | 
       | Let's say I have structs:
       | 
       | type User struct { ID TypeID }
       | 
       | type Post struct { ID TypeID }
       | 
       | How can I ensure the correct type is used in each of the structs?
        
         | zeroxfe wrote:
         | It's not a language primitive. It's a data format that
         | _enables_ type safety in libraries or APIs (as opposed to a
         | more opaque data format like UUIDv7.)
        
         | avgcorrection wrote:
         | It's stringly-typed type-safety: check if the value has the
         | expected prefix.
        
         | kibwen wrote:
         | Any time you ever read a string, its type is always just going
         | to be "string" (modulo whatever passes for a "string" in your
         | programming language of choice). To get an actual non-string
         | type, you'd need to parse that string, and presumably your
         | parsing function would read the prefix and reject the string if
         | it was passed an ID whose type doesn't match. So it's
         | dynamically type-safe, if not statically type-safe.
        
         | hfkwer wrote:
         | This isn't about object types in any particular language.
        
         | leetbulb wrote:
         | One way is to enforce in Marshal[0] and Unmarshal[1]
         | 
         | [0] https://pkg.go.dev/encoding/json#Marshaler
         | 
         | [1] https://pkg.go.dev/encoding/json#Unmarshaler
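
A sketch of the Unmarshal half of that suggestion, using the standard `json.Unmarshaler` interface. `UserID`, `Post` and `DecodePost` are hypothetical types for illustration; a `MarshalJSON` method could mirror the same check on the way out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// UserID is a hypothetical ID type that only accepts the "user_" prefix.
type UserID string

// UnmarshalJSON rejects any JSON string that is not a "user_..." ID.
func (id *UserID) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	if !strings.HasPrefix(s, "user_") || len(s) == len("user_") {
		return fmt.Errorf("expected a user_ id, got %q", s)
	}
	*id = UserID(s)
	return nil
}

// Post references its author by a prefix-checked ID.
type Post struct {
	AuthorID UserID `json:"author_id"`
}

// DecodePost decodes a Post, failing if author_id has the wrong prefix.
func DecodePost(raw string) (Post, error) {
	var p Post
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	for _, raw := range []string{`{"author_id":"user_01h2x"}`, `{"author_id":"post_01h2x"}`} {
		p, err := DecodePost(raw)
		fmt.Printf("%s -> %+v err=%v\n", raw, p, err)
	}
}
```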
        
       | atulvi wrote:
       | Naive Question: The type safe part is just appending a string at
       | the beginning? What if I do that with UUIDv4? is
       | user_49b9cd12-9964-4b9c-8512-742f0a2c9be4 type safe now?
        
         | davidjfelix wrote:
         | Yep. The whole point is that you /never/ assign ids that begin
         | with "user" to types that are not users. Because of that, you
         | can be sure nobody can accidentally copy an id that begins with
         | "user" when meaning to address a different type and get back a
         | result other than "not found".
         | 
         | Example:
         | 
         | I have userId=4 and userId=2. Suppose a user can have multiple
         | bank accounts and userId=4 has accountId=5 and accountId=6 and
         | a defaultAccount accountId=5. userId=2 has an account,
         | accountId=7; I want to send userId=4 some money so I use the
         | function `sendUserMoneyFromAccount(to: int, from: int)`. This
         | is a bad interface but these things exist in the wild a lot. I
         | could accidentally assume that because I want to send userId=4
         | the money to their default account that I would call it using
         | `sendUserMoneyFromAccount(4, 7)` and that would work, but if
         | under the hood it wants 2 accountIds, I've just sent
         | accountId=4 money rather than userId=4's defaultAccount,
         | accountId=5.
         | 
         | With prefixed ids that indicate type, a function that assumes
         | type differently from the one supplied will not accidentally
         | succeed.
         | 
         | In addition, humans who copy ids will be less likely to mistake
         | them. This is just an ergonomic/human centric typing.
        
         | dloreto wrote:
         | That's how the type is encoded as a string, but type-safety
         | ultimately comes from how the TypeID libraries allow you to
         | validate that the type is correct.
         | 
         | For example, the PostgreSQL implementation of TypeID would
         | let you use a "domain type" to define a typeid subtype. Thus
         | ensuring that the database itself always checks the validity of
         | the type prefix. An example is here:
         | https://github.com/jetpack-io/typeid-sql/blob/main/example/e...
         | 
         | In Go, we're considering making it easy to define a new Go
         | type, that enforces a particular type prefix. If you can do
         | that, then the Go type system would enforce you are passing the
         | correct type of id.
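
One possible shape for such a Go type, sketched with generics (Go 1.18+). This is an assumption about how it could look, not the TypeID library's API; `Prefix`, `ID`, `Parse`, `User`, `Post` and `LoadUser` are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is implemented by marker types that name an entity kind.
type Prefix interface{ Prefix() string }

// User and Post are marker types; they carry no data.
type User struct{}

func (User) Prefix() string { return "user" }

type Post struct{}

func (Post) Prefix() string { return "post" }

// ID is parameterised by a marker type, so ID[User] and ID[Post] are
// distinct Go types even though both wrap a string suffix.
type ID[P Prefix] struct{ suffix string }

// Parse checks the prefix at runtime and returns a statically typed ID.
func Parse[P Prefix](s string) (ID[P], error) {
	var marker P
	want := marker.Prefix() + "_"
	if !strings.HasPrefix(s, want) || len(s) == len(want) {
		return ID[P]{}, fmt.Errorf("want prefix %q, got %q", want, s)
	}
	return ID[P]{suffix: s[len(want):]}, nil
}

// String renders the ID back to its "prefix_suffix" form.
func (id ID[P]) String() string {
	var marker P
	return marker.Prefix() + "_" + id.suffix
}

// LoadUser only accepts user IDs; passing an ID[Post] does not compile.
func LoadUser(id ID[User]) string { return "loading " + id.String() }

func main() {
	u, err := Parse[User]("user_01h2xcejqtf2nbrexx3vqjhp41")
	if err != nil {
		panic(err)
	}
	fmt.Println(LoadUser(u))

	_, err = Parse[Post]("user_01h2xcejqtf2nbrexx3vqjhp41")
	fmt.Println("parsing as post:", err)
}
```

The prefix is validated once at the boundary (`Parse`), and from there on the compiler keeps user IDs and post IDs apart.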
        
       | wood_spirit wrote:
       | UUIDv7 has been taking HN by storm for years now! When is it
       | going to become a proper standard, and when are libraries and
       | databases and all the rest going to natively support it?
        
         | vbezhenar wrote:
         | What kind of support do you expect? I'm pretty sure that
         | absolute majority of software does not care about any
         | particular bits in UUID, so you can use it today. If some
         | software cared about any particular bits, just imitate UUIDv4,
         | I mean those bits could be randomly generated as well. If you
         | need generation procedure, write it yourself, it's easy.
        
         | Daegalus wrote:
         | It's been going through drafts and improvements. It's very close
         | to being standardized, and many libraries are supporting it
         | already, or new offerings are being added. For example I
         | maintain the Dart UUID library, and my latest beta major
         | release has v6, v7 and a custom v8. There is a list of them
         | somewhere, I know I get pinged on every new draft by the
         | authors because I am listed as a library maintainer on one of
         | their pages.
        
           | Nelkins wrote:
           | How much does it change between drafts? Close enough to where
           | I could use it in production?
        
             | Daegalus wrote:
             | Seeing as how it's nearly done, it doesn't change much. It
             | changed more often in the beginning, but it's on its final
             | draft, or near-final draft. I think the IETF plans to
             | make it final soon.
        
         | kijeda wrote:
         | It would appear to be in the final stages of standardization in
         | the IETF: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-
         | rfc4122b...
        
       | TeeWEE wrote:
       | Good, but I don't see a big advantage over UUIDv7. Does anyone
       | have some good ones?
        
         | dloreto wrote:
         | It's based on UUIDv7 (in fact, a TypeID can be decoded into an
         | UUIDv7). The main reasons to use TypeID over "raw" UUIDv7 are:
         | 1) For the type safety, and 2) for the more compact string
         | encoding.
         | 
         | If you don't need either of those, then UUIDv7 is the right
         | choice.
        
       | wg0 wrote:
       | Can anyone guide me about the pros and cons of xid, ksuid and
       | this type-safe option?
        
       | timf wrote:
       | I do a similar thing [1]. One of the great advantages to formally
       | namespaced IDs is including a systematic conversion into strong
       | types in your code. It's harder to accidentally mix things up
       | when coding; function parameters and return tuples are more 'self
       | documented' (and enforced by compiler where applicable).
       | 
       | [1] - https://www.peakscale.com/strongly-typed-ids/
        
       | bombela wrote:
       | I have some complaints about UUIDs. Why not just combine time +
       | random number without the ceremony of UUID versioning. And for
       | when locality doesn't matter, just use a 128bit random number
       | directly.
       | 
       | And in my experience most people somehow think a UUID must be
       | stored into the human friendly hex representation, dashes
       | included. Wasting so much space in database, network, memory.
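
To put numbers on the space argument, this sketch compares the raw 16 bytes with a few text encodings. The base32 here is Go's standard RFC 4648 alphabet, used only to show the length; it is not claimed to match the alphabet any particular ID format uses.

```go
package main

import (
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// HexWithDashes renders the familiar 8-4-4-4-12 textual form.
func HexWithDashes(u [16]byte) string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// Base32NoPad encodes the 16 bytes with the standard alphabet, no padding.
func Base32NoPad(u [16]byte) string {
	return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(u[:])
}

// Base64NoPad encodes the 16 bytes with the URL-safe alphabet, no padding.
func Base64NoPad(u [16]byte) string {
	return base64.RawURLEncoding.EncodeToString(u[:])
}

func main() {
	u := [16]byte{0x01, 0x89, 0x0a, 0x5d, 0xac, 0x96, 0x77, 0x4b,
		0xbc, 0xce, 0xb3, 0x02, 0x09, 0x9a, 0x80, 0x57}
	fmt.Println("raw bytes:     ", len(u))
	fmt.Println("hex + dashes:  ", len(HexWithDashes(u)), HexWithDashes(u))
	fmt.Println("base32, no pad:", len(Base32NoPad(u)), Base32NoPad(u))
	fmt.Println("base64, no pad:", len(Base64NoPad(u)), Base64NoPad(u))
}
```

Stored as text, the dashed hex form takes 36 bytes against 16 for the binary value.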
        
         | rjh29 wrote:
         | Many people had the same idea. For example ULID
         | https://github.com/ulid/spec is more compact and stores the
         | time so it is lexically ordered.
        
         | jerf wrote:
         | While this isn't the worst area I see this in, there does seem
         | to be a tendency in the UUID space to speak as if one use case
         | stands for all and therefore there is _a_ best UUID format.
         | 
         | The reality is that it is just like any other engineering
         | situation. Sit down, write down your requirements, and see
         | what, if anything, solves it.
         | 
         | Reading about the advantages of various formats is very helpful
         | in helping you skip learning about certain things the hard way
         | and use somebody else's experience of learning them the hard
         | way instead. From that point of view I recommend at least
         | glancing through them all. Sortability and time-based locality
         | is one that you may not naturally think about, and if you need
         | it, you will appreciate not learning that the hard way four
         | years into a project after you threw that data away and then
         | realizing you needed it. And some UUID formats actually managed
         | to introduce small security issues into themselves (thinking
         | MAC address leak from UUID v1 here), nice to avoid those too.
         | 
         | If you have a use case where there's an existing solution then,
         | hey, great, go ahead and use it. Maybe if anyone ever needs
         | that but in another language they can pull a library there too.
         | 
         | But if not, don't sweat it. The biggest use of UUIDs I
         | _personally_ have I specified as "just send me a unique
         | string, use a UUID library of your choice if it makes you feel
         | better". I think I've got a unique format per source of data in
         | this system and it's fine. I don't have volume problems, it's
         | tens of thousands of things per day. I don't have any need to
         | sort on the UUID, they're not really the "identifier", they're
         | just a unique token generated for a particular message by the
         | originator of the message so we can detect duplicate arrivals
         | downstream in a heterogenous system where I can't just defer
         | that task to the queue itself since we have multiple. I don't
         | even need them to be _globally_ unique, I just need them unique
         | within a rather small shard, and in principle I wouldn't even
         | mind if they get repeated after a certain amount of time
         | (though I left the system enforcing across all time anyhow for
         | simplicity). In this particular case, I do indeed generate my
         | own UUIDs for the stuff I'm originating by just grabbing some
         | stuff from /dev/urandom and encoding it base64, with a size
         | selected such that base64 doesn't end the encoding with ==.
         | Even that's just for aesthetics' sake rather than any actual
         | problem it would cause.
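A small sketch of the token scheme described at the end of the comment above: random bytes from the operating system, base64-encoded, with a byte count chosen so the output has no `=` padding. The 18-byte default and the name `unique_token` are assumptions; the comment does not state the size actually used.

```python
import base64
import os


def unique_token(num_bytes: int = 18) -> str:
    """Random token from the OS CSPRNG, base64-encoded.

    base64 maps every 3 input bytes to 4 characters, so a length that
    is a multiple of 3 never produces '=' padding.
    """
    if num_bytes % 3 != 0:
        raise ValueError("num_bytes must be a multiple of 3 to avoid padding")
    return base64.b64encode(os.urandom(num_bytes)).decode("ascii")


token = unique_token()
print(token, len(token))  # 24 characters for 18 bytes, no trailing '='
```

For comparison, 16 random bytes (the size of a UUID) encode to 24 characters ending in `==`, which is the padding the comment is avoiding.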
        
         | stronglikedan wrote:
         | > combining time + random number
         | 
         | You can't guarantee that this will be _globally_ unique.
        
           | ceejayoz wrote:
           | No identifier can _guarantee_ that. We just get close enough
           | to be acceptable.
           | 
           | Per Wikipedia, the probability to find a duplicate within 103
           | trillion version-4 UUIDs is one in a billion.
           | 
           | so-youre-saying-theres-a-chance.gif
        
             | [deleted]
        
             | duped wrote:
             | A billion is not that big of a number for UUIDs
        
               | ceejayoz wrote:
               | Re-read. You'd have to generate 103 trillion to have a
               | one billion _th_ chance of a collision.
               | 
               | A billion isn't that big a number, but 103 trillion is.
        
               | jandrewrogers wrote:
                 | I think you made a mistake in your math. The Birthday
                 | Collision probability of just a trillion random UUIDs is
                 | much higher than that.
        
               | ceejayoz wrote:
                 | Feel free to update
                 | https://en.wikipedia.org/wiki/Universally_unique_identifier#...,
                 | but it does note "This probability can be computed
                 | precisely based on analysis of the birthday problem". It
                 | does show the formula used.
        
               | deathanatos wrote:
               | Wikipedia is correct, AFAICT.
               | 
                 | The probability of 1 trillion UUIDs having a collision
                 | is,
                 | 
                 |     def birthday_collision(n, m):
                 |         return 1 - math.e ** (-((n -1) * n) / (2 * m))
                 | 
                 |     In : birthday_collision(1_000_000_000_000, 2 ** 122)
                 |     Out: 9.403589018575076e-14
               | 
               | That number is roughly the approximation given in
               | Wikipedia.
               | 
                 | I.e., at 1T UUIDs, it hasn't happened. For comparison,
                 | the odds of being struck by lightning (over a lifetime)
                 | are many orders of magnitude greater:
                 | 6.535947712418301e-05
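The same approximation can be applied to the Wikipedia figure quoted earlier in the thread (103 trillion version-4 UUIDs, one in a billion). This is a sketch, not the commenter's code: it uses `math.expm1`, which is a swapped-in detail to keep precision when the exponent is tiny, and it assumes 122 random bits per version-4 UUID as in the snippet above.

```python
import math


def birthday_collision(n: int, m: int) -> float:
    """Approximate probability of at least one collision among n values
    drawn uniformly from m possibilities: 1 - exp(-n(n-1) / (2m))."""
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-(n * (n - 1)) / (2 * m))


# A version-4 UUID has 122 random bits.
p = birthday_collision(103 * 10**12, 2**122)
print(p)  # on the order of 1e-9, i.e. about one in a billion
```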
        
             | jandrewrogers wrote:
             | I have single datasets with trillions of UUIDs. Collision
             | probability becomes a thing.
             | 
             | That aside, UUIDv4 is banned in many orgs because there
             | have been several instances in the wild where the "random"
             | number wasn't nearly as random as advertised from some
             | sources for a variety of reasons, leading to collisions. It
             | is relatively easy to screw this up so many orgs don't risk
             | it.
        
       | aartav wrote:
       | I've been doing this kind of thing for years with two notable
       | differences:
       | 
       | 1. I don't believe people actually hand type-in these values, so
       | I'm not really concerned about the 'l' vs '1' issue. I do base 32
       | without `eiou` (vowels) to reduce the likelihood of words
       | (profanity) sneaking in.
       | 
       | 2. I add two base-32 characters as a checksum (salted of course).
       | This prevents having to go look at the datastore when the
       | value is bogus either by accident or malice. I'm unsure why other
       | implementations don't do this.
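A hedged sketch of the two points above: a base-32 alphabet with the vowels `e`, `i`, `o`, `u` removed, and a two-character salted checksum appended to each ID. The use of HMAC-SHA256, the salt value, and all function names are assumptions for illustration; the comment does not say how the checksum is computed.

```python
import hashlib
import hmac

# 10 digits + 26 letters - the 4 excluded vowels = 32 symbols.
ALPHABET = "0123456789abcdfghjklmnpqrstvwxyz"
SECRET_SALT = b"replace-with-a-real-secret"  # hypothetical; keep server-side


def checksum(body: str) -> str:
    """Two base-32 characters (10 bits) derived from a keyed hash."""
    digest = hmac.new(SECRET_SALT, body.encode("utf-8"), hashlib.sha256).digest()
    bits = int.from_bytes(digest[:2], "big") >> 6  # keep the top 10 bits
    return ALPHABET[bits >> 5] + ALPHABET[bits & 31]


def add_checksum(body: str) -> str:
    """Append the two checksum characters to an ID body."""
    return body + checksum(body)


def is_plausible(identifier: str) -> bool:
    """Reject malformed or tampered IDs without touching the datastore."""
    body, tail = identifier[:-2], identifier[-2:]
    if not body:
        return False
    return hmac.compare_digest(checksum(body).encode(), tail.encode("utf-8"))


tagged = add_checksum("4k9x7w2m")
print(tagged, is_plausible(tagged))
```

With 10 checksum bits, a random bogus ID still passes about once in 1024 attempts, so this filters noise cheaply rather than replacing the datastore lookup.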
        
         | dloreto wrote:
         | The checksum idea is interesting. I'm considering whether it
         | makes sense to add it as part of the TypeID spec.
        
         | sokoloff wrote:
         | > base 32 without `eiou` (vowels) to reduce the likelihood of
         | words (profanity) sneaking in.
         | 
         | We had "analrita" as an autogenerated password that resulted in
         | a complaint many years ago. Might consider adding 'a' as an
         | excluded letter.
        
           | michaelt wrote:
           | Presumably base 32 means 26 letters + 10 digits - 4 banned
           | letters
           | 
           | So adding an excluded letter is not easy.
        
         | [deleted]
        
         | zrail wrote:
         | I implemented number two as part of an encoding scheme a few
         | months ago. I'm not sure how much it's saved in terms of
         | database lookups but it's aesthetically pleasing to know it
         | won't hit a more inscrutable error while trying to decode.
        
       | kortex wrote:
       | Does the prefix ("user_") get recorded in the DB (so every string
       | in the column starts with the same "user_"), or are there
       | constraints and other clever chicanery to save those bytes in
       | every record? Or do modern DB engines even care? Is this
       | premature optimization?
        
         | carlsverre wrote:
         | The authors have created a specialisation for Postgres that
         | leverages a custom type which is a tuple of type and uuidv7:
         | https://github.com/jetpack-io/typeid-sql/blob/main/sql/typei...
         | 
         | This is more optimal for Postgres while making it slightly more
         | difficult to interop between the db and the language (db driver
         | needs to handle custom types, and you need to inject a custom
         | type converter).
         | 
         | And while there are hacks you can do to make storing uuid-
         | alikes as strings less terrible for db engines, if you want the
         | best performance and smallest space consumption (compressed or
         | not) make sure to use native ID types or convert to
         | BINARY/numeric types.
        
       | jszymborski wrote:
       | This is very similar to how I generate IDs in a project I'm
       | working on.
       | 
       | Example:
       | 
       |     |-A-|-|------------B--------------|
       |     NMSPC-9TWN1-HR7SV-MTX00-0H8VP-YCCJZ
       | 
       |     A = Namespace, padded to 5 chars. Max 5 chars. Uppercase.
       |     B = Blake3 hashed microtime with a random key.
       | 
       | I like how it folds in a time component but that it also doesn't
       | reveal the time it was generated.
       | 
       | Here's the snippet:
       | https://gist.github.com/jszym/d3c7907b7b6e916f68205c99e5e489...
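A rough standard-library sketch of the layout described above, not the code from the linked gist. It swaps in keyed BLAKE2b from `hashlib` because BLAKE3 is not in the Python standard library. The Crockford-style alphabet, the `0` padding character, and the name `make_id` are also assumptions.

```python
import hashlib
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
KEY = os.urandom(32)  # random key; how it is stored or rotated is out of scope


def make_id(namespace: str) -> str:
    """Namespace (max 5 chars, uppercase, padded) + 25 base-32 characters
    of a keyed hash of the current time in microseconds."""
    if not 1 <= len(namespace) <= 5:
        raise ValueError("namespace must be 1 to 5 characters")
    prefix = namespace.upper().ljust(5, "0")  # '0' padding is an assumption
    micros = (time.time_ns() // 1000).to_bytes(8, "big")
    digest = hashlib.blake2b(micros, key=KEY, digest_size=16).digest()
    value = int.from_bytes(digest, "big") >> 3  # keep 125 bits = 25 symbols
    symbols = [CROCKFORD[(value >> shift) & 31] for shift in range(120, -1, -5)]
    groups = ["".join(symbols[i:i + 5]) for i in range(0, 25, 5)]
    return "-".join([prefix] + groups)


print(make_id("nmspc"))
```

In this sketch, two calls within the same microsecond hash the same input with the same key and so return the same ID; a real implementation would need to account for that.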
        
         | goostavos wrote:
         | Namespacing identifiers in general is a great idea for handling
         | that class of integration tests which cannot be fully
         | isolated. It makes it easy to write all kinds of garbage from
         | even concurrently running tests, all without any of them
         | colliding or accidentally reading each other's writes (because
         | they are themselves namespace aware!). It's low effort to get
         | all the pieces of your system to play along (often entirely
         | transparent via DI), but gives a huge power-to-weight ratio.
         | Basically deletes an entire class of problems which usually
         | plague large, mature test suites.
        
       | ajkjk wrote:
       | Unrelated, but this links to "Crockford's alphabet",
       | https://www.crockford.com/base32.html , which is a base-32 system
       | that includes all alphanumeric characters except I and L (which
       | are confusable with 1), O (which is confusable with 0), and U
       | (????). The page says the reason for excluding U is "accidental
       | obscenity". What the heck is it talking about?
        
         | deanmen wrote:
         | The F word has a U in it. Sure you could just say FVCK
        
           | [deleted]
        
         | jszymborski wrote:
         | FUCK
        
           | Racing0461 wrote:
           | yep, YouTube video IDs have (or had?) the same issue, where
           | they would have things like fag/f4g etc. in them.
           | 
           | eg: google "allinurl:fag site:youtube.com"
        
             | stronglikedan wrote:
             | You can prevent _any_ obscenity, O and 0 confusion, and I
             | and L confusion, just by excluding vowels. If someone
             | interprets  "f4g" in an offensive way, then they have
             | bigger issues than can be dealt with in software.
        
               | arcticbull wrote:
               | There was that time Delta generated an "H8GAYS" PNR. [1]
               | Pretty sure that's valid Crockford encoding too :)
               | however, to your point, it does rely on 'A'. "H8G4YS"
               | would likely still offend someone out there, though,
               | given the kerfuffle in [1].
               | 
                 | [1]
                 | https://newsfeed.time.com/2013/12/17/delta-airlines-is-very-...
        
               | kortex wrote:
               | > But as White points out, it's a bit surprising that
               | Delta didn't block this particular combination as a
               | possibility. "I'm sure they removed many four-letter
               | words that would be seen as offensive," he tells the
               | Post. "I'm surprised that 'gays' and 'H8' weren't blocked
               | as well."
               | 
               | Oh sweet summer child (meaning Jeff White, not OP/GP). As
               | someone who has implemented a censorship/filtering list,
                 | this is a UX problem on the same level of decidability
               | as the halting problem. You can spend boundless time
               | curating a list to flag/grawlix every possible string
               | that would offend even the most prudish of prudes, and
               | some would still get through. Such as the superficially
               | benign "EATTHE"
               | 
                 | https://www.dailymail.co.uk/news/article-2039662/Virginia-dr...
        
               | MR4D wrote:
               | Obscenities change with language. That's why every
               | language has them.
               | 
                 | Even programming languages have them. For instance, Basic
                 | has GOTO.
               | 
               | /j
        
               | ZeroClickOk wrote:
               | and javascript has type coercion
        
               | oleganza wrote:
               | The problem is not being offended per se, but having your
               | user id accidentally become "user_123fuck567" -- that's
               | akin to having a vulgar license plate on your car's
               | forehead. People don't appreciate how lucky they
               | sometimes are.
        
               | taosx wrote:
                 | Why do we care about obscenity in pseudo-random IDs and
                 | URLs?
        
               | whimsicalism wrote:
               | anglo morals
        
         | programmarchy wrote:
         | Yeah, wtf?
        
         | codeulike wrote:
         | If I and O are already excluded and you also exclude U that
         | removes a lot of potential rude looking three letter
         | combinations like *** and *** and *** and also the four letter
         | ones like **** and **** and the dreaded ****. Of course because
         | you have A then **** is still a possibility but very very
         | unlikely
        
           | titanomachy wrote:
           | Wow I didn't know HN even had obscenity filters, and I've
           | been here for many years.
           | 
           | Guess that's a credit to the general civility of the
           | community.
           | 
           | EDIT: It appears that other people in this thread are freely
           | using profanity, so either your comment was targeted by
           | automation due to the unusual density of banned words, or
           | it's a joke that went over my head :)
        
             | [deleted]
        
             | [deleted]
        
             | rbera wrote:
             | That explains it, I was very confused by what I assumed was
             | self-censoring, since the comment didn't actually clarify
             | anything. I wish there was an accepted way to disambiguate
             | asterisks from server side filters.
        
               | macintux wrote:
               | I assumed this was a riff on the classic bash.org
               | transcript.
               | 
               | http://www.bash.org/?244321
        
             | taberiand wrote:
             | No obscenity filters, but there is a pretty good password
             | filter I hear. For example, my password 'hunter2' will be
             | all **** to you
        
               | 9dev wrote:
               | Isn't it nice how some traditions do stick around. It's
               | been a while, Cthon98!
        
           | AceJohnny2 wrote:
           | you accidentally the whole thing
        
         | hinkley wrote:
         | A coworker and I came up with basically this same set about 4
         | years before Crockford. We were trying to solve the url slug
         | problem, and they were long enough that we felt 5 bits per byte
         | would reduce transcription annoyances.
         | 
         | In the end I think we had a couple of characters to spare, and
         | so, sitting by ourselves because everyone else had gone home
         | for the day, we ranked swear words by how offensive they were
         | to prioritize removal of a few extra letters. Then I convinced
         | him that slurs were a bigger problem so we focused on that,
         | which got rid of the letter n, instead of u
         | 
         | tggr is just cute, n**r is an uncomfortable conversation with
         | multiple HR teams (we were B2B)
         | 
         | I'm a bit fuzzy now on what our ultimate character set was,
         | because typically you're talking [a-z][0-9], and there are a lot
         | of symbols you can't use in urls and some that are difficult to
         | dictate. My recollection is that we eliminated 0, l, and
         | 1, but I think we relied on transcription happening either from
         | all caps or all lowercase. 0o are not a problem. Nor are 1L.
        
           | hinkley wrote:
           | Other comments are jogging my memory. I think we went case
           | sensitive (62 characters -> 30 spares), eliminated aA4, eE3,
           | iI1l oO0 (maybe Q), uU, which is 16 characters, 14 to go.
           | Remove the remaining 7 numbers (once you remove most for
           | leetspeak what's the point of the rest?), nN, yY. That leaves
           | 2 left and I can't recall what we did with those. Maybe kK or
           | rR.
           | 
           | Y is pretty versatile for pissing people off.
        
         | avgcorrection wrote:
         | > The page says the reason for excluding U is "accidental
         | obscenity". What the heck is it talking about?
         | 
         | Because he's an American?
        
         | Zamicol wrote:
         | There's more!
         | 
         | - base 58 - Satoshi's/Bitcoin's
         | https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58
         | 
         | - "base62" - Keybase's saltpack
         | https://github.com/keybase/saltpack
         | 
         | - The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85
         | 
         | - basE91 - https://base91.sourceforge.net
         | 
         | At work we defined several new "bases" for QR code. IMHO, it is
         | an under applied area of computer science.
        
         | pavlov wrote:
         | True Latinists find the letter U vulgar to the point of
         | obscenity because it didn't exist in Cicero's time.
        
           | oleganza wrote:
           | Trve Latinists wovld appreciate yovr point.
        
             | littlestymaar wrote:
             | Gotcha, there was no "W" in the Latin alphabet either ;)
        
         | kibwen wrote:
         | _> The page says the reason for excluding U is "accidental
         | obscenity"._
         | 
         | Crockford is being cheeky. To make a nice base32 alphabet out
         | of non-confusable alphanumeric characters you only need to
         | exclude O, I, and L. This leaves you with 33 characters still,
         | so you need to remove one more, and it doesn't matter which one
         | you remove, so you might as well pick an arbitrary reason for
         | the last character that gets removed (and it's not the worst
         | reason, if your goal is to use these as user-readable IDs,
         | although obviously it's not even remotely bulletproof).
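The counting argument above can be checked in a few lines of Python. This is only an illustration of the arithmetic (36 alphanumerics, minus three confusable letters, minus one more); the variable names are made up for the example.

```python
import string

candidates = string.digits + string.ascii_uppercase  # 36 symbols
confusable = set("ILO")   # I and L look like 1, O looks like 0
extra = set("U")          # the one additional exclusion

after_confusables = [c for c in candidates if c not in confusable]
alphabet = "".join(c for c in after_confusables if c not in extra)

print(len(after_confusables))  # 33
print(alphabet)                # 0123456789ABCDEFGHJKMNPQRSTVWXYZ
print(len(alphabet))           # 32
```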
        
           | pluijzer wrote:
           | You could argue that U can be confused with V.
        
             | dmurray wrote:
             | 5 and S seems more likely.
        
             | mtlmtlmtlmtl wrote:
             | A vaguely related historical tangent is that V and U used
             | to be just two ways of writing the same letter in Early
             | Modern English. Which I imagine is why W is named as
             | "double U" in speaking.
        
               | jabbany wrote:
               | This is also interesting since in French (and I think
               | Spanish?) W is (correctly) called "double V"
        
             | quickthrower2 wrote:
             | Fvck!
        
           | quickthrower2 wrote:
           | U is a fairly new letter anyway.
        
         | theptip wrote:
         | The Scunthorpe problem?
         | 
         | https://en.m.wikipedia.org/wiki/Scunthorpe_problem
        
           | pizzapill wrote:
           | E-Mail accounts seem the worst. Let's just write letters
           | again; if you need a pencil I recommend penisland.net
        
           | programmarchy wrote:
           | There's enough comedic content in this article for several
           | Silicon Valley episodes.
        
       | eezing wrote:
       | "...can be selected for copy-pasting by double-clicking"
       | 
       | Details matter.
        
       | koito17 wrote:
       | How does this compare to a SQUUID for sorting or nano-id for
       | human readability? Both are options I've used in the past when
       | using databases like Datomic or XTDB. SQUUIDs in particular
       | because I have a UUID that can be ordered by timestamp, nano-id
       | when prototyping things and I want meaningful prefixes in my
       | entity IDs rather than a bunch of UUIDs.
        
       | jtmarmon wrote:
       | This looks great! Is there a reason one couldn't use this with v4
       | UUIDs? A quick test shows that they encode/decode just fine.
       | Wondering if I could use the encoded form as a way to niceify our
       | URLs without having to change how the IDs (currently v4 uuids)
       | are stored
        
         | dloreto wrote:
         | The CLI tool will support encoding/decoding any valid UUID,
         | whether v1, v4, or v7. We picked v7 as the definition of the
         | spec, because we need to choose one of them when generating a
         | new random ID, and our opinion is that by default, that should
         | be v7.
         | 
         | We might add a warning in the future if you decode/encode
         | something that is not v7, but if it suits your use-case to
         | encode UUIDv4 in this way, go for it. Just keep in mind that
         | you'll lose the locality property.
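A sketch of the general idea of rendering any UUID, including a v4, as a prefixed base-32 string and reading it back. It assumes a suffix of 26 lowercase Crockford base-32 characters for the 128-bit value; this is not copied from the TypeID spec or its libraries, so check the spec before relying on exact output. `encode` and `decode` are hypothetical names.

```python
import uuid
from typing import Tuple

ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"  # lowercase Crockford base32


def encode(prefix: str, value: uuid.UUID) -> str:
    """Render prefix + '_' + 26 base-32 characters (130 bits, top 2 zero)."""
    n = value.int
    suffix = "".join(ALPHABET[(n >> shift) & 31] for shift in range(125, -1, -5))
    return f"{prefix}_{suffix}"


def decode(text: str) -> Tuple[str, uuid.UUID]:
    """Split at the last underscore and rebuild the 128-bit value."""
    prefix, _, suffix = text.rpartition("_")
    if len(suffix) != 26:
        raise ValueError("suffix must be 26 characters")
    n = 0
    for ch in suffix:
        n = (n << 5) | ALPHABET.index(ch)  # ValueError on invalid characters
    if n >> 128:
        raise ValueError("suffix overflows 128 bits")
    return prefix, uuid.UUID(int=n)


original = uuid.uuid4()
rendered = encode("user", original)
print(rendered)
print(decode(rendered) == ("user", original))
```

The stored value can stay a plain v4 UUID; only the rendered form changes. As the comment notes, v4 values carry no timestamp, so the encoded strings will not sort by creation time.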
        
       ___________________________________________________________________
       (page generated 2023-06-28 23:00 UTC)