[HN Gopher] Type-safe, K-sortable, globally unique identifier in...
___________________________________________________________________
Type-safe, K-sortable, globally unique identifier inspired by
Stripe IDs
Author : dloreto
Score : 223 points
Date : 2023-06-28 16:28 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| klabb3 wrote:
| A couple of suggestions:
|
| Lock down the prefix string _now_ before it's too late and
| document it. I see in Go that it's lowercase ascii, which seems
| fine except for compound types (like "article-comment"). May be
| worth looking at allowing a single separator given that many
| complex projects (and ORMs) can't avoid them.
|
| The Go implementation has no tests. This is very unit-testable.
| Add tests goddammit!
|
| For Go, I'd align with Google's UUID implementation, with proper
| parse functions and an internal byte array instead of strings.
| Strings are for rendering (and in your case, the prefix). Right
| now, it looks like the parsing is too permissive, and goes into
| generation mode if the suffix is empty. And the SplitN+index
| thing will panic if no underscores, no? Anyway, tests will tell.
|
| As for the actual design decisions, I tried to poke holes but I
| fold! I think this strikes the sweet spot between the different
| tradeoffs. Well done!
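|
| To make the "too permissive" point concrete, here is a minimal
| sketch of a strict parser. The exact grammar is an assumption on
| my part (lowercase ASCII prefix of 1 to 63 characters, one "_",
| then a 26-character lowercase base32 suffix); `parse_typeid` is a
| hypothetical name, not the library's API.

```python
import re

# Assumed grammar, not taken from the official spec:
#   prefix: 1-63 lowercase ASCII letters
#   suffix: 26 chars of lowercase Crockford base32 (no i, l, o, u)
_TYPEID_RE = re.compile(r"([a-z]{1,63})_([0-9a-hjkmnp-tv-z]{26})")


def parse_typeid(text: str) -> tuple[str, str]:
    """Return (prefix, suffix), or raise ValueError.

    Parsing never falls back to generating a new ID: a missing
    underscore, an empty prefix, or an empty suffix is an error.
    """
    match = _TYPEID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid typeid: {text!r}")
    return match.group(1), match.group(2)
```

| Inputs such as "user" (no underscore) or "user_" (empty suffix)
| raise ValueError here instead of panicking or silently minting
| a new ID.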
| tomcam wrote:
| > Add tests goddammit!
|
| Hey, you're pretty smart. How about you add them?
| klabb3 wrote:
| I'm by no means a test police. I'm in fact opposed to a lot
| of mindless testing for the sake of it. But there are places
| where unit tests shine, and this is one of them.
|
| If you mean that criticism is only allowed if you are willing
| to commit labor, I disagree with that. I always welcome
| critique myself - it may be something that I've missed. The
| maintainers always have the last word. As long as there are no
| hidden expectations, it's all good.
| avgcorrection wrote:
| > The Go implementation has no tests. This is very unit-
| testable. Add tests goddammit!
|
| Yep. The readme asks people to provide other implementations.
| Having a test suite would be good for third-party code.
| dloreto wrote:
| Thanks for the feedback!
|
| We have tests for the base32 encoding which is the most
| complicated part of the implementation
| (https://github.com/jetpack-io/typeid-
| go/blob/main/base32/bas...) but your point stands. We'll add a
| more rigorous test suite (particularly as the number of
| implementations across different languages grows, and we want
| to make sure all the implementations are compatible with each
| other)
|
| Re: prefix, is the concern that I haven't defined the allowed
| character set as part of the spec?
| ajanuary wrote:
| Neat, I like the type safe prefix idea.
|
| Personally, I rarely find I need ids to be sortable, so I just go
| with pure randomness.
|
| I also like to split out a part of the random section into a tag
| that is easier for me to visually scan and match up ids in logs
| etc.
|
| I call my ID format afids [0]
|
| [0] https://github.com/aJanuary/afid
| AtNightWeCode wrote:
| An important aspect of identifiers is to not leak any information
| in the identifier. In some scenarios a prefix might be fine but
| less important things have been blocked by our dpo department.
| clintonb wrote:
| Requirements depend on the use case. I don't consider the
| prefix a "leak" and neither does Stripe.
| jeremyjh wrote:
| Yeah, this idea makes me uneasy as well for the same reason.
| The natural conclusion would be to have DB frameworks
| automatically do this based upon code type names and that
| definitely feels like a leak, handing the world at large a
| roadmap to your internal system architecture.
| ukuina wrote:
| Why is it beneficial to sort random IDs?
| davidjfelix wrote:
| So you don't have to create an additional index to scan them in
| a somewhat sensible order. createdAt just happens to be a
| naturally decent scan order.
| lll-o-lll wrote:
| K-Sortable is a great concept; having weakly sorted keys solves a
| bunch of use-cases. I really like the idea of a typed, condensed
| string representation. However I wonder if an unintended side
| effect of UUID V7 is going to be a bunch of security problems.
|
| People aren't meant to use uuids as tokens, and they aren't
| supposed to use PKs from a DB for this either - but they do.
| Because UUID v4 is basically crypto random, I think we've been
| getting away with a bunch of security weaknesses that would
| otherwise be exploited.
|
| With UUID v7 we are back to 32bits of actually random data. It's
| going to require some good educating to teach devs that uuids are
| _guessable_.
|
| [edit] Looks like I am off base with the guess-ability of the V7
| UUID, as the draft recommends CSPRNG for the random bits, and the
| amount of entropy is at least 74 bits and it is specifically
| designed to be "unguessable". It does say "UUID v4" for anything
| security related, but perhaps that is simply in regard to the
| time stamp?
| pphysch wrote:
| I implemented something similar recently, but opted to write my
| own UUIDv7 that uses the last 16 flexible bits for a "type ID".
| That allows 65K different data models, which should be more than
| enough. Could even partition that further to store a node ID for
| a globally distributed setup.
|
| So it's got all of the above perks, but it's also an actual UUID
| and fits in a Postgres UUID column.
|
| It's very cool to be able to resolve a UUID to a particular
| database table and record with almost zero performance overhead
| (cached table lookup + indexed record select).
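|
| A rough sketch of the idea, not my production code. It assumes
| the draft UUIDv7 layout (48-bit millisecond timestamp, 4-bit
| version, 12 random bits, 2-bit variant, 62 trailing bits) and
| carves the lowest 16 of those trailing bits out for the type ID.
| `uuid7_with_type` and `type_of` are made-up names.

```python
import os
import time
import uuid


def uuid7_with_type(type_id: int) -> uuid.UUID:
    """Build a UUIDv7-shaped value whose lowest 16 bits hold a type ID."""
    if not 0 <= type_id < (1 << 16):
        raise ValueError("type_id must fit in 16 bits")
    unix_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(6), "big") & ((1 << 46) - 1)  # 46 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= rand_a << 64                      # bits 75..64: random
    value |= 0b10 << 62                        # bits 63..62: RFC 4122 variant
    value |= rand_b << 16                      # bits 61..16: random
    value |= type_id                           # bits 15..0: application type ID
    return uuid.UUID(int=value)


def type_of(u: uuid.UUID) -> int:
    """Recover the type ID from the lowest 16 bits."""
    return u.int & 0xFFFF
```

| The result still fits a Postgres UUID column, and the type lookup
| is a single mask on the integer value.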
| Daegalus wrote:
| Ideally, you should version that as UUIDv8, since that is a
| more custom implementation, but I guess replacing the random
| bits with type info works fine for UUIDv7; they just aren't
| random anymore.
| pphysch wrote:
| Not every bit of UUID is required to be random.
|
| The goals are smallness, uniqueness, monotonicity, resistance
| to enumeration attacks, etc. Not randomness for randomness
| sake.
|
| My UUIDv7+ can be consumed as a standard UUIDv7. It is not
| intended to be v8. A program can treat the last 16 bits as
| random noise if it wants.
| jeremyjh wrote:
| Well UUIDv7 can be consumed as a UUIDv4 in the same way,
| it's just 16 bytes. The point of the standard is to define
| _how_ the particular bytes are chosen.
| pphysch wrote:
| The latest standard for v7 does not meaningfully describe
| how to interpret the last segment.
|
| It says they could be pseudorandom and non-monotonic. Or
| it could be monotonic and non-random. These are
| completely disjoint cases! "X or not X" is tautological.
| And there is no way to determine which (e.g. there
| _could_ be a flag that indicates this mode, but there is
| not).
|
| To be clear, the standard should be amended to resolve
| this ambiguity. Say the last bits MAY be monotonic or MAY
| be pseudorandom. Or add a flag that indicates which.
|
| As there is currently no standard way to interpret these
| bits, I feel perfectly justified in using a few of
| the least significant ones to encode additional
| information.
| jeremyjh wrote:
| I think the purpose of the standard is so that different
| software implementations work the same way, so that once
| you've picked a standard, you can use it everywhere and
| know that keys are assigned the same way regardless of
| which software stack is generating a particular key. It's
| not so that systems can "interpret" it. Obviously they
| are your bytes to use however you want if you are rolling
| your own generator.
| pphysch wrote:
| > you can use it everywhere and know that keys are
| assigned the same way regardless of which software stack
| is generating a particular key.
|
| But even if you follow the standard to a tee, you cannot
| infer anything about how the last 62 bits were assigned.
| That is my point!
| dralley wrote:
| Postgresql doesn't care, it's not going to "interpret"
| those bits, it is just a 128-bit integer.
| pphysch wrote:
| And I'm glad for it, because I could implement this
| without needing an extension or update to PostgreSQL.
| hfkwer wrote:
| > It is not intended to be v8.
|
| There is already a UUIDv8. It's defined as vendor-specific
| UUID. https://www.ietf.org/archive/id/draft-peabody-
| dispatch-new-u...
|
| > Some example situations in which UUIDv8 usage could
| occur:
|
| > An implementation would like to embed extra information
| within the UUID other than what is defined in this
| document.
|
| Isn't that exactly what you are doing?
| Daegalus wrote:
| I am aware, just saying per spec, it's supposed to be random
| bit data, that's all I was saying. I am familiar with the spec
| since I maintain a UUID library that has 6, 7, and a custom
| 8 implemented.
|
| It can have extra monotonicity data instead, per section
| 6.2, but ideally it's random. Again, not saying you can't do
| what you are doing, I just know from the conversations while
| the draft was gathering feedback, your type of change was
| intended to be done as UUIDv8.
| pphysch wrote:
| > per spec, it's supposed to be random bit data
|
| > It can have extra monotonicity data instead
|
| Well, which is it? These are incompatible requirements.
|
| If I give you a standard UUIDv7 sample, it is impossible
| for you to interpret the last 62 bits. You cannot
| determine how they were generated. If I give you two
| samples with the same timestamp, you cannot say which was
| generated first. These bits are de facto uninterpretable,
| unlike e.g. the 48 MSB, which have clearly defined
| semantics.
| Daegalus wrote:
| Well, that might be an ambiguity that needs to be brought
| up before it's final if it is an issue.
|
| So if we look at https://www.ietf.org/archive/id/draft-
| ietf-uuidrev-rfc4122bi...
|
| For list item #3 it says "Random data for each new UUIDv7
| generated for any remaining space." without the word
| "optional" and the bit layout diagram says `rand_b`
|
| But when you read the description for `rand_b` it says:
| "The final 62 bits of pseudo-random data to provide
| uniqueness as per Section 6.8 and/or an optional counter
| to guarantee additional monotonicity as per Section 6.2."
|
| Reading section 6.2
| https://www.ietf.org/archive/id/draft-ietf-uuidrev-
| rfc4122bi..., it all involves incrementing counters, or
| other monotonic random data.
|
| If you can guarantee that your custom uuidv7 is globally
| unique for 10000 values per second or more, I don't see
| why you can't do what you do and treat your custom data
| as random outside of your implementation.
|
| I think part of this is my mistake, because I assumed you
| replaced most of the random data with information, but
| reading it now, I read that you replaced just the last 16
| bits. Also since most people used random data for
| UUIDv1's remaining 48bits of `node` then your variation
| is no worse than UUIDv1 (or 6) while also being
| compatible with v7.
|
| I think I just got too caught up on the bit layout
| calling it `random` and misread your information. Sorry
| for the misunderstanding, and thanks for discussing it.
| [deleted]
| stephen wrote:
| Neat! Love the "type-safe" prefix; we'd called them "tagged ids"
| in our ORM that auto-prefixes the otherwise-ints-in-the-db with
| similar per-entity tags:
|
| https://joist-orm.io/docs/advanced/tagged-ids
|
| We'd used `:` as our delimiter, but kinda regretting not using
| `_` because of the "double-click to copy/paste" aspect...
|
| In theory it'd be really easy to get Joist to take "uuid columns
| in the db" and turn them into "typeids in the domain model", but
| probably not something that could be configured/done via userland
| atm...that'd be a good idea though.
| wongarsu wrote:
| Reddit does something similar, but optimized for string length:
| elements have ids like "t3_15bfi0" where t3_ is a prefix for
| the type (t3 is a post, t1 a comment, t5 a subreddit, etc) and
| the remaining is a base36 encoding of the autoincrementing
| primary key.
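|
| A small sketch of that scheme, in case it helps. The helper names
| are mine; only the "t3_" prefix and the base36 idea come from
| Reddit's ids.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols: 0-9 then a-z


def to_base36(n: int) -> str:
    """Encode a non-negative integer in lowercase base36."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def tagged_id(kind: str, primary_key: int) -> str:
    """Join a type tag and a base36 primary key, e.g. kind='t3' for a post."""
    return f"{kind}_{to_base36(primary_key)}"


def parse_tagged_id(text: str) -> tuple[str, int]:
    """Split a tagged id back into (kind, primary_key)."""
    kind, _, encoded = text.partition("_")
    return kind, int(encoded, 36)
```

| Python's built-in int(s, 36) does the decoding, so only the
| encoder has to be written by hand.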
| swyx wrote:
| for those researching this topic, I keep a list of these
| UUID/GUID implementations!
|
| https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
| crdrost wrote:
| Thanks for this!
|
| I have one idea which is perhaps nerdy enough to make the list
| but I've never fully fleshed it out, it's that one can encode
| the nonnegative integers {0, 1, 2, ...} into the finite
| bitstrings {0, 1}* in a way which preserves ordering.
|
| So if we use hexits for the encoding the idea would be that
|
|     0 = 0, 1 = 1, ... E = 14, then
|     F00 = 15
|     F01 = 16
|     ...
|     F0F = 30
|     F100 = 31
|     F101 = 32
|     ...
|     F1FF = 286
|     F2000 = ...
|
| so the format is F, which is the overflow sigil, followed by a
| recursive representation of the length of the coming string,
| followed by a string of hexits that long.
|
| What if you need 16 hexits? That's where the recursion comes
| in,
|
|     F F00 0123456789ABCDEF
|     \ \   \
|      \ \   \----- 16 hexits
|       \ \-- the number 15, "there are 15+1 digits to follow"
|        \    (consisting of overflow, 0+1 digits to follow, and hex 0)
|         \--- overflow sigil
|
| Kind of goofy but would allow a bunch of things like "timestamp
| * 1024 + 10-bit machine ID" etc without worrying about the size
| of the numbers involved
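|
| Here is one way the encoder could look, as a sketch of my reading
| of the scheme above rather than a finished design: values 0..14
| are a single hexit, and anything larger is "F", then the encoding
| of (digit count - 1), then the zero-padded digits, with each
| length class starting where the previous one ended.

```python
HEXITS = "0123456789ABCDEF"


def encode(n: int) -> str:
    """Order-preserving, prefix-free encoding of a non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 15:
        return HEXITS[n]          # 0..E stand for themselves
    n -= 15                       # 15 is the first value needing the sigil
    length = 1
    while n >= 16 ** length:      # skip past all shorter length classes
        n -= 16 ** length
        length += 1
    # "F" sigil, recursive length header, then exactly `length` hexits
    return "F" + encode(length - 1) + format(n, "X").zfill(length)


def decode(s: str) -> int:
    """Inverse of encode; assumes s is exactly one well-formed code."""
    value, rest = _decode_prefix(s)
    if rest:
        raise ValueError("trailing characters after code")
    return value


def _decode_prefix(s: str) -> tuple[int, str]:
    if s[0] != "F":
        return HEXITS.index(s[0]), s[1:]
    length_minus_one, rest = _decode_prefix(s[1:])
    length = length_minus_one + 1
    digits, rest = rest[:length], rest[length:]
    offset = 15 + sum(16 ** k for k in range(1, length))
    return offset + int(digits, 16), rest
```

| With this, comparing the encoded strings byte by byte gives the
| same order as comparing the integers, whatever their size.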
| changoplatanero wrote:
| what does k-sortable mean?
| AtNightWeCode wrote:
| That it is a nearly sorted sequence. Typically the first part
| of the ID is a timestamp. So you may sort the IDs down to the
| second it was created. But for IDs created at the same time the
| order is random. This can be used for caching, database
| performance and so on.
| sixtram wrote:
| [dead]
| tomnipotent wrote:
| That if a hundred servers are generating (uuid, timestamp)
| tuples that are subsequently merged on a single machine, and
| sorted by uuid, it would have almost the same order as if
| sorted by timestamp.
|
| This property is useful for RDBMS writes, when the UUID is used
| as a primary key and this locality ensures that fewer slotted
| pages need to be modified to write the same amount of data.
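|
| A toy illustration, under the assumption that the ID is a 48-bit
| big-endian millisecond timestamp followed by random bytes and
| rendered as fixed-width hex (`make_id` is a made-up helper, not
| any particular library's generator):

```python
import os
import random


def make_id(unix_ms: int) -> str:
    """48-bit big-endian millisecond timestamp + 80 random bits, as hex."""
    return (unix_ms.to_bytes(6, "big") + os.urandom(10)).hex()


# Pretend many servers minted IDs at distinct milliseconds and the
# (id, timestamp) tuples arrived at one machine in arbitrary order.
start_ms = 1_700_000_000_000
events = [(make_id(ms), ms) for ms in range(start_ms, start_ms + 100)]
random.shuffle(events)

# Sort on the ID string alone, ignoring the timestamp column.
by_id = sorted(events, key=lambda event: event[0])
timestamps = [ms for _, ms in by_id]

# With one ID per millisecond the timestamps come out fully ordered.
print(timestamps == sorted(timestamps))
```

| IDs minted within the same millisecond would sort by their random
| tail instead, which is where the "almost" comes from.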
| nzgrover wrote:
| Is that what they mean by "used as the primary key in a
| database while ensuring good locality"/"database locality"?
| That read/write access will hit fewer disk pages?
| netcraft wrote:
| yes, exactly
| nickjj wrote:
| > it would have almost the same order as if sorted by
| timestamp.
|
| Is there documentation covering the scenarios on how the
| order can become out of sync and what the odds are? There's a
| big difference between "almost" and "always" if we're talking
| about using this as a database PK.
| michaelt wrote:
| The key seems to be based on UUIDv7, starting with a
| timestamp in milliseconds.
|
| So the order can become out of sync if multiple events
| happen in the same millisecond; or if your servers' clock
| error is greater than a millisecond (i.e. if you're an NTP
| user)
|
| More than sufficient for things like ordering tweets. If
| you're ordering bank account transactions, well, you'd
| probably be using transactions in an ACID-compliant
| relational database.
| tomnipotent wrote:
| > There's a big difference between "almost" and "always"
|
| Not in the context of an RDBMS, which use b+/b*-tree
| variants (or LSM sstables). Sequentially generated UUID's
| will end up near each other when sorted lexicographically,
| regardless of the fact that the sort order doesn't
| perfectly match the timestamp order.
| vikeri wrote:
| Another, less known, useful thing about these IDs is that you can
| double click on them and the full id will always be selected
| avarun wrote:
| It's mentioned in the README. "can be selected for copy-pasting
| by double-clicking"
| Eduard wrote:
| Also, they are safe to use within filenames and directory names
| (filesystem paths) without conversion (at least in today's
| filesystems not limited to e.g. 8.3 characters).
|
| Compare that with otherwise nice ISO 8601 datetime format (e.g.
| 2023-06-28T21:47:59+00:00): it requires conversion for file
| systems that don't allow colons and plus signs.
| jrockway wrote:
| This is a setting in your terminal emulator. For me, plain
| UUIDs are selected just fine when double clicking.
| mojuba wrote:
| There's life outside of the terminal. For example you want to
| double-click on the part of a URL in your browser.
| Xeoncross wrote:
| Assuming you don't need to use UUIDv7 (or any UUID's) then
| https://github.com/segmentio/ksuid provides a much bigger
| keyspace. You could just append a string prefix if you wanted to
| namespace, but the chance of collisions of a ksuid is many times
| smaller than a UUID of any version.
|
| ksuid is the best general purpose id generator with sort-able
| timestamps I've found and has libraries in most languages. UUID
| v1-7 are wasteful.
| rtheunissen wrote:
| I moved to ULID because they are always lowercase and therefore
| case-insensitive.
| tasn wrote:
| We maintain a couple of popular ksuid libraries[1][2] and use
| it, so we definitely like ksuid. Though one big issue with
| ksuid is that being 160bit means that it doesn't fit into
| native uuid types in databases (e.g. postgres), which means
| that they come with a performance penalty.
|
| 1: https://github.com/svix/rust-ksuid 2:
| https://github.com/svix/python-ksuid
| Xeoncross wrote:
| I'm curious, why do you not store these as binary data or do
| you and you're saying that the UUID operations are better
| optimized than sorts on binary data?
| tasn wrote:
| Exactly what the sibling said, and the same applies to
| database operations (when they have a uuid type).
| snuxoll wrote:
| I can compare a 128bit UUID in a single instruction, a
| 160-bit ksuid is a little weirder to work with at the
| hardware level.
| iillexial wrote:
| I didn't get the "type-safe" part. How would it work in Go?
|
| Let's say I have structs:
|
| type User struct { ID TypeID }
|
| type Post struct { ID TypeID }
|
| How can I ensure the correct type is used in each of the structs?
| zeroxfe wrote:
| It's not a language primitive. It's a data format that
| _enables_ type safety in libraries or APIs (as opposed to a
| more opaque data format like UUIDv7.)
| avgcorrection wrote:
| It's stringly-typed type-safety: check if the value has the
| expected prefix.
| kibwen wrote:
| Any time you ever read a string, its type is always just going
| to be "string" (modulo whatever passes for a "string" in your
| programming language of choice). To get an actual non-string
| type, you'd need to parse that string, and presumably your
| parsing function would read the prefix and reject the string if
| it was passed an ID whose type doesn't match. So it's
| dynamically type-safe, if not statically type-safe.
| hfkwer wrote:
| This isn't about object types in any particular language.
| leetbulb wrote:
| One way is to enforce in Marshal[0] and Unmarshal[1]
|
| [0] https://pkg.go.dev/encoding/json#Marshaler
|
| [1] https://pkg.go.dev/encoding/json#Unmarshaler
| atulvi wrote:
| Naive Question: The type safe part is just appending a string at
| the beginning? What if I do that with UUIDv4? is
| user_49b9cd12-9964-4b9c-8512-742f0a2c9be4 type safe now?
| davidjfelix wrote:
| Yep. The whole point is that you /never/ assign ids that begin
| with "user" to types that are not users. Because of that, you
| can be sure nobody can accidentally copy an id that begins with
| "user" when meaning to address a different type and get back a
| result other than "not found".
|
| Example:
|
| I have userId=4 and userId=2. Suppose a user can have multiple
| bank accounts and userId=4 has accountId=5 and accountId=6 and
| a defaultAccount accountId=5. userId=2 has an account,
| accountId=7; I want to send userId=4 some money so I use the
| function `sendUserMoneyFromAccount(to: int, from: int)`. This
| is a bad interface but these things exist in the wild a lot. I
| could accidentally assume that because I want to send userId=4
| the money to their default account that I would call it using
| `sendUserMoneyFromAccount(4, 7)` and that would work, but if
| under the hood it wants 2 accountIds, I've just sent
| accountId=4 money rather than userId=4's defaultAccount,
| accountId=5.
|
| With prefixed ids that indicate type, a function that assumes
| type differently from the one supplied will not accidentally
| succeed.
|
| In addition, humans who copy ids will be less likely to mistake
| them. This is just an ergonomic/human centric typing.
| dloreto wrote:
| That's how the type is encoded as a string, but type-safety
| ultimately comes from how the TypeID libraries allow you to
| validate that the type is correct.
|
| For example, the PostgreSQL implementation of TypeID would
| let you use a "domain type" to define a typeid subtype, thus
| ensuring that the database itself always checks the validity of
| the type prefix. An example is here:
| https://github.com/jetpack-io/typeid-sql/blob/main/example/e...
|
| In Go, we're considering making it easy to define a new Go
| type that enforces a particular type prefix. If you can do
| that, then the Go type system would enforce you are passing the
| correct type of id.
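|
| The same idea sketched in Python rather than Go, purely as an
| illustration of "one language-level type per prefix". `TypedId`,
| `UserId` and `PostId` are hypothetical names, not part of any
| TypeID library.

```python
from dataclasses import dataclass
from typing import ClassVar, Type, TypeVar

T = TypeVar("T", bound="TypedId")


@dataclass(frozen=True)
class TypedId:
    """Base class: subclasses pin a prefix and only accept matching IDs."""

    suffix: str
    prefix: ClassVar[str] = ""

    @classmethod
    def parse(cls: Type[T], text: str) -> T:
        prefix, sep, suffix = text.rpartition("_")
        if not sep or not suffix or prefix != cls.prefix:
            raise ValueError(f"expected a {cls.prefix!r} id, got {text!r}")
        return cls(suffix)

    def __str__(self) -> str:
        return f"{self.prefix}_{self.suffix}"


class UserId(TypedId):
    prefix = "user"


class PostId(TypedId):
    prefix = "post"


def load_user(user_id: UserId) -> str:
    """A static checker such as mypy rejects passing a PostId here."""
    return f"loading {user_id}"
```

| Parsing gives the dynamic check (wrong prefix raises ValueError),
| and the distinct classes give the static one.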
| [deleted]
| wood_spirit wrote:
| UUIDv7 has been taking HN by storm for years now! When is it
| going to become a proper standard, and when are libraries and
| databases and all the rest going to natively support it?
| vbezhenar wrote:
| What kind of support do you expect? I'm pretty sure that
| absolute majority of software does not care about any
| particular bits in UUID, so you can use it today. If some
| software cared about any particular bits, just imitate UUIDv4,
| I mean those bits could be randomly generated as well. If you
| need generation procedure, write it yourself, it's easy.
| Daegalus wrote:
| It's been going through drafts and improvements. It's very close
| to being standardized, and many libraries are supporting it
| already, or new offerings are being added. For example I
| maintain the Dart UUID library, and my latest beta major
| release has v6, v7 and a custom v8. There is a list of them
| somewhere, I know I get pinged on every new draft by the
| authors because I am listed as a library maintainer on one of
| their pages.
| Nelkins wrote:
| How much does it change between drafts? Close enough to where
| I could use it in production?
| Daegalus wrote:
| Seeing as how it's nearly done, it doesn't change much. It
| changed more often in the beginning, but it's on its
| final draft, or near final draft. I think the IETF plans to
| make it final soon.
| kijeda wrote:
| It would appear to be in the final stages of standardization in
| the IETF: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-
| rfc4122b...
| TeeWEE wrote:
| Good, but I don't see a big advantage over UUIDv7. Does anyone
| have some good ones?
| dloreto wrote:
| It's based on UUIDv7 (in fact, a TypeID can be decoded into an
| UUIDv7). The main reasons to use TypeID over "raw" UUIDv7 are:
| 1) For the type safety, and 2) for the more compact string
| encoding.
|
| If you don't need either of those, then UUIDv7 is the right
| choice.
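|
| To give a feel for the compact encoding, here is a rough sketch
| and not the reference implementation. It assumes the suffix is
| the 128-bit UUID written as 26 lowercase Crockford base32
| characters; `encode_suffix` and `decode_suffix` are illustrative
| names.

```python
import uuid

# Lowercase Crockford base32: digits plus letters, without i, l, o, u.
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"


def encode_suffix(u: uuid.UUID) -> str:
    """Render a UUID's 128 bits as 26 base32 characters (5 bits each)."""
    n = u.int
    chars = []
    for _ in range(26):               # 26 * 5 = 130 bits, top 2 bits are zero
        chars.append(ALPHABET[n & 31])
        n >>= 5
    return "".join(reversed(chars))


def decode_suffix(s: str) -> uuid.UUID:
    """Inverse of encode_suffix; raises ValueError on bad input."""
    if len(s) != 26:
        raise ValueError("suffix must be 26 characters")
    n = 0
    for ch in s:
        n = (n << 5) | ALPHABET.index(ch)  # .index raises ValueError if absent
    return uuid.UUID(int=n)                # raises ValueError if above 128 bits


def to_typeid(prefix: str, u: uuid.UUID) -> str:
    return f"{prefix}_{encode_suffix(u)}"
```

| That is 26 characters instead of the 36 of the usual dashed hex
| form, and the fixed width keeps lexical order equal to numeric
| order.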
| wg0 wrote:
| Can anyone guide me about the pros and cons of xid, ksuid and
| this type-safe option?
| timf wrote:
| I do a similar thing [1]. One of the great advantages to formally
| namespaced IDs is including a systematic conversion into strong
| types in your code. It's harder to accidentally mix things up
| when coding; function parameters and return tuples are more 'self
| documented' (and enforced by compiler where applicable).
|
| [1] - https://www.peakscale.com/strongly-typed-ids/
| bombela wrote:
| I have some complaints about UUIDs. Why not just combine time +
| random number without the ceremony of UUID versioning? And for
| when locality doesn't matter, just use a 128-bit random number
| directly.
|
| And in my experience most people somehow think a UUID must be
| stored in the human-friendly hex representation, dashes
| included, wasting so much space in database, network, memory.
| rjh29 wrote:
| Many people had the same idea. For example ULID
| https://github.com/ulid/spec is more compact and stores the
| time so it is lexically ordered.
| jerf wrote:
| While this isn't the worst area I see this in, there does seem
| to be a tendency in the UUID space to speak as if one use case
| stands for all and therefore there is _a_ best UUID format.
|
| The reality is that it is just like any other engineering
| situation. Sit down, write down your requirements, and see
| what, if anything, solves it.
|
| Reading about the advantages of various formats is very helpful
| in helping you skip learning about certain things the hard way
| and use somebody else's experience of learning them the hard
| way instead. From that point of view I recommend at least
| glancing through them all. Sortability and time-based locality
| is one that you may not naturally think about, and if you need
| it, you will appreciate not learning that the hard way four
| years into a project after you threw that data away and then
| realizing you needed it. And some UUID formats actually managed
| to introduce small security issues into themselves (thinking
| MAC address leak from UUID v1 here), nice to avoid those too.
|
| If you have a use case where there's an existing solution then,
| hey, great, go ahead and use it. Maybe if anyone ever needs
| that but in another language they can pull a library there too.
|
| But if not, don't sweat it. The biggest use of UUIDs I
| _personally_ have I specified as "just send me a unique
| string, use a UUID library of your choice if it makes you feel
| better". I think I've got a unique format per source of data in
| this system and it's fine. I don't have volume problems, it's
| tens of thousands of things per day. I don't have any need to
| sort on the UUID, they're not really the "identifier", they're
| just a unique token generated for a particular message by the
| originator of the message so we can detect duplicate arrivals
| downstream in a heterogenous system where I can't just defer
| that task to the queue itself since we have multiple. I don't
| even need them to be _globally_ unique, I just need them unique
| within a rather small shard, and in principle I wouldn't even
| mind if they get repeated after a certain amount of time
| (though I left the system enforcing across all time anyhow for
| simplicity). In this particular case, I do indeed generate my
| own UUIDs for the stuff I'm originating by just grabbing some
| stuff from /dev/urandom and encoding it base64, with a size
| selected such that base64 doesn't end the encoding with ==.
| Even that's just for aesthetics' sake rather than any actual
| problem it would cause.
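|
| Roughly what that looks like, as a sketch: `unique_token` is a
| made-up name, 18 bytes is just one size that works, and the
| URL-safe base64 variant is my choice here. Any byte count that
| is a multiple of 3 encodes without "=" padding.

```python
import base64
import os


def unique_token(num_bytes: int = 18) -> str:
    """Random token from the OS entropy source, base64 without padding."""
    if num_bytes <= 0 or num_bytes % 3:
        # base64 maps 3 bytes to 4 characters; other sizes get '=' padding
        raise ValueError("num_bytes must be a positive multiple of 3")
    return base64.urlsafe_b64encode(os.urandom(num_bytes)).decode("ascii")
```

| 18 bytes gives 144 random bits in a 24-character string.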
| stronglikedan wrote:
| > combining time + random number
|
| You can't guarantee that this will be _globally_ unique.
| ceejayoz wrote:
| No identifier can _guarantee_ that. We just get close enough
| to be acceptable.
|
| Per Wikipedia, the probability to find a duplicate within 103
| trillion version-4 UUIDs is one in a billion.
|
| so-youre-saying-theres-a-chance.gif
| [deleted]
| duped wrote:
| A billion is not that big of a number for UUIDs
| ceejayoz wrote:
| Re-read. You'd have to generate 103 trillion to have a
| one billion _th_ chance of a collision.
|
| A billion isn't that big a number, but 103 trillion is.
| jandrewrogers wrote:
| I think you made a mistake in your math. The Birthday
| Collision probability of just a trillion random UUIDs is
| much higher than that.
| ceejayoz wrote:
| Feel free to update https://en.wikipedia.org/wiki/Univers
| ally_unique_identifier#..., but it does note "This
| probability can be computed precisely based on analysis
| of the birthday problem". It does show the formula used.
| deathanatos wrote:
| Wikipedia is correct, AFAICT.
|
| The probability of 1 trillion UUIDs having a collision is:
|
|     import math
|
|     def birthday_collision(n, m):
|         return 1 - math.e ** (-((n - 1) * n) / (2 * m))
|
|     In : birthday_collision(1_000_000_000_000, 2 ** 122)
|     Out: 9.403589018575076e-14
|
| That number is roughly the approximation given in
| Wikipedia.
|
| I.e., at 1T UUIDs, it hasn't happened. For comparison,
| the odds of being struck by lightning (over a lifetime) are
| many orders of magnitude greater:
|
|     6.535947712418301e-05
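|
| The "103 trillion for one in a billion" figure quoted upthread
| can be checked with the same birthday approximation. This is a
| sketch that assumes 122 random bits per version-4 UUID and uses
| math.expm1, which avoids the precision loss of computing
| 1 - e**x for tiny x. `collision_probability` is just a name I
| picked.

```python
import math


def collision_probability(n: int, space: int) -> float:
    """Birthday approximation: 1 - exp(-n * (n - 1) / (2 * space))."""
    return -math.expm1(-(n * (n - 1)) / (2 * space))


# 103 trillion random UUIDs drawn from 2**122 possible values.
p = collision_probability(103 * 10**12, 2**122)
print(p)
```

| The result lands at roughly one in a billion, consistent with the
| Wikipedia number.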
| jandrewrogers wrote:
| I have single datasets with trillions of UUIDs. Collision
| probability becomes a thing.
|
| That aside, UUIDv4 is banned in many orgs because there
| have been several instances in the wild where the "random"
| number wasn't nearly as random as advertised from some
| sources for a variety of reasons, leading to collisions. It
| is relatively easy to screw this up so many orgs don't risk
| it.
| aartav wrote:
| I've been doing this kind of thing for years with two notable
| differences:
|
| 1. I don't believe people actually hand type-in these values, so
| I'm not really concerned about the 'l' vs '1' issue. I do base 32
| without `eiou` (vowels) to reduce the likelihood of words
| (profanity) sneaking in.
|
| 2. I add two base-32 characters as a checksum (salted of course).
| This prevents having to go look at the datastore when the
| value is bogus either by accident or malice. I'm unsure why other
| implementations don't do this.
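|
| A sketch of the checksum part. The construction is an assumption
| for illustration (HMAC-SHA256 keyed with the salt, truncated to
| 10 bits, written as two characters of the vowel-free alphabet),
| not necessarily what I run in production, and the function names
| are made up.

```python
import hashlib
import hmac

# 0-9 plus a-z without e, i, o, u: exactly 32 symbols.
ALPHABET = "0123456789abcdfghjklmnpqrstvwxyz"


def checksum(body: str, salt: bytes) -> str:
    """Two base-32 characters derived from a keyed hash of the ID body."""
    digest = hmac.new(salt, body.encode("ascii"), hashlib.sha256).digest()
    bits = int.from_bytes(digest[:2], "big") >> 6  # keep the top 10 bits
    return ALPHABET[bits >> 5] + ALPHABET[bits & 31]


def with_checksum(body: str, salt: bytes) -> str:
    return body + checksum(body, salt)


def is_plausible(token: str, salt: bytes) -> bool:
    """Cheap pre-check before touching the datastore (ASCII tokens only)."""
    body, check = token[:-2], token[-2:]
    return hmac.compare_digest(check, checksum(body, salt))
```

| With 10 bits, a random bogus value slips through about once in
| 1024 tries, so this is a filter in front of the lookup, not a
| replacement for it.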
| dloreto wrote:
| The checksum idea is interesting. I'm considering whether it
| makes sense to add it as part of the TypeID spec.
| sokoloff wrote:
| > base 32 without `eiou` (vowels) to reduce the likelihood of
| words (profanity) sneaking in.
|
| We had "analrita" as an autogenerated password that resulted in
| a complaint many years ago. Might consider adding 'a' as an
| excluded letter.
| michaelt wrote:
| Presumably base 32 means 26 letters + 10 digits - 4 banned
| letters
|
| So adding an excluded letter is not easy.
| [deleted]
| zrail wrote:
| I implemented number two as part of an encoding scheme a few
| months ago. I'm not sure how much it's saved in terms of
| database lookups but it's aesthetically pleasing to know it
| won't hit a more inscrutable error while trying to decode.
| kortex wrote:
| Does the prefix ("user_") get recorded in the DB (so every string
| in the column starts with the same "user_"), or are there
| constraints and other clever chicanery to save those bytes in
| every record? Or do modern DB engines even care? Is this
| premature optimization?
| carlsverre wrote:
| The authors have created a specialisation for Postgres that
| leverages a custom type which is a tuple of type and uuidv7:
| https://github.com/jetpack-io/typeid-sql/blob/main/sql/typei...
|
| This is more optimal for Postgres while making it slightly more
| difficult to interop between the db and the language (db driver
| needs to handle custom types, and you need to inject a custom
| type converter).
|
| And while there are hacks you can do to make storing uuid-
| alikes as strings less terrible for db engines, if you want the
| best performance and smallest space consumption (compressed or
| not) make sure to use native ID types or convert to
| BINARY/numeric types.
| jszymborski wrote:
| This is very similar to how I generate IDs in a project I'm
| working on.
|
| Example:
|
|     |-A-|-|------------B--------------|
|     NMSPC-9TWN1-HR7SV-MTX00-0H8VP-YCCJZ
|
|     A = Namespace, padded to 5 chars. Max 5 chars. Uppercase.
|     B = Blake3 hashed microtime with a random key.
|
| I like how it folds in a time component but that it also doesn't
| reveal the time it was generated.
|
| Here's the snippet:
| https://gist.github.com/jszym/d3c7907b7b6e916f68205c99e5e489...
| goostavos wrote:
| Namespacing identifiers in general is a great idea for handling
| that class of integration tests which cannot be fully
| isolated. It makes it easy to write all kinds of garbage from
| even concurrently running tests all without any of them
| colliding or accidentally reading each other's writes (because
| they are themselves namespace aware!). It's low effort to get
| all the pieces of your system to play along (often entirely
| transparent via DI), but gives a huge power to weight ratio.
| Basically deletes an entire class of problems which usually
| plague large, mature test suites.
| ajkjk wrote:
| Unrelated, but this links to "Crockford's alphabet",
| https://www.crockford.com/base32.html , which is a base-32 system
| that includes all alphanumeric characters except I and L (which
| are confusable with 1), O (which is confusable with 0), and U
| (????). The page says the reason for excluding U is "accidental
| obscenity". What the heck is it talking about?
| deanmen wrote:
| The F word has a U in it. Sure you could just say FVCK
| [deleted]
| jszymborski wrote:
| FUCK
| Racing0461 wrote:
| yep, YouTube video IDs have (had?) the same issue, where they
| would have things like fag/f4g etc in them.
|
| eg: google "allinurl:fag site:youtube.com"
| stronglikedan wrote:
| You can prevent _any_ obscenity, O and 0 confusion, and I
| and L confusion, just by excluding vowels. If someone
| interprets "f4g" in an offensive way, then they have
| bigger issues than can be dealt with in software.
| arcticbull wrote:
| There was that time Delta generated an "H8GAYS" PNR. [1]
| Pretty sure that's valid Crockford encoding too :)
| however, to your point, it does rely on 'A'. "H8G4YS"
| would likely still offend someone out there, though,
| given the kerfuffle in [1].
|
| [1] https://newsfeed.time.com/2013/12/17/delta-airlines-
| is-very-...
| kortex wrote:
| > But as White points out, it's a bit surprising that
| Delta didn't block this particular combination as a
| possibility. "I'm sure they removed many four-letter
| words that would be seen as offensive," he tells the
| Post. "I'm surprised that 'gays' and 'H8' weren't blocked
| as well."
|
| Oh sweet summer child (meaning Jeff White, not OP/GP). As
| someone who has implemented a censorship/filtering list,
| this is a UX problem on the same level of decidability
| as the halting problem. You can spend boundless time
| curating a list to flag/grawlix every possible string
| that would offend even the most prudish of prudes, and
| some would still get through. Such as the superficially
| benign "EATTHE"
|
| https://www.dailymail.co.uk/news/article-2039662/Virginia
| -dr...
| MR4D wrote:
| Obscenities change with language. That's why every
| language has them.
|
| Even programming languages have them. For instance, Basic
| has GOTO.
|
| /j
| ZeroClickOk wrote:
| and javascript has type coercion
| oleganza wrote:
| The problem is not being offended per se, but having your
| user id accidentally become "user_123fuck567" -- that's
| akin to having a vulgar license plate on your car's
| forehead. People don't appreciate how lucky they
| sometimes are.
| taosx wrote:
| Why do we care about obscenity in pseudo-random IDs and URLs?
| whimsicalism wrote:
| anglo morals
| programmarchy wrote:
| Yeah, wtf?
| codeulike wrote:
| If I and O are already excluded and you also exclude U that
| removes a lot of potential rude looking three letter
| combinations like *** and *** and *** and also the four letter
| ones like **** and **** and the dreaded ****. Of course because
| you have A then **** is still a possibility but very very
| unlikely
| titanomachy wrote:
| Wow I didn't know HN even had obscenity filters, and I've
| been here for many years.
|
| Guess that's a credit to the general civility of the
| community.
|
| EDIT: It appears that other people in this thread are freely
| using profanity, so either your comment was targeted by
| automation due to the unusual density of banned words, or
| it's a joke that went over my head :)
| [deleted]
| [deleted]
| rbera wrote:
| That explains it, I was very confused by what I assumed was
| self-censoring, since the comment didn't actually clarify
| anything. I wish there was an accepted way to disambiguate
| asterisks from server side filters.
| macintux wrote:
| I assumed this was a riff on the classic bash.org
| transcript.
|
| http://www.bash.org/?244321
| taberiand wrote:
| No obscenity filters, but there is a pretty good password
| filter I hear. For example, my password 'hunter2' will be
| all **** to you
| 9dev wrote:
| Isn't it nice how some traditions do stick around. It's
| been a while, Cthon98!
| AceJohnny2 wrote:
| you accidentally the whole thing
| hinkley wrote:
| A coworker and I came up with basically this same set about 4
| years before Crockford. We were trying to solve the url slug
| problem, and they were long enough that we felt 5 bits per byte
| would reduce transcription annoyances.
|
| In the end I think we had a couple of characters to spare, and
| so, sitting by ourselves because everyone else had gone home
| for the day, we ranked swear words by how offensive they were
| to prioritize removal of a few extra letters. Then I convinced
| him that slurs were a bigger problem so we focused on that,
| which got rid of the letter n, instead of u.
|
| tggr is just cute, n**r is an uncomfortable conversation with
| multiple HR teams (we were B2B)
|
| I'm a bit fuzzy now on what our ultimate character set was,
| because typically you're talking [a-z][0-9], and there are a lot
| of symbols you can't use in urls and some that are difficult to
| dictate. My recollection is that we eliminated 0, l, and
| 1, but I think we relied on transcription happening either from
| all caps or all lowercase. 0o are not a problem. Nor are 1L.
| hinkley wrote:
| Other comments are jogging my memory. I think we went case
| sensitive (62 characters -> 30 spares), eliminated aA4, eE3,
| iI1l, oO0 (maybe Q), uU, which is 16 characters, 14 to go.
| Remove the remaining 7 numbers (once you remove most for
| leetspeak what's the point of the rest?), nN, yY. That leaves
| 2 left and I can't recall what we did with those. Maybe kK or
| rR.
|
| Y is pretty versatile for pissing people off.
| avgcorrection wrote:
| > The page says the reason for excluding U is "accidental
| obscenity". What the heck is it talking about?
|
| Because he's an American?
| Zamicol wrote:
| There's more!
|
| - base 58 - Satoshi's/Bitcoin's
| https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58
|
| - "base62" - Keybase's saltpack
| https://github.com/keybase/saltpack
|
| - The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85
|
| - basE91 - https://base91.sourceforge.net
|
| At work we defined several new "bases" for QR codes. IMHO, it is
| an underapplied area of computer science.
| pavlov wrote:
| True Latinists find the letter U vulgar to the point of
| obscenity because it didn't exist in Cicero's time.
| oleganza wrote:
| Trve Latinists wovld appreciate yovr point.
| littlestymaar wrote:
| Gotcha, there was no "W" in the Latin alphabet either ;)
| kibwen wrote:
| _> The page says the reason for excluding U is "accidental
| obscenity"._
|
| Crockford is being cheeky. To make a nice base32 alphabet out
| of non-confusable alphanumeric characters you only need to
| exclude O, I, and L. This leaves you with 33 characters still,
| so you need to remove one more, and it doesn't matter which one
| you remove, so you might as well pick an arbitrary reason for
| the last character that gets removed (and it's not the worst
| reason, if your goal is to use these as user-readable IDs,
| although obviously it's not even remotely bulletproof).
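
The counting argument above can be checked in a few lines. This is a small sketch: the alphabet string and the decode-time aliases (O read as 0, I and L read as 1, hyphens ignored) follow Crockford's page, while the `normalize` helper name is hypothetical.

```python
import string

# 10 digits + 26 letters = 36 candidate symbols.
candidates = string.digits + string.ascii_uppercase

# Dropping the confusable letters I, L, and O leaves 33 symbols.
no_confusables = [c for c in candidates if c not in "ILO"]

# One more has to go to reach 32; Crockford's choice is U.
crockford = "".join(c for c in no_confusables if c != "U")


def normalize(text: str) -> str:
    """Map what a human typed onto the canonical alphabet before decoding."""
    cleaned = text.upper().replace("-", "")
    return cleaned.replace("O", "0").replace("I", "1").replace("L", "1")


print(len(candidates), len(no_confusables), len(crockford))
print(crockford)
```

Because the excluded letters are accepted as aliases when decoding, a transcription slip such as typing "O" for "0" still decodes to the intended value.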
| pluijzer wrote:
| You could argue that U can be confused with V.
| dmurray wrote:
| 5 and S seems more likely.
| mtlmtlmtlmtl wrote:
| A vaguely related historical tangent is that V and U used
| to be just two ways of writing the same letter in Early
| Modern English, which I imagine is why W is called
| "double U" in speech.
| jabbany wrote:
| This is also interesting since in French (and I think
| Spanish?) W is (correctly) called "double V"
| quickthrower2 wrote:
| Fvck!
| quickthrower2 wrote:
| U is a fairly new letter anyway.
| theptip wrote:
| The Scunthorpe problem?
|
| https://en.m.wikipedia.org/wiki/Scunthorpe_problem
| pizzapill wrote:
| E-mail accounts seem the worst. Let's just write letters
| again; if you need a pencil I recommend penisland.net
| programmarchy wrote:
| There's enough comedic content in this article for several
| Silicon Valley episodes.
| eezing wrote:
| "...can be selected for copy-pasting by double-clicking"
|
| Details matter.
| koito17 wrote:
| How does this compare to a SQUUID for sorting or nano-id for
| human readability? Both are options I've used in the past when
| using databases like Datomic or XTDB. SQUUIDs in particular
| because I have a UUID that can be ordered by timestamp, nano-id
| when prototyping things and I want meaningful prefixes in my
| entity IDs rather than a bunch of UUIDs.
| jtmarmon wrote:
| This looks great! Is there a reason one couldn't use this with v4
| UUIDs? A quick test shows that they encode/decode just fine.
| Wondering if I could use the encoded form as a way to niceify our
| URLs without having to change how the IDs (currently v4 uuids)
| are stored
| dloreto wrote:
| The CLI tool will support encoding/decoding any valid UUID,
| whether v1, v4, or v7. We picked v7 as the definition of the
| spec, because we need to choose one of them when generating a
| new random ID, and our opinion is that by default, that should
| be v7.
|
| We might add a warning in the future if you decode/encode
| something that is not v7, but if it suits your use-case to
| encode UUIDv4 in this way, go for it. Just keep in mind that
| you'll lose the locality property.
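
A minimal sketch of why any UUID version round-trips through the encoding: the suffix is just the 128-bit value written in base32, so the version bits are carried along like any other bits. This assumes a 26-character lowercase Crockford base32 suffix; `encode_suffix` and `to_typeid` are hypothetical names, not the CLI's or the libraries' API.

```python
import uuid

# Lowercase Crockford base32 alphabet (assumed suffix alphabet).
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"


def encode_suffix(u: uuid.UUID) -> str:
    """Write the UUID's 128-bit integer as 26 base32 characters."""
    value = u.int
    chars = []
    for _ in range(26):
        chars.append(ALPHABET[value & 31])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))  # most significant character first


def to_typeid(prefix: str, u: uuid.UUID) -> str:
    """Join prefix and suffix; an empty prefix yields the bare suffix."""
    suffix = encode_suffix(u)
    return f"{prefix}_{suffix}" if prefix else suffix


existing = uuid.uuid4()  # stands in for an ID already stored as UUIDv4
pretty = to_typeid("user", existing)
```

A v4 UUID has random leading bits, so strings produced this way will not sort by creation time the way v7-based ones do; the stored value itself does not need to change.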
___________________________________________________________________
(page generated 2023-06-28 23:00 UTC)