[HN Gopher] Show HN: 128-bit, roughly-ordered, URL-safe UUIDs
       ___________________________________________________________________
        
       Show HN: 128-bit, roughly-ordered, URL-safe UUIDs
        
       Author : amzans
       Score  : 154 points
       Date   : 2021-01-22 10:49 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | advisedwang wrote:
       | Just be careful if you use these as keys for a distributed system
       | that is susceptible to hotspotting. GCP's Cloud Spanner [1] has
       | this problem and has a good document explaining it.
       | 
       | As always - understand what is going on in your stack and apply
       | design with that in mind. There are no universals!
       | 
       | [1] https://cloud.google.com/spanner/docs/schema-
       | design#primary-...
        
         | crescentfresh wrote:
          | I had to pause when they used the term "anti-pattern". It may
          | be an anti-pattern for Cloud Spanner but it's not an anti-
          | pattern in other distributed systems, for example Cassandra.
        
           | majormajor wrote:
           | The very next word after "anti-pattern" is _if_.  "The
           | following table definition that uses a timestamp-based
           | primary key as the first key part is an anti-pattern if the
           | table will see a high rate of insertion:"
           | 
           | Once your data gets big enough there's no easy one-size-fits-
           | all. You gotta look at how you're going to be using the
           | system.
        
           | zug_zug wrote:
           | Right. This spanner thing is new to me, but looking at it, I
           | don't see why spanner doesn't/couldn't shard by modulo /
            | checksum to reduce this risk. And I think other systems do.
        
             | electricshampo1 wrote:
             | It shards it this way (by primary key ranges) to support
             | efficient range scans.
        
       | caleblloyd wrote:
       | This approach has become my preferred way of generating IDs for
       | database rows. It is relatively space efficient and doesn't
       | require a round trip to the database to generate an ID like auto
       | increment does.
       | 
        | While working on a C# implementation for MySQL, we found that
        | when the DB stores the GUID as little-endian binary, the byte
        | order must be reversed for values to be stored in order:
       | 
       | https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...
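        | 
        | A rough Python sketch of that byte-group swap (illustrative
        | only, not the actual Pomelo code):
        | 
        |     def guid_to_sorted_binary(guid_bytes: bytes) -> bytes:
        |         # .NET's Guid.ToByteArray() emits the first three
        |         # fields little-endian; reverse those groups so the
        |         # 16 bytes sort like the textual GUID.
        |         return (guid_bytes[3::-1] + guid_bytes[5:3:-1]
        |                 + guid_bytes[7:5:-1] + guid_bytes[8:])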
        
         | abhishekjha wrote:
          | Does a random UUID work as well as, let's say, an ordered
          | integer id or ordered uuid? I thought the ids should have a
          | definite order (not necessarily sequential) so that a binary
          | search can be performed on the primary index. Similar for
          | b-trees.
        
           | masklinn wrote:
            | Random uuids are perfectly ordered; there's no issue looking
           | up a single record by uuid.
           | 
           | The problems are that they're scattered during reads and
           | writes, and will require updating old pages.
        
             | kevincox wrote:
             | Note that scattering can actually be beneficial if you have
             | too much load for a single node when the data is hot. Of
             | course you can solve this with weird sharding such as
             | sharding on the least significant byte.
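              | 
              | E.g. a hypothetical routing rule (sketch, made-up name):
              | 
              |     shard = flake_id.int & 0xFF  # low byte is random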
        
               | SigmundA wrote:
               | Seems like this point is not understood very well by
                | many. Ordered keys can create hot spots on insert and
                | read; this can be good or bad depending on the backend.
        
               | mywittyname wrote:
               | A hot-spot in one database index is collocated data in
               | another. Sometimes you want uniform distribution of keys,
               | sometimes you want clusters of keys because different
               | databases are optimized based on assumptions about the
               | underlying distribution of keys.
        
               | kevincox wrote:
                | The classic Goldilocks problem. You want your pages to be
               | hot, but not too hot.
        
       | Ellipsis753 wrote:
       | I'd worry a bit about the chance of collisions.
       | 
       | They correctly state you can generate 50 million of them per
       | millisecond with only a one in a billion chance of a collision
       | (per millisecond).
       | 
       | Unfortunately this means the chance of a collision in a year is
       | 99.999999999998% which isn't great!
       | 
       | (I think they've given too many bits to the timestamp.)
       | 
        | Also, the uuid format it creates isn't a valid uuid, which is a
        | pity.
       | 
        | If you want "only" a one in a billion chance of a collision per
        | year, you can only generate about 300 per millisecond.
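        | 
        | (Back-of-envelope with the birthday bound, assuming 80 random
        | bits per millisecond bucket:)
        | 
        |     n = 50_000_000
        |     p_ms = n * (n - 1) / 2 / 2**80     # ~1.0e-9 per ms
        | 
        |     n = 300
        |     p_year = 31_536_000_000 * (n * (n - 1) / 2 / 2**80)
        |     # ~1.2e-9 per year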
        
         | pas wrote:
         | You'd run into a bigger problem than collisions a lot sooner
         | (IMHO): time ordering. "ms" precision is very misleading,
         | because keeping clocks really in sync while processing this
         | much data is a real challenge.
         | 
         | Sure, maybe it doesn't matter that one server/emitter and thus
         | a bunch of IDs are out of order, but if this is really just for
         | helping indexes, then a simple 0.1 second precision would make
         | more sense. (Or even simply truncating the timestamp to however
         | many bits to get ~2-30 sec precision, and using a data store
         | with LevelDB-like staged compaction. Which is virtually a
         | requirement at this scale.)
        
         | doersino wrote:
         | > Unfortunately this means the chance of a collision in a year
         | is 99.999999999998% which isn't great!
         | 
         | How was this computed? My intuition begs to differ.
        
           | jychang wrote:
           | There are 31,536,000,000 milliseconds in a year. If you do
           | (1-1E-9)^3.1536E10, you get a 2E-14 chance of no collision in
           | a year.
        
             | hoseja wrote:
             | I think the presumption is you will not be generating 50
             | million of them each millisecond.
        
               | alpaca128 wrote:
               | Which is a pretty safe assumption, considering just
                | generating those 128 bits 50 million times a millisecond
               | results in over 20 million TB of data in a year. I think
               | it's safe to say this is not a common scenario.
        
           | onetoo wrote:
           | Probability of no collision in a single millisecond: p_1 =
           | 999999999/1000000000
           | 
           | Milliseconds in a year: t = 1000 * 60 * 60 * 24 * 365
           | 
           | Probability of no collision happening for every millisecond
           | of a year: p_2 = p_1^t
           | 
           | Probability of a collision happening for at least one
           | millisecond of a year: p_3 = 1 - p_2
           | 
           | https://www.wolframalpha.com/input/?i=1+-+%28999999999%2F100.
           | ..
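            | 
            | Or numerically in Python (same computation):
            | 
            |     p_1 = 999_999_999 / 1_000_000_000
            |     t = 1000 * 60 * 60 * 24 * 365
            |     p_3 = 1 - p_1 ** t   # ~0.99999999999998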
        
           | mytailorisrich wrote:
            | There are close to 32 billion milliseconds in a year. If the
           | chance of a collision happening within a millisecond is 1 in
           | a billion then my intuition is that the probability of a
           | collision happening at least once a year is quite high,
           | indeed.
           | 
            | It's like a die: you have a 1 in 6 chance of getting a 6
            | every time you roll it. Now, if you roll the die 100 times,
            | the chance that you'll get a 6 at least once is pretty high.
        
         | ktpsns wrote:
          | I guess you assume one timeflake is generated per ms for one
          | year. That's about 3e13 timeflakes. Ordinary UUIDs run into
          | collisions on the order of 1e18 generated ids (cf.
          | https://en.wikipedia.org/wiki/
          | Universally_unique_identifier#...). So we are "only" 5 orders
         | of magnitude away. For me the collision quality feels the same.
        
           | Ellipsis753 wrote:
           | If you assume one per ms then timeflake and UUIDv1 both have
           | a 0% chance of collision as they encode the timestamp.
           | 
           | You might go above this rate some of the time though and we
           | might as well strive for perfection.
           | 
           | One of the nice things about UUIDs is that you can just merge
           | data together willy-nilly without worrying about collisions -
            | for example, if you get bought by Google. You can't really
           | do this with timeflake.
        
             | phkahler wrote:
             | >> If you assume one per ms then timeflake and UUIDv1 both
             | have a 0% chance of collision as they encode the timestamp.
             | 
              | They can be generated anywhere at any time. If two machines
             | are generating even 1 per hour, they will have collisions
             | if their clocks are synchronized. The random bits are
             | important.
        
         | zuzun wrote:
         | Look for a different solution if you need 1 quintillion IDs per
         | year. Noted.
        
       | m45t3r wrote:
       | Datomic uses a similar approach by default. They called it
       | squuid: https://docs.datomic.com/on-prem/identity.html#squuids
        
       | j-pb wrote:
       | Have you considered using a 32 bit time value with second
       | resolution?
       | 
        | That way you still get sort-ability, you get 16 bits more
        | entropy, fare better on 32-bit systems, and still get 130 years
        | of usage out of them.
        | 
        | Also you actively discourage people from trying funny stuff like
        | synchronisation and conflict resolution via the time component.
        | Because 1ms "that sounds kinda good enough :?", while 1s "that'll
        | never work!!!"
        | 
        | Amendment:
        | 
        | Thinking about it, using a 16-bit timestamp prefix with second
        | resolution and rollover should probably provide a good tradeoff
        | between improving insertion performance (the most expensive part
        | in terms of index touching), a 112-bit random value, and dense
        | usage of the timestamp field.
        | 
        | It also allows you to do things like "forgetting" old values
        | within an 18h timeframe.
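        | 
        | A quick sketch of both variants (my naming, assuming the same
        | 128-bit budget as Timeflake):
        | 
        |     import os, time
        | 
        |     def flake32s() -> int:
        |         ts = int(time.time()) & 0xFFFFFFFF  # ~136 years
        |         rand = int.from_bytes(os.urandom(12), "big")  # 96 bits
        |         return (ts << 96) | rand
        | 
        |     def flake16s() -> int:
        |         ts = int(time.time()) & 0xFFFF  # wraps every ~18.2h
        |         rand = int.from_bytes(os.urandom(14), "big")  # 112 bits
        |         return (ts << 112) | rand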
        
         | Tostino wrote:
         | I personally use: https://github.com/tvondra/sequential-uuids
         | 
         | The time component resolution is configurable, and you can make
         | it wrap around every so often, filling in gaps left by page
         | splits, updates, deletes, etc.
        
           | j-pb wrote:
            | Nice! This is kinda what I had in mind while writing that
            | amendment!
           | 
           | Thanks for sharing :D
        
           | Sytten wrote:
            | It's just a bit annoying because it is a C extension, which
            | can't be used in most managed DBs in the cloud.
        
             | Tostino wrote:
             | I had to re-write the time based function from that
             | extension in PL/PGSQL earlier in the week for someone on my
             | team who couldn't work with the C extension: https://gist.g
             | ithub.com/Tostino/ca104bff40e704b6db70b9af4926...
             | 
             | This can be used in RDS, etc.
        
         | dkarl wrote:
         | > Because 1ms "that sounds kinda good enough :?"
         | 
         | Ugh. You wouldn't think so, but I caught a case of this in
         | production, using millisecond timestamps as unique ids for
         | browser analytics. The developer felt that even though two
         | users _could_ in rare cases click at exactly the same
            | millisecond, their clocks weren't likely to be in perfect
            | sync with each other, and that this "additional level" of
            | randomness made a collision virtually impossible.
        
           | mywittyname wrote:
            | I worked on an analytics system where we appended 3 random
            | digits to the millisecond-resolution timestamp for exactly
            | this reason. It turns out collisions are far, far more
            | common than the math would suggest. Especially back in the
            | aughts, when browsers had a lot of individual quirks to deal
            | with.
        
           | Quekid5 wrote:
           | Oh dear god.
           | 
           | I suppose it can be counter-intuitive, but people really need
           | to learn about timer precision... and, more importantly,
           | statistics (basics, at least!).
           | 
           | I absolutely hate the 'combinatorics' part of statistics, but
           | distributions, etc. seemed pretty learnable to me... but then
           | I like continuous math, so...
        
             | j-pb wrote:
              | That's not even statistics. That's simply the pigeonhole
              | principle.
             | 
             | A second has 1000ms. If you have more than 1000 users doing
             | things per second, there MUST be a collision.
        
               | Quekid5 wrote:
               | Ah, but now you're discretizing! :D
               | 
                | ... also, it Technically depends on your throughput
                | (something something queueing theory).
        
         | Bogdanp wrote:
         | You can also pick your own epoch and use centiseconds since
         | that epoch for the time component. It's what I do in [1].
         | 
         | [1]: https://docs.racket-
         | lang.org/buid/index.html?q=buid#%28part....
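          | 
          | E.g. (the epoch value here is an arbitrary example, not
          | necessarily what buid uses):
          | 
          |     import time
          | 
          |     CUSTOM_EPOCH = 1_577_836_800  # 2020-01-01 UTC
          |     cs = int((time.time() - CUSTOM_EPOCH) * 100)  # centisec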
        
       | rafaelturk wrote:
       | This is really nice, anyone aware of a NodeJs alternative?
        
         | mgkimsal wrote:
         | https://github.com/ulid/spec isn't timeflake, but is a good
         | spec, and there are implementations in multiple languages.
         | there are 2 listed for js.
         | 
         | https://github.com/aarondcohen/id128 and
         | https://github.com/ulid/javascript
        
         | audionerd wrote:
         | @thi.ng/ksuid https://github.com/thi-
         | ng/umbrella/tree/develop/packages/ksu...
        
       | sparkling wrote:
       | > Roughly ordered (K-sortable), incremental timestamp in most
       | significant bits enables faster indexing and less fragmentation
       | on database indices (vs UUID v1/v4).
       | 
       | What does this mean for database performance in simple CRUD
       | scenarios?
        
         | theta_d wrote:
          | Using an ordered ID prevents index fragmentation. With random
          | IDs, inserts will be slower and ultimately so will reads. In a
          | database like MS SQL Server this is especially true, since
          | data is physically stored on disk based on the clustering key,
          | which is the PK by default.
        
       | amzans wrote:
        | For those interested, here's (IMO) a good explanation of why
        | completely random UUIDs hurt performance and can lead to
        | database fragmentation issues down the road (on MySQL):
       | 
       | https://www.percona.com/blog/2019/11/22/uuids-are-popular-bu...
       | 
        | It's part of what Timeflake aims to solve, but there are plenty
        | of alternatives too, depending on your exact requirements.
        
         | ForHackernews wrote:
         | This is not universally true. It depends on the database engine
         | and how its indexes are implemented.
        
           | zerd wrote:
           | It's pretty much true for any B-tree based index. You could
           | get around it by using a combined index on another timestamp
           | value, but that's going to be a very large index. Or reduce
           | it by partitioning by e.g. year-month. Even hash-based
           | indices would be slightly impacted due to less locality.
        
           | amzans wrote:
           | Thanks, I updated the comment to mention that the link is in
           | the context of MySQL's case.
        
           | fastball wrote:
           | Is this applicable to postgres?
        
             | calpaterson wrote:
             | It's applicable to the indexes that postgres uses. Postgres
             | does not put rows in the heap in order of primary key so
             | the issue is less serious than in innodb.
        
       | solicode wrote:
       | I like ordered UUIDs, but if you care about 100% strict and
       | accurate ordering you need to be careful about generating them in
       | app code vs. on the DB side. With the former + using multiple app
       | servers, server clock drift can become an issue. Which is why I
       | still generate my ordered UUIDs on my Postgres server. It means
       | writing some funky PL/pgSQL (or however you decide to do it... I
       | think you can run Python too), but it gets the job done.
       | 
        | I wish ordered UUIDs were built into more DBs natively. I
       | remember seeing UUIDv6 being proposed, but I don't know if/when
       | that will be standardised.
        
         | httgp wrote:
         | How does one generate ordered UUIDs right on the PostgreSQL
         | server?
        
           | solicode wrote:
           | There's probably a nicer way (or more performant at least),
           | but I created a function that uses the built-in
            | `uuid_generate_v1mc` and then reverses the order of the time
           | component so that high values come first.
        
           | Tostino wrote:
           | Use this: https://github.com/tvondra/sequential-uuids
           | 
           | If you cannot install extensions, I just wrote a PL/PGSQL
           | implementation for the time-based generator I could share.
        
             | Sytten wrote:
              | Please do, I have been looking for that. Posting it in the
              | repo would help a few people, I am guessing.
        
               | Tostino wrote:
               | https://gist.github.com/Tostino/ca104bff40e704b6db70b9af4
               | 926...
               | 
               | Posted it to the repo as well.
        
       | hankchinaski wrote:
        | Not quite sure why the requirement of being compatible with the
        | UUID 128-bit representation. I personally try to use KSUID
        | (160-bit) when possible and can't see any downside so far.
        
         | Tostino wrote:
         | Being 128-bits allows you to utilize the same database/object
         | types as UUID.
         | 
          | That's the main benefit IMO. I couldn't stick a KSUID into my
          | Postgres UUID column; I'd have to write a new data type to
          | store that.
        
         | [deleted]
        
       | chrisacky wrote:
        | Can someone expand on this, please?
       | 
       | > They should be random, but roughly-ordered over time so that
       | your MySQL/Postgres indices stay fast and efficient as the
       | dataset grows.
       | 
       | I'm currently facing an issue where the chosen uuid format was
       | v4. From my understanding this is the least preferable because
       | it's true random, which causes hell for mysql indices.
       | 
        | What I don't quite understand, however, is why is sorted random
        | better if I'm using an index in both cases? I'd like to know
        | what I should be doing and if there is a way to move/migrate if
        | possible.
        
         | masklinn wrote:
          | > What I don't quite understand, however, is why is sorted
          | random better if I'm using an index in both cases?
         | 
          | Data locality. The default index in most RDBMS is a btree,
          | which is ordered and clustered, so when you look up items in
          | an index, if they sort similarly they'll be in the same index
          | page(s) and thus be cheaper to retrieve. For PKs that means
          | when you go through your records it's cheap to get siblings.
         | 
          | When the ids are randomly distributed it's way more expensive
          | to add them, because they get added at random positions
          | (writes are scattered, which is bad), and when you retrieve
          | them you have to jump through the index at random, loading a
          | bunch of pages to get just one or two records from each.
         | 
          | Things are even worse if the table itself is clustered on the
          | PK, because now your scattered index is also messing with your
          | actual data reads.
        
         | 101008 wrote:
          | The index advantages would be the same when you think of
          | searching. But for inserting (and deleting) it is much faster
          | if they are already ordered. You (or rather, the DB engine)
          | don't need to traverse any trees or anything. If they are 100%
          | ordered, inserting new ones into the index is just appending
          | to the end of a list, which is much faster.
        
         | thristian wrote:
         | As a rule of thumb, a table that grows over time probably has a
         | "creation time" field, and a query that touches one row
         | probably touches other rows created around the same time (like
         | a weekly report). Since rows are probably stored on disk in
         | chronological order, you can index the "creation time" field
         | and do a good job, but if your primary key is already roughly
         | chronological you may be able to get away with one fewer index
         | and keep inserts/updates cheaper.
        
         | kijin wrote:
         | In MySQL and some other database systems, primary keys are
         | clustered indexes. A clustered index defines the order in which
         | data is physically stored on disk.
         | 
         | If you have an auto-incrementing primary key like most sane
         | people do, new rows will be sorted to the very end of the
         | table. Inserting data to this table essentially becomes an
         | append-only operation, which is very fast. On the other hand,
         | inserting random values all over the place is like inserting
         | data in the middle of a file and pushing back everything that
          | comes afterward, over and over again. Even on modern SSDs,
         | this is much slower than simply appending data to the end of a
         | file.
         | 
         | For best performance, your primary keys should be as close as
         | possible to auto-incrementing integers. In other words, each
         | new value should be greater than all the values that currently
         | exist in the table. Timestamps are quite useful for this
         | purpose.
        
           | pas wrote:
           | This is an InnoDB thing for MySQL. (The old MyISAM was a
           | simple append only thing with separate index file.)
        
       | abhishekjha wrote:
        | Query: for people who recommend UUIDs as a unique identifier in
        | a distributed system, does the above satisfy the need?
       | 
       | In the SO answer[0] uuid v4 is being recommended but that is the
       | most random version of uuid. Isn't that going to hurt the b-tree
       | performance by having a very high fanout? How does it help the
       | primary index?
       | 
       | [0] : https://stackoverflow.com/questions/33274291/postgresql-
       | uuid...
       | 
        | EDIT: Apologies, I meant low fanout. What I meant was that you
        | need something ordered to have a better organised b-tree node.
        | If all values are random, each node will have a lot of children,
        | increasing the search space. It might turn into a linked list in
        | the worst-case scenario.
        
         | tomsmeding wrote:
         | Excuse my possible ignorance, but how is more entropy in a
         | primary key going to hurt index performance? The fanout in a
         | b-tree is completely in control of the database, and not
         | dependent at all on properties of the data -- apart, of course,
         | from the _size_ of the data, since typically the node fanout is
            | precisely the number of keys that fit in one disk page.
         | Randomness does not influence the size of a UUID on disk.
        
           | rapidlua wrote:
           | With random keys updates are going to touch way more pages
           | than with sequential ones. It increases IO due to more dirty
           | buffers and journaling.
        
             | tomsmeding wrote:
             | Point taken. Quite true.
        
               | Tostino wrote:
               | More reading on this for Postgres:
               | https://www.2ndquadrant.com/en/blog/sequential-uuid-
               | generato...
        
         | mytailorisrich wrote:
         | > _Isn 't that going to hurt the b-tree performance by having a
         | very high fanout?_
         | 
         | High fanout helps minimise the depth of the tree and therefore
         | helps performance as it minimises the number of nodes to check.
        
           | abhishekjha wrote:
            | Sorry, I meant low fanout, as in a b-tree node will end up
            | having a lot of singular-valued children, ultimately
            | degrading to a linked list.
        
             | mytailorisrich wrote:
              | I think random binary keys are pretty good for B-trees
              | because they also lead to shallow depth and a balanced
              | tree instead of potentially degrading to a linked list.
        
         | StreamBright wrote:
         | We moved away from UUIDs for different reasons. It is hard to
         | work with UUIDs when multiple entities have them. You do not
          | know if you are looking at a user or a group if both are just
          | UUIDs. We created our own IDs with prefixes and base58
         | encoding for the random part, for example: user-
         | AhnPpn7CbbwPykmux11LJ39Jk. It has the same amount of randomness
         | in it and we can understand what we are looking at.
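          | 
          | Roughly like this sketch (details differ from our actual
          | code):
          | 
          |     import secrets
          | 
          |     B58 = ("123456789ABCDEFGHJKLMNPQRSTUVWXYZ"
          |            "abcdefghijkmnopqrstuvwxyz")
          | 
          |     def eid(prefix, nbytes=16):
          |         n = int.from_bytes(secrets.token_bytes(nbytes), "big")
          |         s = ""
          |         while n:
          |             n, r = divmod(n, 58)
          |             s = B58[r] + s
          |         return prefix + "-" + (s or "1")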
        
           | globular-toast wrote:
           | Why would you ever be in a situation where you have a UUID
           | but don't know what it refers to?
        
             | StreamBright wrote:
             | Because somebody (not necessarily you) wrote the following
             | code:
             | 
             | printf("%s - %s", id0, id1)
             | 
             | You can even do this:
             | 
             | printf("user-id %s group-id %s", group-id, user-id)
             | 
              | Do you think you will be able to quickly debug what is
              | wrong in a high-severity situation when a customer-facing
              | system is down?
             | 
             | I do not, so we use eids (external ids) that are easy to
             | identify.
        
             | seppel wrote:
             | * You stumble upon them in log files or error messages.
             | 
             | * Somebody asks you why some API call is not working.
             | 
             | * You send around Excel files to higher management (for
             | example in incidents).
        
               | globular-toast wrote:
               | Now I'm even more baffled. What kind of stupid logger
               | logs a uuid without a clue as to what kind of object it
               | identifies? Why would you need to communicate uuids to
               | higher management?
               | 
               | This makes me think of Douglas Adams. You know the answer
               | is 42 but you forgot what the question was. I really
               | can't imagine how this could ever happen in a business.
               | "Yes, sir, there was a problem with 1087." "1087 what?"
               | "I don't know, but it was number 1087."
        
               | rkalla wrote:
               | You seem... fun.
        
               | seppel wrote:
               | > What kind of stupid logger logs a uuid without a clue
               | as to what kind of object it identifies?
               | 
                | The code cannot know what kind of object a uuid
                | identifies; it can only assume that it identifies what it
                | expects. So you might see:
               | 
               | "Unknown person 550e8400-e29b-11d4-a716-446655440000" in
               | the log files. Much more helpful for debugging, however,
               | is: "Unknown person
               | order-550e8400-e29b-11d4-a716-446655440000". Then you
               | know somebody accidentally put an order id where a person
               | id belongs.
               | 
               | This is an easy example, in the real world you have
               | persons, groups, representatives, accounts, acls, etc,
                | which are easily confused. If you just have a uuid, you
                | will have a hard time figuring out what it actually is.
        
               | SigmundA wrote:
               | This is why you have column headers?
               | 
                | If you just want it to look that way in a text log, then
                | sure, prepend the type to the id, wasting some space
                | there. But using it throughout your db eats up tons of
                | space, from both text-encoding the id and repeating the
                | column name in the data over and over.
        
               | seppel wrote:
               | > This is why you have column headers?
               | 
                | I'm not sure I can follow. You get a uuid from
                | somewhere, probably external code that you cannot even
               | look at. The uuid comes stringly-typed, let's say as
               | json. Then you have no chance to know what type it is,
               | you can only assume it is the correct type that you
               | expect.
               | 
               | Just putting "550e8400-e29b-11d4-a716-446655440000" into
               | a person colum does not make it a person id.
               | 
               | If you get instead
               | "order-550e8400-e29b-11d4-a716-446655440000" as a typed-
               | uuid, you know for sure that this is not a person id. It
               | is an order id. Somebody made a mistake somewhere. You
               | also know that you probably have to look in the order DB
               | to debug what it actually refers to.
               | 
               | > but to use it throughout your db eats up tons of space
               | from both text encoding the id
               | 
                | You don't need to put your uuid in this format into the
               | db. You can add/remove the type before/after accessing
               | the db.
        
               | SigmundA wrote:
               | Just putting "order-550e8400-e29b-11d4-a716-446655440000"
               | does not make it an order id either, thats what foreign
               | keys are for, mistakes can be made either way.
               | 
                | If you're sending ids via json you should be using json
                | to encode the type, either:
               | 
               | { order:"550e8400-e29b-11d4-a716-446655440000" }
               | 
               | Or if for some reason the value could contain multiple
               | types:
               | 
               | { thing: { t:"order",
               | id:"550e8400-e29b-11d4-a716-446655440000" } }
               | 
               | Even then just use:
               | 
               | { thing: { order :"550e8400-e29b-11d4-a716-446655440000"
               | } }
               | 
                | This is redundant and adds unnecessary overhead:
               | 
               | { order: "order-550e8400-e29b-11d4-a716-446655440000" }
               | 
                | Storing that in your db as json is even worse (Mongo,
                | JSONB); storing it as varchar in a normal column is not
                | much better.
               | 
                | If you're not storing the ID with a type prefix in the
                | db, then that's not the form of your ids, just the
                | UI/encoding for them in your app, which eliminates half
                | the supposed advantage (easy debugging, say, in db query
                | tools). It also means you are parsing/deparsing
                | "order-550e8400-e29b-11d4-a716-446655440000" somewhere
                | in your db mapper code instead of just using your json
                | or query string parser or protobuf or whatever. Why?
        
               | seppel wrote:
               | > Just putting
               | "order-550e8400-e29b-11d4-a716-446655440000" does not
               | make it an order id either, thats what foreign keys are
               | for, mistakes can be made either way.
               | 
               | I'm not talking about the DB level here, I'm talking
                | about what goes over the wire or ends up in log files.
                | And if you are handing out the uuids to your consumers,
                | you can be reasonably sure that you only get back the
                | uuids you handed out. So if
                | "order-550e8400-e29b-11d4-a716-446655440000" is not an
                | order, it's an error on your side and not on your
                | client's side.
               | 
               | > If your sending id via json you should be using json to
               | encode type either:
               | 
               | Yes and you should also not make any mistakes.
        
               | SigmundA wrote:
               | So it all comes down to less mistakes made with:
               | 
               | order:"order-550e8400-e29b-11d4-a716-446655440000"
               | 
               | vs
               | 
               | order:"550e8400-e29b-11d4-a716-446655440000"
               | 
                | Either one will error in a sane system that checks to
                | make sure the UUID is actually in the db, and either one
                | can be mishandled by sending the wrong id; you just have
                | some extra redundant bytes double-confirming the key type
                | while the actual ID still has to be verified.
               | 
                | The nice thing about UUIDs over, say, ints is that there
                | should be no overlap, so it should be easy to confirm an
                | error like that. Double-encoding the key from external
                | APIs, sure, I guess a very slight reduction in the time
                | taken to verify an error; for logs, aren't you logging
                | the json with the key already?
               | 
                | Of course this whole discussion was about time-ordered
                | UUIDs, which are mostly useful just for DBs. If we are
                | just talking about how IDs are formatted for external
                | use in the app, well geez, I thought that was what query
                | strings and json were for, but ok.
        
               | StreamBright wrote:
               | >> but to use it throughout your db eats up tons of space
               | 
               | I think human resources and outages cost a lot more than
               | disk space these days. If I can trade disk space for
               | either of those I will do it in a blink of an eye.
        
               | SigmundA wrote:
                | Not just disk space; a larger index size means less fits
                | in caches and memory, more I/O, etc.
               | 
               | These are keys which are typically used constantly by an
               | app in the db with joins etc. They are some of the
               | hottest data typically and you want to avoid cache misses
               | on them if possible.
        
               | Hnsuz wrote:
               | Exactly.
        
               | kube-system wrote:
               | I can think of a lot of reasons why it might happen.
               | 
                | Maybe you have a process that accepts multiple types of
               | objects. Maybe you think you are passing in the correct
               | type of object but you are not. Maybe the person
               | reporting the error to you omitted information. Maybe the
               | person who wrote the application didn't log enough
               | context. Yes, all of these situations are not ideal, but
               | if the process was ideal you wouldn't be debugging it in
               | the first place.
               | 
               | > I really can't imagine how this could ever happen in a
               | business. "Yes, sir, there was a problem with 1087."
               | "1087 what?" "I don't know, but it was number 1087."
               | 
               | This happens _all_ the time in many businesses. Users
                | rarely generate perfect error reports, and it's not
               | uncommon for intermediaries to incorrectly communicate
               | details either. What developer hasn't seen an error
               | report that consists of a vague screenshot with "doesn't
               | work, got this error" as the description?
        
               | globular-toast wrote:
               | > Users rarely generate perfect error reports, and it's
               | not uncommon for intermediaries to incorrectly
               | communicate details either. What developer hasn't seen an
               | error report that consists of a screenshot with "doesn't
               | work, got this error" as the description?
               | 
               | And the solution is to have them repeat a UUID to you? I
               | don't think so...
        
               | kube-system wrote:
               | Something can be not a solution to a problem, but
               | contribute to making your life easier. This is that.
               | 
               | There are certainly applications that have type-ambiguous
               | IDs which work just fine too. Not all engineers make the
               | same decisions; that's ok.
        
               | jamescampbell wrote:
               | I am with you. This makes no sense to me. But I also have
               | wrapped userdata or account data in a generated hash to
               | quickly have access to the underlying account info /
               | expiration info etc. I think it is easier to justify why
               | to implement vs. why not to implement.
        
           | SigmundA wrote:
            | Looks like it's almost twice the size of a UUID, at 30 bytes
            | in this case with a short prefix. I'm assuming you're
            | storing this as an ASCII string? That's a lot of extra bytes
            | for some convenience when viewing logs or raw data etc.
            | 
            | UUID's stored properly in binary form are only 16 bytes,
            | which can still be a concern over an 8 or 4 byte integer
            | when you have a lot of them all over the place in a large db
            | (larger index, more memory use).
        
             | StreamBright wrote:
             | You can store UUIDs as bytes if you wish but when you are
              | looking at it in a human-readable form it will still be a
             | UUID. I think storing UUID as text in 2021 is not that
             | hard.
             | 
             | >> UUID's stored properly in binary form are only 16 bytes
             | 
              | Depending on what your use case is.
        
               | SigmundA wrote:
               | >> You can store UUIDs as bytes if you wish but when you
                | are looking at it in a human-readable form it will still
                | be a UUID. I think storing UUID as text in 2021 is not
               | that hard.
               | 
                | You're not going to store a text type prefix + UUID in a
                | 16-byte UUID column; you will need a variable-length text
                | column. All your DB tools will already render the UUID in
                | human-readable form along with the column name, so there
                | is no need to store that in the data for every value.
               | 
                | If you're looking at network traffic or debugging, the id
                | will hopefully be keyed (json, query string, protobuf,
                | etc). If you're looking at plain text logs with no
                | schema, sure, emit the type-UUID for readability, in your
                | logging code...
               | 
                | >> Depending on what your use case is.
               | 
                | The only use case I can think of is unstructured
                | logging, or maybe using a file name as a key? I guess if
                | your db is just key-value with string keys, maybe
                | (basically a file system).
        
           | DJBunnies wrote:
            | UUID v5 is namespaced.
        
             | skissane wrote:
              | But given a v3 or v5 UUID, you can't tell what namespace
              | it belongs to. Both are one-way functions from
              | (namespace, name) -> uuid. The gp is asking for a way to
              | tell a user uuid apart from a group uuid just by looking
              | at the uuid.
             | 
             | That is actually possible with the v2 uuids used with DCE.
             | But, since just about nobody uses DCE any more, just about
              | nobody uses v2 uuids either.
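              | 
              | To illustrate the one-way property with Python's stdlib:
              | 
              |     import uuid
              | 
              |     u = uuid.uuid5(uuid.NAMESPACE_URL, "user/123")
              |     # deterministic for (namespace, name), but u's bits
              |     # don't reveal which namespace produced it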
        
               | pbronez wrote:
               | What is DCE?
        
               | skissane wrote:
               | https://en.m.wikipedia.org/wiki/Distributed_Computing_Env
               | iro...
        
       | [deleted]
        
       | Dylan16807 wrote:
       | Base62? That seems strictly worse than base64 to me. Why that
       | choice?
        
         | cuspycode wrote:
         | 62 is the number of letters and digits in ASCII, so it's the
         | most compact encoding that doesn't need special punctuation
         | characters (which aren't always safe to use).
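          | 
          | A minimal sketch (the exact alphabet ordering is an
          | assumption on my part):
          | 
          |     B62 = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          |            "abcdefghijklmnopqrstuvwxyz")
          | 
          |     def b62(n, width=22):  # 62^22 > 2^128
          |         s = ""
          |         while n:
          |             n, r = divmod(n, 62)
          |             s = B62[r] + s
          |         return s.rjust(width, "0")  # pad to keep sortable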
        
           | Dylan16807 wrote:
           | How many situations don't let you use _ and -? URLs certainly
           | do.
        
         | zamalek wrote:
         | The two (or three if you retain padding) additional characters
         | are symbols and can cause issues in many places. Base64-url and
         | other variants exist for this reason, but there is no single
         | base64 that works everywhere.
        
           | Dylan16807 wrote:
           | There's no "single base62 that works everywhere" either. So
           | while there's valid criticism to be had of base64, I still
           | don't see how this choice is _better_.
        
       | sltkr wrote:
       | This isn't strictly speaking a UUID. It's just a 128-bit
       | identifier. It's "compatible with UUID" only in the sense that
       | both are 128 bits wide, but this scheme doesn't have the
       | structure of a UUID.
       | 
        | To summarize the scheme, a 128-bit id is generated by
        | concatenating:
        | 
        |   - 48 bits: timestamp (milliseconds since UNIX epoch)
        |   - 80 bits: random data
       | 
       | These ids have the property that they're ordered by generation
       | time (ids generated at the same millisecond are ordered
       | randomly). The result can then be encoded in base 16 or base 62
       | to make it URL-compatible.
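        | 
        | In Python, a minimal sketch of that layout (not the library's
        | actual code):
        | 
        |     import os, time
        | 
        |     def make_id() -> int:
        |         ts = int(time.time() * 1000) & ((1 << 48) - 1)
        |         rand = int.from_bytes(os.urandom(10), "big")  # 80 bits
        |         return (ts << 80) | rand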
       | 
       | By the way, encoding timestamps into an id is not always
       | desirable. It has some privacy implications, since now you can
       | tell from the id when the resource was created. If this
       | corresponds with user activity (e.g. a user filling out a form or
       | uploading an image) having this information encoded in a URL is
       | possibly undesirable.
        
         | skissane wrote:
         | You are talking about a v1 UUID.
         | 
         | A random number is a valid v4 UUID, provided you set the
         | version and variant bits to the correct value. A random 128-bit
         | number won't have those bits correctly set (except by chance).
         | Set them, and you've made a valid v4 UUID
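          | 
          | E.g.:
          | 
          |     import os, uuid
          | 
          |     b = bytearray(os.urandom(16))
          |     b[6] = (b[6] & 0x0F) | 0x40  # version nibble -> 4
          |     b[8] = (b[8] & 0x3F) | 0x80  # variant bits -> 10xx
          |     u = uuid.UUID(bytes=bytes(b))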
        
           | sltkr wrote:
           | As you say, you need to set certain bits to special values to
           | encode a v4 UUID, leaving less than 128 bits for the payload,
           | which means a random 128 bit number is not a valid UUID
           | except by chance.
           | 
           | Second, a v4 UUID should set all remaining bits randomly.
           | Encoding a timestamp in some of the bits is not random. Sure,
           | you could argue these could have been randomly generated by
           | chance, so any such UUID is not invalid, strictly speaking.
           | Still, a collection of these v4 UUIDs don't have the
           | properties that you would expect of random UUIDs.
           | 
            | As specified, the scheme satisfies neither of these two
            | requirements.
        
             | withinboredom wrote:
             | There's always the expired v6 proposal:
             | https://datatracker.ietf.org/doc/draft-peabody-dispatch-
             | new-...
        
           | liversage wrote:
           | Exactly. And the v4 UUID will have 122 bits of randomness.
        
       | sudhirj wrote:
        | There's a spec called ULID that's pretty much this, with a
        | base32 encoding by default.
       | 
       | https://github.com/ulid/spec
       | 
       | I've also worked on a UUID-ULID bridge for Go
       | 
       | https://github.com/sudhirj/uulid.go
       | 
       | And seeing as this is just 128 bits it's quite easy to move
       | seamlessly between formats and representations.
       | 
       | I've found this concept especially useful in nosql stores like
       | DynamoDB, where using a ULID primary key makes objects time
       | sortable automatically. It's also quite easy to query for items
        | by zeroing out the random component and setting only the
        | timestamp bytes.
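        | 
        | E.g., assuming the 48/80 bit split, the bounds for a given
        | millisecond are (sketch):
        | 
        |     def ms_range(ms: int):
        |         lo = ms << 80              # random bits all zero
        |         hi = lo | ((1 << 80) - 1)  # random bits all one
        |         return lo, hi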
        
         | mfollert wrote:
          | This is super useful and exactly what I need for a current
          | project. Thank you for that hint!
        
         | j-pb wrote:
         | ULID has a serious security flaw though. In order to try to
         | improve performance / increase prefix similarity, there is a
         | mode where it reuses the random value, and just increments it,
         | for each new ULID.
         | 
          | With purely random UUIDs, you can be pretty sure that nobody
          | can guess them, so they're essentially identifier and access
          | token in one.
          | 
          | However, once you can easily predict the next uuid from the
          | previous one (e.g. by incrementing the random part by one),
          | the access-token property vanishes.
         | 
          | At least this proposal doesn't make the same mistake. However,
          | 80 bits isn't THAT much entropy. (Remember every bit counts
          | with exponential growth; 48 bits less means 281474976710656
          | times easier to guess.)
         | 
          | So it boils down to: would you be ok securing all your data
          | with 10-character one-time passwords, which might not be
          | guessed all at once, but where guessing SOME is sufficient to
          | get owned?
         | 
         | Death by a thousand paper-cuts.
        
           | sudhirj wrote:
           | Which implementation are you referring to? The Go package
           | uses crypto.random and generation blocks if the system can't
           | provide enough randomness. It's possible to override this
            | with custom implementations, but either way it's no less
            | secure than a UUID unless the implementation is horrible.
           | 
           | The spec doesn't specify a source of randomness - an
            | implementation that uses https://xkcd.com/221/ will of
            | course not be a very good one.
        
             | sudhirj wrote:
             | If you're referring to the monotonic counter
             | implementation, then yes, that's a documented choice you
             | can make if you explicitly want strictly increasing ULIDs
              | within the same millisecond, but given that you can't
              | ensure this across servers anyway, it's more of an edge
              | case.
        
               | j-pb wrote:
                | Yes, monotonic counters are ULID's "billion dollar
                | mistake".
               | 
               | They're not an edge case, in that the standard displays
               | them quite prominently. They're actually the only code
               | examples given.
               | 
               | The standard should deprecate them, because people WILL
               | use them "wrong".
        
               | sudhirj wrote:
               | Dunno about that. The implementations I've seen so far
               | default to /dev/random and make you jump through hoops to
               | get the monotonic counter, so if you manage to enable it
               | they should assume it's what you want. I've actually used
               | them effectively in a couple of NoSQL problems where I
               | didn't need cryptographic security, just distributed
               | timestamps - I was creating two sequential events (create
               | and update), and they had to always order one after
               | another, even though I created them in the same loop and
               | millisecond.
        
           | amzans wrote:
           | Thanks for the comment!
           | 
            | Yes, I'd like to clarify that Timeflake (and many of the
            | suggested alternatives) is NOT meant to be used for
            | security.
           | 
           | There's an important note on the readme about this:
           | 
           | > Since the timestamp part is predictable, the search space
           | within any given millisecond is 2^80 random numbers, which is
           | meant to avoid collisions, not to secure or hide information.
           | You should not be using timeflakes for password-reset tokens,
           | API keys or for anything which is security sensitive. There
           | are better libraries which are meant for this use case (for
           | example, the standard library's secrets module).
        
             | j-pb wrote:
              | You have to ask yourself the question though: if "URL-
              | safety" is a main feature, what URLs / user data are NOT
              | security relevant?
             | 
              | Leaking users' private pictures / documents / whatever
              | will also kill a company.
             | 
              | I guess one could maybe circumvent this by using a hash of
              | the ID, but then you have to store that somewhere too, and
              | you're back to square one.
             | 
             | Nevertheless, I like that you fixed the main flaw with
             | ULID. Also you provide way better insight into the
             | tradeoffs, so kudos!
        
               | remcob wrote:
               | > using a hash of the ID
               | 
               | Hashing the IDs won't solve their lack of entropy. Crude
                | example: if you hash your PIN code I still have only 10^4
               | values to try.
               | 
               | The easiest way to fix this is to add an access token
               | column that is cryptographically random and use both the
               | ID and the token in the URL.
               | 
               | If you trust the 80 bits already in the ID the token only
               | needs to be 80 bits for a total of 160 bits of entropy.
               | But if you do that you have to make sure that a missing
                | ID and an invalid token are handled identically from the
                | attacker's perspective (same reply, same timing).
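                | 
                | Sketch (names made up):
                | 
                |     import secrets
                | 
                |     def file_url(file_id):
                |         # store this token with the row; 80 bits
                |         token = secrets.token_urlsafe(10)
                |         return f"/f/{file_id}?t={token}"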
        
               | j-pb wrote:
               | > Hashing the IDs won't solve their lack of entropy.
               | 
                | Yes and no. It would still significantly enlarge the
                | search space, as an attacker wouldn't get a point of
                | reference to latch on to.
        
         | spiffytech wrote:
         | I adopted ULID for a project after seeing it recently mentioned
         | in another HN comments thread[0] and it's very useful. The
         | project uses Azure Table Storage, which encourages querying by
         | ranges of record IDs for efficiency, instead of filtering
         | datasets by arbitrary record properties.
         | 
         | My project revolves around processing windows of time in
         | timestamp-ordered datasets, and ULID works well as a way to
         | generate record IDs in a distributed system that naturally sort
         | in timestamp order. With ULID, all of my major queries can
         | operate on ranges of record IDs, as Azure Table Storage
         | encourages.
         | 
         | [0] https://news.ycombinator.com/item?id=25650907
        
         | mgkimsal wrote:
         | I've been using ULID on a couple projects. In one case, we're
         | trying to convert an existing project to use them, and... it's
         | a bit of a pain, but it's not ULID's fault as much as just...
         | changing primary keys all over the place is time consuming and
         | not simple.
         | 
         | The biggest pushback (and it's not a terribly big one, imo)
         | I've read on ULID is that the time portion is sold as
         | 'sortable' but is 'only ms-level resolution'. Apparently the
         | critics I was initially reading on ULID needed sub millisecond
         | accuracy, and somehow thought this made ULIDs bad. Seemed a
         | non-useful criticism, as needing that level of time granularity
         | isn't a day to day issue for most people, but might be
         | something to be aware of if the idea of overloading one field
         | with multiple uses is a requirement you have.
        
           | techdragon wrote:
           | If you're an edge case you're an edge case. It's frustrating
           | when people start assuming their use case is normal without
           | any considerations of the wider world.
           | 
           | Sub millisecond precision sorting of high speed database
           | inserts is hardly a normal requirement. At that point even
           | the size of your transaction can matter (depending on the
           | database) and you have to start asking questions like do I
           | want ordering by the time the transaction starts or the time
            | it ends? And if you're loading existing data from other
            | sources and care about date ordering, you should just be
            | using a dedicated column with the appropriate database
            | date/time or timestamp type anyway.
        
       | calpaterson wrote:
        | One of the issues this is trying to resolve is that v4 UUIDs'
       | wholly random nature makes them less than brilliant in the
       | b-trees that SQL databases use for indices (I am actually not
       | sure how much that really matters in practice for most projects).
       | 
       | However, Postgres has an alternate index implementation that uses
       | a hash function. It is crash-safe now (previously wasn't) but you
       | still can't use it to enforce a unique constraint. If you can use
       | that, I suspect it might resolve the problem and would be easier
       | than switching UUID generation algorithm.
        
         | withinboredom wrote:
         | It's a shame v6 uuids didn't become a thing[1]. It's the entire
         | reason they were invented.
         | 
         | 1: https://datatracker.ietf.org/doc/draft-peabody-dispatch-
         | new-...
        
       | techdragon wrote:
       | The biggest thing I can highlight after having used Standard
       | UUIDs and heavily using ULID values including building my own
       | library and testing compatibility and behavioural compliance of
       | the library with other database variants than the big 4 (MySQL,
       | MS SQL, SQLite and PostgreSQL) ... is that _no PK format should
       | ever be considered suitable for all use cases_ its all about
       | trade offs.
       | 
       | If you care about the order of something you shouldn't expect any
       | PK to handle that alone, even auto-incremented integers will
       | start to fail you once you have to scale your write workload
       | horizontally. If there's a property of your data that you want to
       | be able to order it by then you just need to make a column of the
       | appropriate data type and store it. If I had a nickel for every
        | time someone has looked at a schema and said something like "Do
        | we really need this date for ordering? The table is already
        | sorted by insert order..." I'd have a lot more nickels than I
        | want.
        
       | varispeed wrote:
       | This is assuming that the time is linearly increasing, but what
        | if you have a service running on a fast-moving object between
        | planets, or in an alternative dimension where time is not
        | linear or not monotonic?
        
       ___________________________________________________________________
       (page generated 2021-01-22 23:01 UTC)