[HN Gopher] Show HN: 128-bit, roughly-ordered, URL-safe UUIDs
___________________________________________________________________
Show HN: 128-bit, roughly-ordered, URL-safe UUIDs
Author : amzans
Score : 154 points
Date : 2021-01-22 10:49 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| advisedwang wrote:
| Just be careful if you use these as keys for a distributed system
| that is susceptible to hotspotting. GCP's Cloud Spanner [1] has
| this problem and has a good document explaining it.
|
| As always - understand what is going on in your stack and apply
| design with that in mind. There are no universals!
|
| [1] https://cloud.google.com/spanner/docs/schema-
| design#primary-...
| crescentfresh wrote:
| I had to pause when they used the term "anti-pattern". It may be
| an anti-pattern for Cloud Spanner but it's not an anti-pattern
| in other distributed systems, for example Cassandra.
| majormajor wrote:
| The very next word after "anti-pattern" is _if_. "The
| following table definition that uses a timestamp-based
| primary key as the first key part is an anti-pattern if the
| table will see a high rate of insertion:"
|
| Once your data gets big enough there's no easy one-size-fits-
| all. You gotta look at how you're going to be using the
| system.
| zug_zug wrote:
| Right. This Spanner thing is new to me, but looking at it, I
| don't see why Spanner doesn't/couldn't shard by modulo /
| checksum to reduce this risk. And I think other systems do.
| electricshampo1 wrote:
| It shards it this way (by primary key ranges) to support
| efficient range scans.
| caleblloyd wrote:
| This approach has become my preferred way of generating IDs for
| database rows. It is relatively space efficient and doesn't
| require a round trip to the database to generate an ID like auto
| increment does.
|
| While working on a C# implementation for MySQL, we found that
| when the DB uses little-endian binary storage, the GUID must be
| byte-reversed to store in order:
|
| https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...
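|
| A quick way to see the mixed-endian layout from Python (bytes_le
| mirrors the byte order .NET's Guid.ToByteArray returns):
|
|     import uuid
|     u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
|     print(u.bytes.hex())     # 00112233... (RFC 4122 big-endian order)
|     print(u.bytes_le.hex())  # 33221100... (first three fields swapped)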
| abhishekjha wrote:
| Does a random UUID work as well as, let's say, an ordered
| integer id or an ordered uuid? I thought the ids should have a
| definite order (not necessarily sequential) so that a binary
| search can be performed on the primary index. Similar for b-trees.
| masklinn wrote:
| Random uuids are perfectly ordered; there's no issue looking
| up a single record by uuid.
|
| The problems are that they're scattered during reads and
| writes, and will require updating old pages.
| kevincox wrote:
| Note that scattering can actually be beneficial if you have
| too much load for a single node when the data is hot. Of
| course you can solve this with weird sharding such as
| sharding on the least significant byte.
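|
| For example, a shard router might look like this (a hypothetical
| sketch; num_shards and the id layout are assumptions):
|
|     def shard_for(id_bytes: bytes, num_shards: int) -> int:
|         # The low byte is random even in time-ordered ids, so it
|         # spreads hot, recent keys evenly across shards.
|         return id_bytes[-1] % num_shards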
| SigmundA wrote:
| Seems like this point is not understood very well by
| many. Ordered keys can create hot spots on insert and
| read, this can be good or bad depending on backend.
| mywittyname wrote:
| A hot-spot in one database index is collocated data in
| another. Sometimes you want uniform distribution of keys,
| sometimes you want clusters of keys because different
| databases are optimized based on assumptions about the
| underlying distribution of keys.
| kevincox wrote:
| The classic goldilocks problem. You want your pages to be
| hot, but not too hot.
| Ellipsis753 wrote:
| I'd worry a bit about the chance of collisions.
|
| They correctly state you can generate 50 million of them per
| millisecond with only a one in a billion chance of a collision
| (per millisecond).
|
| Unfortunately this means the chance of a collision in a year is
| 99.999999999998%, which isn't great!
|
| (I think they've given too many bits to the timestamp.)
|
| Also, the uuid format it creates isn't a valid uuid, which is a
| pity.
|
| If you want to have "only" a one in a billion chance of a
| collision a year, you can only generate about 300 a millisecond.
| pas wrote:
| You'd run into a bigger problem than collisions a lot sooner
| (IMHO): time ordering. "ms" precision is very misleading,
| because keeping clocks really in sync while processing this
| much data is a real challenge.
|
| Sure, maybe it doesn't matter that one server/emitter and thus
| a bunch of IDs are out of order, but if this is really just for
| helping indexes, then a simple 0.1 second precision would make
| more sense. (Or even simply truncating the timestamp to however
| many bits to get ~2-30 sec precision, and using a data store
| with LevelDB-like staged compaction. Which is virtually a
| requirement at this scale.)
| doersino wrote:
| > Unfortunately this means the chance of a collision in a year
| is 99.999999999998% which isn't great!
|
| How was this computed? My intuition begs to differ.
| jychang wrote:
| There are 31,536,000,000 milliseconds in a year. If you do
| (1-1E-9)^3.1536E10, you get a 2E-14 chance of no collision in
| a year.
| hoseja wrote:
| I think the presumption is you will not be generating 50
| million of them each millisecond.
| alpaca128 wrote:
| Which is a pretty safe assumption, considering just
| generating those 128bits 50 million times a millisecond
| results in over 20 million TB of data in a year. I think
| it's safe to say this is not a common scenario.
| onetoo wrote:
| Probability of no collision in a single millisecond: p_1 =
| 999999999/1000000000
|
| Milliseconds in a year: t = 1000 * 60 * 60 * 24 * 365
|
| Probability of no collision happening for every millisecond
| of a year: p_2 = p_1^t
|
| Probability of a collision happening for at least one
| millisecond of a year: p_3 = 1 - p_2
|
| https://www.wolframalpha.com/input/?i=1+-+%28999999999%2F100.
| ..
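|
| The same computation in Python:
|
|     p1 = 999999999 / 1000000000     # no collision in one ms
|     t = 1000 * 60 * 60 * 24 * 365   # ms in a year
|     print(1 - p1 ** t)              # ~0.99999999999998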
| mytailorisrich wrote:
| There are close to 32 billion millisecond in a year. If the
| chance of a collision happening within a millisecond is 1 in
| a billion then my intuition is that the probability of a
| collision happening at least once a year is quite high,
| indeed.
|
| It's like rolling a die: you have a 1 in 6 chance of getting a
| 6 every time you roll. Now, if you roll the die 100 times the
| chance that you'll get a 6 at least once is pretty high.
| ktpsns wrote:
| I guess you assume one timeflake is built per ms for one year.
| That's about 3e10 timeflakes. Ordinary UUIDs have collisions
| on the order of 1e18 hashes (cf. https://en.wikipedia.org/wiki/
| Universally_unique_identifier#...). So we are "only" 8 orders
| of magnitude away. For me the collision quality feels the same.
| Ellipsis753 wrote:
| If you assume one per ms then timeflake and UUIDv1 both have
| a 0% chance of collision as they encode the timestamp.
|
| You might go above this rate some of the time though and we
| might as well strive for perfection.
|
| One of the nice things about UUIDs is that you can just merge
| data together willy-nilly without worrying about collisions -
| for example, if you get bought by Google. You can't really
| do this with timeflake.
| phkahler wrote:
| >> If you assume one per ms then timeflake and UUIDv1 both
| have a 0% chance of collision as they encode the timestamp.
|
| They can be generated anywhere at any time. If two machines
| are generating even 1 per hour, they will have collisions
| if their clocks are synchronized. The random bits are
| important.
| zuzun wrote:
| Look for a different solution if you need 1 quintillion IDs per
| year. Noted.
| m45t3r wrote:
| Datomic uses a similar approach by default. They called it
| squuid: https://docs.datomic.com/on-prem/identity.html#squuids
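|
| Roughly, a squuid-style id looks like this (a sketch, not
| Datomic's actual implementation; it ignores the UUID
| version/variant bits):
|
|     import os, time
|
|     def squuid_like() -> bytes:
|         # 32-bit UNIX seconds up front, 96 random bits after.
|         return int(time.time()).to_bytes(4, "big") + os.urandom(12)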
| j-pb wrote:
| Have you considered using a 32 bit time value with second
| resolution?
|
| That way you still get sort-ability, you get 16 more bits of
| entropy, fare better on 32 bit systems, and still get 130 years
| of usage out of them.
|
| Also you actively discourage people from trying funny stuff like
| synchronisation and conflict resolution via the time component.
| Because 1ms "that sounds kinda good enough :?", while 1s "that'll
| never work!!!"
|
| Amendment:
|
| Thinking about it: using a 16-bit timestamp prefix with second
| resolution and rollover should probably provide a good tradeoff
| between improving insertion performance (most expensive in terms
| of index touching), a 112 bit random value, and a dense usage of
| the timestamp field.
|
| It also allows you to do things like "forgetting" old values,
| within an 18h timeframe.
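|
| Something like this sketch of the amendment (not an existing
| library):
|
|     import os, time
|
|     def rolling_id() -> bytes:
|         # 16-bit second counter that wraps every ~18.2 hours,
|         # followed by 112 random bits.
|         prefix = (int(time.time()) & 0xFFFF).to_bytes(2, "big")
|         return prefix + os.urandom(14)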
| Tostino wrote:
| I personally use: https://github.com/tvondra/sequential-uuids
|
| The time component resolution is configurable, and you can make
| it wrap around every so often, filling in gaps left by page
| splits, updates, deletes, etc.
| j-pb wrote:
| Nice! This is kinda what I had in mind while writing that
| amendment!
|
| Thanks for sharing :D
| Sytten wrote:
| It's just a bit annoying because it is a C extension and those
| can't be used in most managed DBs in the cloud.
| Tostino wrote:
| I had to re-write the time based function from that
| extension in PL/PGSQL earlier in the week for someone on my
| team who couldn't work with the C extension: https://gist.g
| ithub.com/Tostino/ca104bff40e704b6db70b9af4926...
|
| This can be used in RDS, etc.
| dkarl wrote:
| > Because 1ms "that sounds kinda good enough :?"
|
| Ugh. You wouldn't think so, but I caught a case of this in
| production, using millisecond timestamps as unique ids for
| browser analytics. The developer felt that even though two
| users _could_ in rare cases click at exactly the same
| millisecond, their clocks weren't likely to be in perfect sync
| with each other, and that this "additional level" of randomness
| made a collision virtually impossible.
| mywittyname wrote:
| I worked on an analytics system where we appended a random 3
| digits to the millisecond resolution timestamp for exactly
| this reason. It turns out, collisions are far, far more
| common than the math would suggest. Especially back in the
| aughts when browsers had a lot of individual quirks to deal
| with.
| Quekid5 wrote:
| Oh dear god.
|
| I suppose it can be counter-intuitive, but people really need
| to learn about timer precision... and, more importantly,
| statistics (basics, at least!).
|
| I absolutely hate the 'combinatorics' part of statistics, but
| distributions, etc. seemed pretty learnable to me... but then
| I like continuous math, so...
| j-pb wrote:
| That's not even statistics. That's simply the pigeonhole
| principle.
|
| A second has 1000ms. If you have more than 1000 users doing
| things per second, there MUST be a collision.
| Quekid5 wrote:
| Ah, but now you're discretizing! :D
|
| ... also, it technically depends on your throughput
| (something something queueing theory).
| Bogdanp wrote:
| You can also pick your own epoch and use centiseconds since
| that epoch for the time component. It's what I do in [1].
|
| [1]: https://docs.racket-
| lang.org/buid/index.html?q=buid#%28part....
| rafaelturk wrote:
| This is really nice, anyone aware of a NodeJs alternative?
| mgkimsal wrote:
| https://github.com/ulid/spec isn't timeflake, but is a good
| spec, and there are implementations in multiple languages.
| there are 2 listed for js.
|
| https://github.com/aarondcohen/id128 and
| https://github.com/ulid/javascript
| audionerd wrote:
| @thi.ng/ksuid https://github.com/thi-
| ng/umbrella/tree/develop/packages/ksu...
| sparkling wrote:
| > Roughly ordered (K-sortable), incremental timestamp in most
| significant bits enables faster indexing and less fragmentation
| on database indices (vs UUID v1/v4).
|
| What does this mean for database performance in simple CRUD
| scenarios?
| theta_d wrote:
| Using an ordered ID prevents index fragmentation. With random
| IDs, inserts will be slower and ultimately so will reads. In a
| database like MS SQL Server this is especially true, since data
| is physically stored on disk based on the clustering key, which
| is the PK by default.
| amzans wrote:
| For those interested, here's (IMO) a good explanation on why
| completely random UUIDs hurt performance, and can lead to
| database fragmentation issues down the road (on MySQL):
|
| https://www.percona.com/blog/2019/11/22/uuids-are-popular-bu...
|
| It's part of what Timeflake aims to solve, but there are plenty
| of alternatives too depending on your exact requirements.
| ForHackernews wrote:
| This is not universally true. It depends on the database engine
| and how its indexes are implemented.
| zerd wrote:
| It's pretty much true for any B-tree based index. You could
| get around it by using a combined index on another timestamp
| value, but that's going to be a very large index. Or reduce
| it by partitioning by e.g. year-month. Even hash-based
| indices would be slightly impacted due to less locality.
| amzans wrote:
| Thanks, I updated the comment to mention that the link is in
| the context of MySQL's case.
| fastball wrote:
| Is this applicable to postgres?
| calpaterson wrote:
| It's applicable to the indexes that postgres uses. Postgres
| does not put rows in the heap in order of primary key so
| the issue is less serious than in innodb.
| solicode wrote:
| I like ordered UUIDs, but if you care about 100% strict and
| accurate ordering you need to be careful about generating them in
| app code vs. on the DB side. With the former + using multiple app
| servers, server clock drift can become an issue. Which is why I
| still generate my ordered UUIDs on my Postgres server. It means
| writing some funky PL/pgSQL (or however you decide to do it... I
| think you can run Python too), but it gets the job done.
|
| I wish ordered UUIDs were built into more DBs natively. I
| remember seeing UUIDv6 being proposed, but I don't know if/when
| that will be standardised.
| httgp wrote:
| How does one generate ordered UUIDs right on the PostgreSQL
| server?
| solicode wrote:
| There's probably a nicer way (or a more performant one at
| least), but I created a function that uses the built-in
| `uuid_generate_v1mc` and then reverses the order of the time
| component so that high values come first.
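|
| The same rearrangement expressed in Python (a sketch; it's the
| swap MySQL 8's UUID_TO_BIN(uuid, 1) performs on a v1 UUID):
|
|     import uuid
|
|     def time_ordered(u: uuid.UUID) -> bytes:
|         # hex layout: time_low(8) time_mid(4) time_hi+version(4) rest(16)
|         h = u.hex
|         # Swap the time fields so the most significant timestamp
|         # bits come before the lower ones.
|         return bytes.fromhex(h[12:16] + h[8:12] + h[0:8] + h[16:])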
| Tostino wrote:
| Use this: https://github.com/tvondra/sequential-uuids
|
| If you cannot install extensions, I just wrote a PL/PGSQL
| implementation for the time-based generator I could share.
| Sytten wrote:
| Please do, I have been looking for that. Posting it in the
| repo would help a few people, I am guessing.
| Tostino wrote:
| https://gist.github.com/Tostino/ca104bff40e704b6db70b9af4
| 926...
|
| Posted it to the repo as well.
| hankchinaski wrote:
| Not quite sure why the requirement of being compatible with the
| UUID 128-bit representation. I personally try to use KSUID
| (160-bit) when possible and can't see any downside so far.
| Tostino wrote:
| Being 128-bits allows you to utilize the same database/object
| types as UUID.
|
| That's the main benefit IMO. I couldn't stick a KSUID into my
| Postgres UUID column; I'd have to write a new data type to
| store that.
| [deleted]
| chrisacky wrote:
| Can someone expand on this please.
|
| > They should be random, but roughly-ordered over time so that
| your MySQL/Postgres indices stay fast and efficient as the
| dataset grows.
|
| I'm currently facing an issue where the chosen uuid format was
| v4. From my understanding this is the least preferable because
| it's true random, which causes hell for mysql indices.
|
| What I don't quite understand however, is why is sorted random
| better if I'm using an index in both cases? I'd like to know what
| I should be doing and if there is a way to move/migrate if
| possible.
| masklinn wrote:
| > What I don't quite understand however, is why is sorted
| random better if I'm using an index in both cases?
|
| Data locality. The default index in most RDBMS is a btree which
| is ordered and clustered so when you look up items in an index
| if they sort similarly they'll be in the same index page(s) and
| thus cheaper to retrieve. For PKs that means when you go
| through your records it's cheap to get siblings.
|
| When the ids are randomly distributed it's way more expensive
| to add them because they get added at random positions (writes
| are scattered which is bad) and when you retrieve them you have
| to jump through the index at random loading a bunch of pages to
| get just one or two records from it.
|
| Things are even worse if the table itself is clustered on
| the PK, because now your scattered index is also messing with
| your actual data reads.
| 101008 wrote:
| The index advantages would be the same when you think of
| searching. But for inserting (and deleting) it is much faster
| if they are already ordered. You (or well, the DB engine)
| don't need to traverse any trees or anything. If they are 100%
| ordered, inserting new ones into that index is just appending
| to the end of a list, which is much faster.
| thristian wrote:
| As a rule of thumb, a table that grows over time probably has a
| "creation time" field, and a query that touches one row
| probably touches other rows created around the same time (like
| a weekly report). Since rows are probably stored on disk in
| chronological order, you can index the "creation time" field
| and do a good job, but if your primary key is already roughly
| chronological you may be able to get away with one fewer index
| and keep inserts/updates cheaper.
| kijin wrote:
| In MySQL and some other database systems, primary keys are
| clustered indexes. A clustered index defines the order in which
| data is physically stored on disk.
|
| If you have an auto-incrementing primary key like most sane
| people do, new rows will be sorted to the very end of the
| table. Inserting data to this table essentially becomes an
| append-only operation, which is very fast. On the other hand,
| inserting random values all over the place is like inserting
| data in the middle of a file and pushing back everything that
| comes afterward, over and over again. Even on modern SSDs,
| this is much slower than simply appending data to the end of a
| file.
|
| For best performance, your primary keys should be as close as
| possible to auto-incrementing integers. In other words, each
| new value should be greater than all the values that currently
| exist in the table. Timestamps are quite useful for this
| purpose.
| pas wrote:
| This is an InnoDB thing for MySQL. (The old MyISAM was a
| simple append-only thing with a separate index file.)
| abhishekjha wrote:
| Query : For people who recommend UUIDs as a unique identifier in
| a distributed system, does the above satisfy the need?
|
| In the SO answer[0] uuid v4 is being recommended but that is the
| most random version of uuid. Isn't that going to hurt the b-tree
| performance by having a very high fanout? How does it help the
| primary index?
|
| [0] : https://stackoverflow.com/questions/33274291/postgresql-
| uuid...
|
| EDIT : Apologies. I meant low fanout. What I meant was that you
| need something ordered to have a better organised b-tree node. If
| all values are random, each node will have a lot of children,
| increasing the search space. It might turn into a linked list in
| the worst case scenario.
| tomsmeding wrote:
| Excuse my possible ignorance, but how is more entropy in a
| primary key going to hurt index performance? The fanout in a
| b-tree is completely in control of the database, and not
| dependent at all on properties of the data -- apart, of course,
| from the _size_ of the data, since typically the node fanout is
| precisely the number of fields that fit in one disk page.
| Randomness does not influence the size of a UUID on disk.
| rapidlua wrote:
| With random keys updates are going to touch way more pages
| than with sequential ones. It increases IO due to more dirty
| buffers and journaling.
| tomsmeding wrote:
| Point taken. Quite true.
| Tostino wrote:
| More reading on this for Postgres:
| https://www.2ndquadrant.com/en/blog/sequential-uuid-
| generato...
| mytailorisrich wrote:
| > _Isn 't that going to hurt the b-tree performance by having a
| very high fanout?_
|
| High fanout helps minimise the depth of the tree and therefore
| helps performance as it minimises the number of nodes to check.
| abhishekjha wrote:
| Sorry, I meant low fanout, as in a b-tree node will end up
| having a lot of single-valued children, ultimately
| degrading to a linked list.
| mytailorisrich wrote:
| I think random binary keys are pretty good for B trees
| because they also lead to shallow depth and a balanced tree
| instead of potentially degrading to a linked list.
| StreamBright wrote:
| We moved away from UUIDs for different reasons. It is hard to
| work with UUIDs when multiple entities have them. You do not
| know if you are looking at a user or a group if both are just
| UUIDs. We created our own IDs with prefixes and using base58
| encoding for the random part, for example: user-
| AhnPpn7CbbwPykmux11LJ39Jk. It has the same amount of randomness
| in it and we can understand what we are looking at.
| globular-toast wrote:
| Why would you ever be in a situation where you have a UUID
| but don't know what it refers to?
| StreamBright wrote:
| Because somebody (not necessarily you) wrote the following
| code:
|
| printf("%s - %s", id0, id1)
|
| You can even do this:
|
| printf("user-id %s group-id %s", group-id, user-id)
|
| Do you think you will be able to quickly debug what is
| wrong in a high severity situation when a customer facing
| system is down?
|
| I do not, so we use eids (external ids) that are easy to
| identify.
| seppel wrote:
| * You stumble upon them in log files or error messages.
|
| * Somebody asks you why some API call is not working.
|
| * You send around Excel files to higher management (for
| example in incidents).
| globular-toast wrote:
| Now I'm even more baffled. What kind of stupid logger
| logs a uuid without a clue as to what kind of object it
| identifies? Why would you need to communicate uuids to
| higher management?
|
| This makes me think of Douglas Adams. You know the answer
| is 42 but you forgot what the question was. I really
| can't imagine how this could ever happen in a business.
| "Yes, sir, there was a problem with 1087." "1087 what?"
| "I don't know, but it was number 1087."
| rkalla wrote:
| You seem... fun.
| seppel wrote:
| > What kind of stupid logger logs a uuid without a clue
| as to what kind of object it identifies?
|
| The code cannot know what kind of object a uuid
| identifies, it can only assume that it identifies what it
| expects. So you might see:
|
| "Unknown person 550e8400-e29b-11d4-a716-446655440000" in
| the log files. Much more helpful for debugging, however,
| is: "Unknown person
| order-550e8400-e29b-11d4-a716-446655440000". Then you
| know somebody accidentally put an order id where a person
| id belongs.
|
| This is an easy example, in the real world you have
| persons, groups, representatives, accounts, acls, etc,
| which are easily confused. If you just have a uuid, you
| will have a hard time figuring out what it actually is.
| SigmundA wrote:
| This is why you have column headers?
|
| If you just want it to look that way in a text log, then
| sure, prepend the type to the id, wasting some space
| there, but using it throughout your db eats up tons of
| space from both text-encoding the id and repeating the
| column name in the data over and over.
| seppel wrote:
| > This is why you have column headers?
|
| I'm not sure I can follow. You get a uuid from
| somewhere, probably external code that you cannot even
| look at. The uuid comes stringly-typed, let's say as
| json. Then you have no chance to know what type it is;
| you can only assume it is the correct type that you
| expect.
|
| Just putting "550e8400-e29b-11d4-a716-446655440000" into
| a person column does not make it a person id.
|
| If you get instead
| "order-550e8400-e29b-11d4-a716-446655440000" as a typed-
| uuid, you know for sure that this is not a person id. It
| is an order id. Somebody made a mistake somewhere. You
| also know that you probably have to look in the order DB
| to debug what it actually refers to.
|
| > but to use it throughout your db eats up tons of space
| from both text encoding the id
|
| You don't need to put your uuid in this format into the
| db. You can add/remove the type before/after accessing
| the db.
| SigmundA wrote:
| Just putting "order-550e8400-e29b-11d4-a716-446655440000"
| does not make it an order id either, thats what foreign
| keys are for, mistakes can be made either way.
|
| If you're sending ids via json you should be using json to
| encode the type:
|
| { order:"550e8400-e29b-11d4-a716-446655440000" }
|
| Or if for some reason the value could contain multiple
| types:
|
| { thing: { t:"order",
| id:"550e8400-e29b-11d4-a716-446655440000" } }
|
| Even then just use:
|
| { thing: { order :"550e8400-e29b-11d4-a716-446655440000"
| } }
|
| This is redundant and unnecessary, adding overhead:
|
| { order: "order-550e8400-e29b-11d4-a716-446655440000" }
|
| Storing that in your db as json is even worse (Mongo,
| JSONB); storing it as varchar in a normal column is not
| much better.
|
| If you're not storing IDs with a type prefix in the db, then
| that's not the real form of your ids, just the UI/encoding
| for them in your app, which eliminates half the supposed
| advantage, which was easy debugging, say, in db query tools.
| It also means you are parsing / deparsing
| "order-550e8400-e29b-11d4-a716-446655440000" somewhere in
| your db mapper code instead of just using your json or query
| string parser or protobuf or whatever. Why?
| seppel wrote:
| > Just putting
| "order-550e8400-e29b-11d4-a716-446655440000" does not
| make it an order id either, thats what foreign keys are
| for, mistakes can be made either way.
|
| I'm not talking about the DB level here, I'm talking
| about what goes over the wire or ends up in log files.
| And if you are handing out the uuids to your consumers,
| you can be reasonably sure that you only get back the
| uuids you handed out. So if
| "order-550e8400-e29b-11d4-a716-446655440000" is not an
| order, it's an error on your side and not on your client's
| side.
|
| > If you're sending ids via json you should be using json to
| encode the type:
|
| Yes and you should also not make any mistakes.
| SigmundA wrote:
| So it all comes down to fewer mistakes made with:
|
| order:"order-550e8400-e29b-11d4-a716-446655440000"
|
| vs
|
| order:"550e8400-e29b-11d4-a716-446655440000"
|
| Either one will error in a sane system that checks to
| make sure the UUID is actually in the db, and either one
| can be mishandled by sending the wrong id. You just have
| some extra redundant bytes double-confirming the key type,
| while the actual ID will still have to be verified.
|
| The nice thing about UUIDs over, say, ints is that there
| should be no overlap, so it should be easy to confirm an
| error like that. Double-encoding the key from external
| APIs, sure, I guess that's a very slight reduction in the
| time taken to verify an error, but for logs aren't you
| logging the json with the key already?
|
| Of course this whole discussion was about time-ordered
| UUIDs, which are mostly useful just for DBs. If we are
| just talking about how IDs are formatted for external use
| in the app, well geez, I thought that was what query
| strings and json were for, but ok.
| StreamBright wrote:
| >> but to use it throughout your db eats up tons of space
|
| I think human resources and outages cost a lot more than
| disk space these days. If I can trade disk space for
| either of those I will do it in the blink of an eye.
| SigmundA wrote:
| Not just disk space; a larger index size means less fits in
| caches and memory, more I/O, etc.
|
| These are keys which are typically used constantly by an
| app in the db with joins etc. They are some of the
| hottest data typically and you want to avoid cache misses
| on them if possible.
| Hnsuz wrote:
| Exactly.
| kube-system wrote:
| I can think of a lot of reasons why it might happen.
|
| Maybe you have a process that accepts multiple types of
| objects. Maybe you think you are passing in the correct
| type of object but you are not. Maybe the person
| reporting the error to you omitted information. Maybe the
| person who wrote the application didn't log enough
| context. Yes, all of these situations are not ideal, but
| if the process was ideal you wouldn't be debugging it in
| the first place.
|
| > I really can't imagine how this could ever happen in a
| business. "Yes, sir, there was a problem with 1087."
| "1087 what?" "I don't know, but it was number 1087."
|
| This happens _all_ the time in many businesses. Users
| rarely generate perfect error reports, and it's not
| uncommon for intermediaries to incorrectly communicate
| details either. What developer hasn't seen an error
| report that consists of a vague screenshot with "doesn't
| work, got this error" as the description?
| globular-toast wrote:
| > Users rarely generate perfect error reports, and it's
| not uncommon for intermediaries to incorrectly
| communicate details either. What developer hasn't seen an
| error report that consists of a screenshot with "doesn't
| work, got this error" as the description?
|
| And the solution is to have them repeat a UUID to you? I
| don't think so...
| kube-system wrote:
| Something can be not a solution to a problem, but
| contribute to making your life easier. This is that.
|
| There are certainly applications that have type-ambiguous
| IDs which work just fine too. Not all engineers make the
| same decisions; that's ok.
| jamescampbell wrote:
| I am with you. This makes no sense to me. But I also have
| wrapped userdata or account data in a generated hash to
| quickly have access to the underlying account info /
| expiration info etc. I think it is easier to justify why
| to implement vs. why not to implement.
| SigmundA wrote:
| Looks like it's almost twice the size of a UUID at 30 bytes in
| this case with a short prefix; I'm assuming you're storing this
| as an ASCII string? That's a lot of extra bytes for some
| convenience when viewing logs or raw data etc.
|
| UUIDs stored properly in binary form are only 16 bytes, which
| can still be a concern vs an 8 or 4 byte integer when you
| have a lot of them all over the place in a large db (larger
| index, more memory use).
| StreamBright wrote:
| You can store UUIDs as bytes if you wish, but when you are
| looking at one in a human readable form it will still be a
| UUID. I think storing a UUID as text in 2021 is not that
| hard.
|
| >> UUID's stored properly in binary form are only 16 bytes
|
| Depending on what is your use case.
| SigmundA wrote:
| >> You can store UUIDs as bytes if you wish, but when you
| are looking at one in a human readable form it will still
| be a UUID. I think storing a UUID as text in 2021 is not
| that hard.
|
| You're not going to store a text type prefix + UUID in a
| 16-byte UUID column; you will need a variable length text
| column. All your DB tools will convert a UUID to human
| readable form, plus show the column name, so there's no
| need to store it in the data for every value.
|
| If you're looking at network traffic or debugging, the id
| will hopefully be keyed (json, query string, protobuf,
| etc); if you're looking at plain text logs with no schema,
| sure, emit the type-UUID for readability, in your logging
| code...
|
| >>Depending on what is your use case.
|
| The only use case I can think of is unstructured logging, or
| maybe using a file name as a key? I guess if your db is
| just key-value with string keys, maybe (basically a file
| system).
| DJBunnies wrote:
| UUID v5 is namespaced.
| skissane wrote:
| But given a v3 or v5 UUID, you can't tell what namespace it
| belongs to. Both are one way functions from
| (namespace, name) -> uuid. The GP is asking for a way to
| tell a user uuid apart from a group uuid just by looking at
| the uuid.
|
| That is actually possible with the v2 uuids used with DCE.
| But, since just about nobody uses DCE any more, just about
| nobody uses v2 uuids either
| pbronez wrote:
| What is DCE?
| skissane wrote:
| https://en.m.wikipedia.org/wiki/Distributed_Computing_Env
| iro...
| [deleted]
| Dylan16807 wrote:
| Base62? That seems strictly worse than base64 to me. Why that
| choice?
| cuspycode wrote:
| 62 is the number of letters and digits in ASCII, so it's the
| most compact encoding that doesn't need special punctuation
| characters (which aren't always safe to use).
| Dylan16807 wrote:
| How many situations don't let you use _ and -? URLs certainly
| do.
| zamalek wrote:
| The two (or three if you retain padding) additional characters
| are symbols and can cause issues in many places. Base64-url and
| other variants exist for this reason, but there is no single
| base64 that works everywhere.
| Dylan16807 wrote:
| There's no "single base62 that works everywhere" either. So
| while there's valid criticism to be had of base64, I still
| don't see how this choice is _better_.
| sltkr wrote:
| This isn't strictly speaking a UUID. It's just a 128-bit
| identifier. It's "compatible with UUID" only in the sense that
| both are 128 bits wide, but this scheme doesn't have the
| structure of a UUID.
|
| To summarize the scheme, a 128-bit id is generated by
| concatenating:
|
| - 48-bit timestamp (milliseconds since UNIX epoch)
| - 80 bits of random data
|
| These ids have the property that they're ordered by generation
| time (ids generated at the same millisecond are ordered
| randomly). The result can then be encoded in base 16 or base 62
| to make it URL-compatible.
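|
| A minimal sketch of that scheme (not the actual Timeflake code):
|
|     import os, time
|
|     ALPHABET = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
|                 "abcdefghijklmnopqrstuvwxyz")
|
|     def make_id() -> str:
|         ts = int(time.time() * 1000)  # 48-bit ms timestamp
|         n = (ts << 80) | int.from_bytes(os.urandom(10), "big")
|         out = ""
|         while n:  # base62-encode
|             n, r = divmod(n, 62)
|             out = ALPHABET[r] + out
|         # Fixed width keeps string order equal to numeric order.
|         return out.rjust(22, "0")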
|
| By the way, encoding timestamps into an id is not always
| desirable. It has some privacy implications, since now you can
| tell from the id when the resource was created. If this
| corresponds with user activity (e.g. a user filling out a form or
| uploading an image) having this information encoded in a URL is
| possibly undesirable.
| skissane wrote:
| You are talking about a v1 UUID.
|
| A random number is a valid v4 UUID, provided you set the
| version and variant bits to the correct value. A random 128-bit
| number won't have those bits correctly set (except by chance).
| Set them, and you've made a valid v4 UUID
| sltkr wrote:
| As you say, you need to set certain bits to special values to
| encode a v4 UUID, leaving less than 128 bits for the payload,
| which means a random 128 bit number is not a valid UUID
| except by chance.
|
| Second, a v4 UUID should set all remaining bits randomly.
| Encoding a timestamp in some of the bits is not random. Sure,
| you could argue these could have been randomly generated by
| chance, so any such UUID is not invalid, strictly speaking.
| Still, a collection of these v4 UUIDs don't have the
| properties that you would expect of random UUIDs.
|
| As specified, the scheme satisfies neither of these two
| requirements.
| withinboredom wrote:
| There's always the expired v6 proposal:
| https://datatracker.ietf.org/doc/draft-peabody-dispatch-
| new-...
| liversage wrote:
| Exactly. And the v4 UUID will have 122 bits of randomness.
| sudhirj wrote:
| There's a spec called ULID that's pretty much this with default
| base32 encoding
|
| https://github.com/ulid/spec
|
| I've also worked on a UUID-ULID bridge for Go
|
| https://github.com/sudhirj/uulid.go
|
| And seeing as this is just 128 bits it's quite easy to move
| seamlessly between formats and representations.
|
| I've found this concept especially useful in nosql stores like
| DynamoDB, where using a ULID primary key makes objects time
| sortable automatically. It's also quite easy to query for items
| by zeroing out the random component and setting only the
| timestamp bytes.
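|
| For instance, the range trick looks roughly like this (a sketch;
| the names are made up):
|
|     def key_range(start_ms: int, end_ms: int) -> tuple:
|         # Zero the 80 random bits for the lower bound and set them
|         # all for the upper bound; a key scan between the two
|         # covers exactly that time window.
|         lo = start_ms.to_bytes(6, "big") + b"\x00" * 10
|         hi = end_ms.to_bytes(6, "big") + b"\xff" * 10
|         return lo, hi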
| mfollert wrote:
| This is super useful and exactly what I need for a current
| project. Thank you for that hint!
| j-pb wrote:
| ULID has a serious security flaw though. In order to try to
| improve performance / increase prefix similarity, there is a
| mode where it reuses the random value and just increments it
| for each new ULID.
|
| With purely random UUIDs, you can be pretty sure that nobody can
| guess them, so they're essentially identifier and access token
| in one.
|
| However once you can easily predict the next uuid from the
| previous one (as by incrementing the random part by one), the
| access token property vanishes.
|
| At least this proposal doesn't make the same mistake; however,
| 80 bits isn't THAT much entropy. (Remember every bit counts
| with exponential growth, and 48 bits less means
| 281474976710656 times easier to guess.)
|
| So it boils down to: would you be ok securing all your data
| with 10-character one-time passwords, which might not be
| guessed all at once, but where guessing SOME is sufficient to
| get owned?
|
| Death by a thousand paper-cuts.
| sudhirj wrote:
| Which implementation are you referring to? The Go package
| uses crypto/rand and generation blocks if the system can't
| provide enough randomness. It's possible to override this
| with custom implementations, but either way it's no less
| secure than a UUID unless the implementation is horrible.
|
| The spec doesn't specify a source of randomness - an
| implementation that uses https://xkcd.com/221/ will of
| course not be a very good one.
| sudhirj wrote:
| If you're referring to the monotonic counter
| implementation, then yes, that's a documented choice you
| can make if you explicitly want strictly increasing ULIDs
| on the same millisecond, but given that you can't ensure
| this across servers anyway, it's more an edge case.
| j-pb wrote:
| Yes, monotonic counters, are ULID's "billion dollar
| mistake".
|
| They're not an edge case, in that the standard displays
| them quite prominently. They're actually the only code
| examples given.
|
| The standard should deprecate them, because people WILL
| use them "wrong".
| sudhirj wrote:
| Dunno about that. The implementations I've seen so far
| default to /dev/random and make you jump through hoops to
| get the monotonic counter, so if you manage to enable it
| they should assume it's what you want. I've actually used
| them effectively in a couple of NoSQL problems where I
| didn't need cryptographic security, just distributed
| timestamps - I was creating two sequential events (create
| and update), and they had to always order one after
| another, even though I created them in the same loop and
| millisecond.
| amzans wrote:
| Thanks for the comment!
|
| Yes, I'd like to clarify that Timeflake (and many of the
| suggested alternatives) is NOT meant to be used for security.
|
| There's an important note on the readme about this:
|
| > Since the timestamp part is predictable, the search space
| within any given millisecond is 2^80 random numbers, which is
| meant to avoid collisions, not to secure or hide information.
| You should not be using timeflakes for password-reset tokens,
| API keys or for anything which is security sensitive. There
| are better libraries which are meant for this use case (for
| example, the standard library's secrets module).
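|
| For example, a password-reset token would come from the CSPRNG
| instead:
|
|     import secrets
|
|     reset_token = secrets.token_urlsafe(32)  # 256 random bits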
| j-pb wrote:
| You have to ask yourself the question though. If "URL-
| Safety" is a main feature. What URLs / user-data are NOT
| security relevant?
|
| Leaking users' private pictures / documents / whatever will
| also kill a company.
|
| I guess one could maybe circumvent this by using a hash of
| the ID, but then you have to store that somewhere too, and
| you're back to square one.
|
| Nevertheless, I like that you fixed the main flaw with
| ULID. Also you provide way better insight into the
| tradeoffs, so kudos!
| remcob wrote:
| > using a hash of the ID
|
| Hashing the IDs won't solve their lack of entropy. Crude
| example: If you hash your pincode I still have only 10^4
| values to try.
|
| The easiest way to fix this is to add an access token
| column that is cryptographically random and use both the
| ID and the token in the URL.
|
| If you trust the 80 bits already in the ID the token only
| needs to be 80 bits for a total of 160 bits of entropy.
| But if you do that you have to make sure that a missing
| ID and an invalid token are handled identically from the
| attacker's perspective (same reply, same timing).
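|
| For instance, the token check could use a constant-time
| comparison (hypothetical names):
|
|     import hmac
|
|     def token_ok(stored: str, supplied: str) -> bool:
|         # compare_digest takes time independent of where the
|         # first mismatch occurs, avoiding a timing side channel.
|         return hmac.compare_digest(stored, supplied)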
| j-pb wrote:
| > Hashing the IDs won't solve their lack of entropy.
|
| Yes and no. It would still significantly enlarge the
| search space, as an attacker wouldn't get a point of
| reference to latch on to.
| spiffytech wrote:
| I adopted ULID for a project after seeing it recently mentioned
| in another HN comments thread[0] and it's very useful. The
| project uses Azure Table Storage, which encourages querying by
| ranges of record IDs for efficiency, instead of filtering
| datasets by arbitrary record properties.
|
| My project revolves around processing windows of time in
| timestamp-ordered datasets, and ULID works well as a way to
| generate record IDs in a distributed system that naturally sort
| in timestamp order. With ULID, all of my major queries can
| operate on ranges of record IDs, as Azure Table Storage
| encourages.
|
| [0] https://news.ycombinator.com/item?id=25650907
| mgkimsal wrote:
| I've been using ULID on a couple projects. In one case, we're
| trying to convert an existing project to use them, and... it's
| a bit of a pain, but it's not ULID's fault as much as just...
| changing primary keys all over the place is time consuming and
| not simple.
|
| The biggest pushback (and it's not a terribly big one, imo)
| I've read on ULID is that the time portion is sold as
| 'sortable' but is 'only ms-level resolution'. Apparently the
| critics I was initially reading on ULID needed sub millisecond
| accuracy, and somehow thought this made ULIDs bad. Seemed a
| non-useful criticism, as needing that level of time granularity
| isn't a day to day issue for most people, but might be
| something to be aware of if the idea of overloading one field
| with multiple uses is a requirement you have.
| techdragon wrote:
| If you're an edge case you're an edge case. It's frustrating
| when people start assuming their use case is normal without
| any consideration of the wider world.
|
| Sub millisecond precision sorting of high speed database
| inserts is hardly a normal requirement. At that point even
| the size of your transaction can matter (depending on the
| database) and you have to start asking questions like do I
| want ordering by the time the transaction starts or the time
| it ends? And if you're loading existing data from other sources
| and care about date ordering, you should just be using a
| dedicated column with the appropriate database time/datetime/
| timestamp type anyway.
| calpaterson wrote:
| One of the issues this is trying to resolve is that v4 UUIDs'
| wholly random nature makes them less than brilliant in the
| b-trees that SQL databases use for indices (I am actually not
| sure how much that really matters in practice for most projects).
|
| However, Postgres has an alternate index implementation that uses
| a hash function. It is crash-safe now (previously wasn't) but you
| still can't use it to enforce a unique constraint. If you can use
| that, I suspect it might resolve the problem and would be easier
| than switching UUID generation algorithm.
| withinboredom wrote:
| It's a shame v6 uuids didn't become a thing[1]. It's the entire
| reason they were invented.
|
| 1: https://datatracker.ietf.org/doc/draft-peabody-dispatch-
| new-...
| techdragon wrote:
| The biggest thing I can highlight after having used Standard
| UUIDs and heavily using ULID values including building my own
| library and testing compatibility and behavioural compliance of
| the library with other database variants than the big 4 (MySQL,
| MS SQL, SQLite and PostgreSQL) ... is that _no PK format should
| ever be considered suitable for all use cases_; it's all about
| trade-offs.
|
| If you care about the order of something you shouldn't expect any
| PK to handle that alone, even auto-incremented integers will
| start to fail you once you have to scale your write workload
| horizontally. If there's a property of your data that you want to
| be able to order it by then you just need to make a column of the
| appropriate data type and store it. If I had a nickel for every
| time someone has looked at a schema and said something like "Do we
| really need this date for ordering? The table is already sorted
| by insert order..." I'd have a lot more nickels than I want.
| varispeed wrote:
| This is assuming that time is linearly increasing, but what
| if you have a service running on a fast-moving object between
| planets, or in an alternative dimension where time is not linear
| or not monotonic?
___________________________________________________________________
(page generated 2021-01-22 23:01 UTC)