[HN Gopher] GUIDs Are Not the Only Answer
___________________________________________________________________
GUIDs Are Not the Only Answer
Author : ublaze
Score : 47 points
Date : 2021-01-05 20:56 UTC (1 days ago)
(HTM) web link (www.softwareatscale.dev)
(TXT) w3m dump (www.softwareatscale.dev)
| gnulinux wrote:
| Why not just `f"task-{uuid.uuid4()}"`?
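|
| A minimal Python sketch of that idea. The `make_id` and `parse_id`
| helpers are hypothetical names, not from the article; the point is
| that the prefix is presentation only and the UUID can still be
| recovered.

```python
import uuid
from typing import Tuple


def make_id(prefix: str) -> str:
    """Return a prefixed identifier such as 'task-<uuid4>' (hypothetical helper)."""
    return f"{prefix}-{uuid.uuid4()}"


def parse_id(value: str) -> Tuple[str, uuid.UUID]:
    """Split a prefixed identifier back into its prefix and UUID."""
    # The canonical UUID text is always 36 characters, so slice from the right
    # instead of splitting on '-', which also appears inside the UUID.
    prefix, raw = value[:-37], value[-36:]
    return prefix, uuid.UUID(raw)


task_id = make_id("task")
prefix, parsed = parse_id(task_id)
print(task_id, prefix, parsed.version)
```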
| rualca wrote:
| I agree, the author is mixing up presentation with the data
| type itself, and then blaming the data type for problems caused
| by the way he chose to present the data.
|
| If the author's problem is not being able to grep logs easily
| with UUIDs because they lack context info, then the solution to
| his problem is to output logs that are searchable by providing
| additional context.
|
| His arguments regarding searchability are also specious. Take
| for example
|
| > For example, a task ID would look like: `job-123-task-1`.
| This also helps in ad-hoc database queries to find relevant
| rows without complex JOINs.
|
| Well, if he's already running SQL queries, then running an exact
| match on two attributes is far easier, more elegant, and less
| error-prone than running a string pattern search on a derived
| attribute.
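|
| A small sketch of the difference using Python's built-in sqlite3.
| The `tasks` table and its columns are made up for illustration; the
| pattern search on the derived `job-123-task-1` style string also
| picks up job 1234, while the two-column match does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (job_id INTEGER, task_no INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [(123, 1, "resize"), (123, 2, "upload"), (1234, 1, "other job")],
)

# Exact match on two attributes: unambiguous and index-friendly.
exact = conn.execute(
    "SELECT name FROM tasks WHERE job_id = ? AND task_no = ?", (123, 1)
).fetchall()

# Pattern search on a derived string ID: 'job-123%' also matches job 1234.
derived = conn.execute(
    "SELECT name FROM tasks "
    "WHERE 'job-' || job_id || '-task-' || task_no LIKE 'job-123%' "
    "ORDER BY job_id, task_no"
).fetchall()

print(exact)
print(derived)
```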
| haimez wrote:
| Read the article, but also: sortability.
| withinboredom wrote:
| GUID v6 is pretty nice when you need monotonically increasing
| numbers that are globally unique.
|
| On another note, I worked somewhere that prefixed GUIDs with the
| environment the app was running on. All of production, staging,
| even dev machines all used the same connection string.
|
| There was even a stored procedure to copy user accounts, etc.
| from prod to your machine. It was hands-down the best debugging
| experience when a customer had an issue.
| asidiali wrote:
| That sounds very scary, but also, extremely freeing.
|
| No more permissions to deal with! But also...no more
| permissions to deal with...
| vsareto wrote:
| > Poorly formatted log statements/errors can become harder to
| debug
|
| > UUIDs often need context to aid debugging.
|
| The GUID vs Int problem is incidental to the real problem: poorly
| formatted logs. An integer or other key type without context is
| no more helpful.
|
| >At the very least, identifiers should not be allowed to float
| freely as strings or integers in order to prevent a class of
| inconsistency bugs.
|
| I'm not familiar with the ergonomics of GUIDs across all
| languages, but C#/MSSQL makes them pretty easy to handle when
| they have been chosen as keys.
|
| So the answer, as far as ergonomics go, is not settled; it
| depends on your stack.
| rualca wrote:
| > The GUID vs Int problem is incidental to the real problem:
| poorly formatted logs.
|
| This.
|
| The author's arguments boil down to "UUIDs are bad because the
| way I generate my logs is an unusable mess".
|
| The ints vs UUIDs problem is also specious because a UUID is
| nothing more than a 16 byte int that's generated in a clever
| way. It might typically be presented in a particular string
| representation but the text string is not the value of the
| UUID.
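|
| This is easy to see with Python's standard `uuid` module. The
| sample value below is arbitrary; the sketch just shows the same
| 128-bit number in integer, byte, and text form.

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")

# The same 128-bit value in three presentations.
as_int = u.int
as_bytes = u.bytes
as_text = str(u)

print(len(as_bytes))  # 16 bytes
print(as_int.bit_length() <= 128)

# Round-trip: the hyphenated text is only one rendering of the integer.
assert uuid.UUID(int=as_int) == u
assert uuid.UUID(bytes=as_bytes) == u
assert uuid.UUID(as_text.replace("-", "")) == u
```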
| dehrmann wrote:
| Yeah, IDs with a sufficiently large dataset will be
| unintelligible no matter what you do.
| ivan_ah wrote:
| The Python package `shortuuid` makes working with UUIDs a little
| easier by encoding them as strings:
| https://github.com/skorokithakis/shortuuid#usage (uses base-57
| encoding, with alphabet consisting of A-Za-z0-9 with potentially
| confusing symbols skipped)
|
| The string representation is what you show to users, but under
| the hood it's still a UUID and compatible/interoperable with any
| other system that needs UUID-shaped identifiers.
|
| The coolest part is you can even truncate the string encodings to
| get shorter IDs, which correspond to UUIDs with lots of leading
| zeros.
| rualca wrote:
| > The Python package `shortuuid` makes working with UUIDs a
| little easier by encoding them as strings:
|
| This. In my opinion the only issue with UUIDs is that their
| standard textual representation can be very verbose and
| cumbersome in some scenarios.
|
| A single ID takes up around 40 characters in a line. That's
| half the width of a default terminal.
|
| Thus the solution is obviously to use other textual
| representations. Instead of base16 then let's ramp up the base
| to shorten up the text dump, and while we are at it let's pick
| a readable format.
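|
| A rough standard-library sketch of "ramping up the base". The
| 57-character alphabet is an assumption chosen to skip look-alike
| symbols, and `encode`/`decode` are hypothetical helpers; this is
| not presented as how any particular package does it.

```python
import uuid

# Assumed alphabet: digits and letters minus look-alikes (0, 1, I, O, l).
# It is an illustration, not necessarily what any particular library uses.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 57


def encode(u: uuid.UUID) -> str:
    """Render the UUID's 128-bit integer in base 57, most significant digit first."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]


def decode(text: str) -> uuid.UUID:
    """Inverse of encode()."""
    n = 0
    for ch in text:
        n = n * BASE + ALPHABET.index(ch)
    return uuid.UUID(int=n)


u = uuid.uuid4()
short = encode(u)
print(len(str(u)), len(short))
```

| The canonical text is 36 characters; in base 57 the same value
| needs at most 22.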
| gorgoiler wrote:
| I've always wondered: what is the history behind hyphens in
| UUIDs?
| mikequinlan wrote:
| https://en.wikipedia.org/wiki/Universally_unique_identifier#...
| winrid wrote:
| This is one thing I like about type systems where you can declare
| a primitive type as MyImportantThing. This ensures the string or
| what have you is explicitly defined as MyImportantThing. Rust
| does this pretty well. C/C++ AFAIK will let you pass in the raw
| string, and so will Java if you "extend String".
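|
| For a rough feel of the idea in Python, which has no such static
| guarantee by default, here is a sketch with frozen dataclasses.
| `TaskId`, `JobId` and `cancel_task` are hypothetical names; a
| static checker such as mypy would flag the mix-up earlier, this
| version only catches it at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskId:
    value: str


@dataclass(frozen=True)
class JobId:
    value: str


def cancel_task(task_id: TaskId) -> str:
    # Reject anything that is not a TaskId, even if it wraps the same string.
    if not isinstance(task_id, TaskId):
        raise TypeError(f"expected TaskId, got {type(task_id).__name__}")
    return f"cancelled {task_id.value}"


print(cancel_task(TaskId("abc")))
print(TaskId("abc") == JobId("abc"))  # different wrapper types do not compare equal
```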
| yrimaxi wrote:
| What does it mean to "extend String" in Java? You can't, of
| course, literally do that.
| winrid wrote:
| Sorry bad example. Forgot it's final. :)
| bob1029 wrote:
| The best answer is the humble integer. The only reasonable
| arguments I have ever seen against using integer keys universally
| are as follows:
|
| #1 Integer keys have finite range.
|
| #2 Integer keys betray the identity of other sensitive resources
| when exposed as a public identity.
|
| #3 Integer keys are "difficult" to sequence in the face of
| multiple networked participants.
|
| My resolutions and counter-arguments are as follows:
|
| For many systems, #1 is not a concern, because the number of
| expected entities is well-bounded by a 64 bit integer. For
| others, #1 can be resolved by usage of more complex types such as
| BigInteger (C#). If utilized carefully, these can be treated just
| like normal integers, and quickly converted to/from byte arrays
| of appropriate length to satisfy the required range. In virtually
| all SQL implementations, blob columns containing these values can
| be indexed with the exact same semantics as with a 64-bit integer
| column. Whether this performs better or worse than GUID keys
| probably depends on if you can provoke a >120 bit BigInteger
| representation. This is quite unlikely, even for Google.
|
| #2 is trivially solved by simply applying encryption to sensitive
| keys as they traverse the boundary between your system and the
| outside world. AES256 would do the trick here. You could also
| generate entirely separate keys of any appropriate type for
| public consumption (i.e. maybe some YT-style identifier format).
|
| #3 is solved by anticipating the maximum possible # of nodes in
| your system, and then producing a key space in which identities
| are sharded out by a simple constant factor of that max quantity.
| This would certainly produce concern regarding all of the skipped
| identities (assuming you start with a small number of hosts on
| day 1), but the proposed resolution above for #1 (BigInteger)
| alleviates these concerns with a practically infinite range of
| keys. Skipping 10k identities is a non-event when you have all of
| infinity to pull from.
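|
| A sketch of the sharding scheme from #3, assuming a hypothetical
| cap of 1000 nodes. Each node hands out IDs congruent to its own
| node number modulo that cap, so no coordination is needed; the
| names are illustrative only.

```python
import itertools

MAX_NODES = 1000  # assumed upper bound on nodes, fixed on day 1


def id_generator(node_id: int):
    """Yield IDs congruent to node_id modulo MAX_NODES."""
    if not 0 <= node_id < MAX_NODES:
        raise ValueError("node_id out of range")
    for seq in itertools.count():
        # Python ints are arbitrary precision, so the range never runs out.
        yield seq * MAX_NODES + node_id


node_a = id_generator(0)
node_b = id_generator(7)
ids_a = [next(node_a) for _ in range(3)]
ids_b = [next(node_b) for _ in range(3)]
print(ids_a, ids_b)
```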
|
| There are also other considerations with this. GUID keys are a
| pain to communicate. Integers, even of massive range, are easy
| for most humans to communicate verbally when appropriate digit
| grouping and other reasonable measures are undertaken.
|
| Also consider a situation in which you decide to use 1 global
| integer range to key every single entity in your system. This
| allows for interesting database structures in which foreign keys
| are all referring to the same keyspace, so the specific type of a
| thing is no longer a hard constraint in a relational sense. Some
| would probably take substantial offense to this proposal, but I
| have found in many cases this allows for powerful optimizations.
| Anything can be used irresponsibly.
| Thiez wrote:
| > Integers, even of massive range, are easy for most humans to
| communicate verbally when appropriate digit grouping and other
| reasonable measures are undertaken.
|
| Presumably this is after not taking your own advice of
| performing AES256 encryption on the key before sending it to
| the user?
|
| Your three easy steps together seem a lot more complex than
| just using guids.
| bob1029 wrote:
| Not all communications around keys are of a sensitive nature.
| Virtually daily, we will communicate keys internally in order
| to clarify details regarding some activity. Applying AES to
| keys internally would only serve to hinder our operations &
| support efforts.
| mdtusz wrote:
| How often are you verbally communicating primary keys/id's
| internally within an organization?
|
| I've never understood arguments against guid's, except for
| table performance with databases that might take a
| performance hit because of either inserts, or because data
| won't necessarily be partitioned in the "correct" order
| (e.g. MySQL writes to buffer pools).
| rualca wrote:
| > The best answer is the humble integer. (...)
|
| ...and then you proceed by suggesting ID code generation
| methods that replicate UUIDs.
|
| It sounds like you are trying to criticize UUIDs while being
| totally unaware of what UUIDs are.
|
| I mean, UUIDs are nothing more than integers that are generated
| following one of half a dozen different methods so that they
| can be probably unique without relying on a central authority.
|
| One of the UUID versions is a pure random number. Another is a
| MAC address concatenated with a timestamp complemented with a
| local counter. There are literally two UUID methods that
| consist of hashing data.
|
| Your supposed solutions to the alleged problem posed by UUIDs
| are a reinvention of what UUIDs have always been. But UUIDs have
| been thought through and are standard and ubiquitous.
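|
| For reference, Python's standard `uuid` module exposes four of
| these generation methods directly. The namespace and name used
| below are arbitrary sample inputs.

```python
import uuid

v1 = uuid.uuid1()  # timestamp + clock sequence + node (usually the MAC address)
v3 = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")  # MD5 hash of namespace + name
v4 = uuid.uuid4()  # random
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")  # SHA-1 hash of namespace + name

print([u.version for u in (v1, v3, v4, v5)])

# The hash-based versions are deterministic: same inputs, same UUID.
assert v5 == uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
# Every version is still just a 128-bit integer underneath.
assert all(0 <= u.int < 2**128 for u in (v1, v3, v4, v5))
```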
| hermanradtke wrote:
| I have stopped using UUID and GUID in favor of
| https://github.com/ulid/spec
| WorldMaker wrote:
| The article mentions KSUID which has a very similar spec, but
| with a wider random segment.
|
| For comparison: KSUID is 160 bits wide, whereas ULID sticks to
| 128 bits (which is somewhat more compatible with UUID-like
| database column types, for instance, so long as they don't try
| things like UUID version checks). KSUID uses Base62 encoding,
| whereas ULID uses Base32 encoding, which is much simpler to sort
| lexicographically.
|
| (I've found ULID useful in some of my own projects.)
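|
| A simplified sketch of a ULID-style layout: a 48-bit millisecond
| timestamp followed by 80 random bits, rendered as 26 Crockford
| Base32 characters. `new_sortable_id` is a hypothetical helper and
| skips parts of the spec (such as monotonicity within the same
| millisecond), so treat it as an illustration only.

```python
import os
import time
from typing import Optional

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford Base32 alphabet


def new_sortable_id(timestamp_ms: Optional[int] = None) -> str:
    """Build a 128-bit ID: 48-bit millisecond timestamp, then 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    # 26 characters of 5 bits each cover 130 bits, enough for 128.
    chars = [CROCKFORD[(value >> shift) & 31] for shift in range(125, -1, -5)]
    return "".join(chars)


earlier = new_sortable_id(1_600_000_000_000)
later = new_sortable_id(1_600_000_000_001)
print(earlier, later, earlier < later)
```

| Because the alphabet is in ascending ASCII order and the timestamp
| occupies the leading characters, plain string comparison orders
| IDs by creation time.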
| BillinghamJ wrote:
| I would strongly say strings are actually the _only_ type you
| should use for IDs. Prevents the vast majority of buggy client
| behaviour and gives you good flexibility to change how you do
| things over time.
|
| ---
|
| My company ended up with a simple KSUID implementation of our own
| - https://www.cuvva.com/product-updates/showing-off-our-fancy-...
| (having originally used UUIDs and Mongo ObjectIDs)
|
| For us, a big part of it was usability with cursor selection etc
| - in addition to it being immediately obvious what the ID was
| for.
|
| Once we finally had that rolled out everywhere, we ended up
| collecting up every other ID we'd ever used and mapped it to its
| KSUID resource equivalent, so now all our IDs work standalone
| without type/context info, even across environments (and
| thankfully we'd never had any collisions on the old IDs)
|
| ---
|
| Going back to the typing - the most difficult part of migrating
| our IDs actually was converting them all to string types. With
| Postgres this is a little slow but ultimately fine, but with
| Mongo you have to actually remove and reinsert every document -
| you cannot (or at least could not) update IDs in place.
| magicalhippo wrote:
| > I would strongly say strings are actually the _only_ type you
| should use for IDs.
|
| Just got that in the face at work. In a large custom
| integration we read orders, they have unique order numbers,
| nice 5 digit things. So in the database, they became integer
| primary keys, with lots of child tables.
|
| Fast forward 5 years, customer switches ERP system and they ask
| "hey, the order numbers, you do support 10+ digits right?"
|
| Changing the database is relatively easy, changing the code is
| a chore, but worst part will be going over all the queries and
| their parameters, especially in the reports.
|
| So yeah, lesson learned.
| fbi-director wrote:
| I have an honest question. Why would anybody, ever, make an
| order number that only has 5 digits? Even if it's just a
| home-hobby project, the cost of changing to 7-10 digits is so
| small and negligible that I can't see any reason for choosing
| anything lower. Like the 640k that was once "enough for the
| long term future" in DOS. I understand that hindsight is
| 20/20, but can't wrap my head around not starting bigger when
| there's (nearly) no extra cost.
|
| Could you shed some light on that?
| jstarfish wrote:
| Back in the day, we were taught to use database constraints
| to validate user inputs.
|
| Memory, storage and compute were also more limited, so
| there _was_ an extra cost to over-spec.
| dehrmann wrote:
| Since 10 digits puts us in billions, you're really asking
| "why would someone do 'CREATE TABLE ... id INT
| AUTO_INCREMENT' when they could use a BIGINT?" _These days_,
| there's rarely a reason not to use a BIGINT, but I also
| have a little trouble faulting someone for thinking 2B
| would be enough when they're currently at 10k.
| magicalhippo wrote:
| > Could you shed some light on that?
|
| First off, I had a brain lapse: their order numbers were 6
| digits. I'm not entirely sure what the max was in their
| system; that was just where they were at in the
| series.
|
| And they didn't have that many orders per year, less than
| 10k, as one order could be for say five containers of
| goods. So it was not like they'd exceed 9 digits in the
| foreseeable future.
|
| Or so we thought...
| dehrmann wrote:
| Strings have their own issues. One place I worked had a bug
| where users could take over accounts because of
| missing/inconsistent unicode canonicalization. Case can be a
| problem, as can special characters.
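|
| As an illustration of that class of bug (not the actual incident),
| two visually identical usernames can be different strings until
| they are normalized. `canonical_username` is a hypothetical
| helper, and NFKC plus casefolding is one possible policy, not the
| only one.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Normalize to NFKC and casefold so visually identical names compare equal."""
    return unicodedata.normalize("NFKC", name).casefold()


composed = "Jos\u00e9"      # 'é' as a single code point
decomposed = "Jose\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)  # False: different code point sequences
print(canonical_username(composed) == canonical_username(decomposed))  # True
```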
|
| There's something to be said for strings, though. Prefixed IDs
| that mark the type can be nice to work with when it's an
| otherwise opaque ID, but they're a pain to handle internally.
|
| If you're storing third-party IDs, you probably want
| strings...unless their clever JSON API returns 1.3E6 as a
| number. Or the string "null"; that's always an adventure.
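|
| A sketch of the JSON pitfall with Python's standard `json` module.
| The payload is invented; `parse_float` and `parse_int` are real
| `json.loads` hooks, used here to keep the numeric text exactly as
| the third party sent it.

```python
import json

payload = '{"id": 1.3E6, "other_id": 12345678901234567890}'

# Default parsing turns the first ID into a float.
naive = json.loads(payload)

# Keeping the raw numeric text avoids reinterpreting a third-party ID.
raw = json.loads(payload, parse_float=str, parse_int=str)

print(naive["id"])
print(raw["id"], raw["other_id"])
```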
| mamcx wrote:
| I tried many things for making sync work across devices. I tried
| GUIDs, and partitioning ranges of ints, and several versions of
| it.
|
| But what worked amazing?
|
| Use NATURAL keys (or their hash) + version field. That is all you
| need in most cases. It makes sync far easier, makes stuff easier
| to trace (thanks to the version), and is immune to timestamp
| problems (some computers have their clocks wrong). In short:
|     struct Order {
|         code: String,    // natural key
|         version: usize,
|     }
|
|     struct Location {
|         code: Hash,      // hash of city + country
|         city: String,
|         country: String,
|         version: usize,
|     }
|
| Natural keys are global if well defined. In places where the key
| is not obvious, hashing the whole row and giving it a readable
| encoding works the same way.
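|
| A Python sketch of the hashed natural key plus version idea. The
| normalization, the truncated SHA-256, and the "higher version
| wins" merge rule are all assumptions made for illustration; the
| comment above does not specify them.

```python
import hashlib
from dataclasses import dataclass


def location_code(city: str, country: str) -> str:
    """Derive a stable key from the natural attributes (city + country)."""
    # A separator that is unlikely to appear in the fields avoids ambiguous joins.
    material = f"{city.strip().lower()}\x1f{country.strip().lower()}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


@dataclass
class Location:
    city: str
    country: str
    version: int = 1

    @property
    def code(self) -> str:
        return location_code(self.city, self.country)


def merge(local: Location, remote: Location) -> Location:
    """Sync rule assumed here: same key, higher version wins (ties keep local)."""
    if local.code != remote.code:
        raise ValueError("records describe different entities")
    return remote if remote.version > local.version else local


a = Location("Bogota", "Colombia", version=2)
b = Location("bogota ", "colombia", version=3)
print(merge(a, b).version)
```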
|
| This will also reveal when something TRULY needs a GUID or
| similar. For example, for invoices in my country the law demands
| partitioning of ranges with certain characteristics (i.e.
| INV-1-XXX on machine 1, INV-2-XXX on machine 2).
|
| Adding another id:i64 becomes redundant most of the time. If your
| Order.code is duplicated or whatever, it will be the same problem
| with or without an extra id:i64, so it is better to deal with the
| problems of the ACTUAL data when needed and not mask them with
| other stuff.
|
| The downside is that the key becomes repeated in JOINs (like in
| InvoiceLine), but honestly all RDBMSs handle triggers, and it
| actually becomes very nice to see the Order.code in the child
| relations (far easier to correlate).
| BillinghamJ wrote:
| Isn't the obvious difficulty here that such natural keys tend
| to change? Eg names and boundaries of cities & countries change
| - relatively - all the time. You don't want your IDs to be
| changing alongside them.
| mamcx wrote:
| Why not?
|
| I work in ERP/eCommerce, so this issue happens. But really, I
| need to store historical facts anyway, i.e. in an invoice/order
| I must store the ship info as it was at the time of the
| transaction.
|
| An artificial ID is useless in a lot of the cases where people
| ask "and what if the key changes?", because if that is an
| issue, it is mostly because the data must retain history, and
| then you need to store that anyway.
|
| For that, I log the data in a history/log table.
| nesarkvechnep wrote:
| "At the very least, identifiers should not be allowed to float
| freely as strings or integers in order to prevent a class of
| inconsistency bugs."
|
| Tell that to almost every Typescript developer who uses `number`
| for identifiers.
| crooked-v wrote:
| Typescript doesn't have functionality for non-equivalent types
| with identical primitives or interfaces (e.g. you can't have
| string-equivalent type A ≠ string-equivalent type B), though
| there's ongoing discussion around adding it
| (https://github.com/Microsoft/TypeScript/issues/202).
| 0xdeafcafe wrote:
| Arctic take.
| hermanradtke wrote:
| That is why we use things like io-ts. t.Int is a branded type
| that ensures number is whole and greater than 0.
| ff333ttee wrote:
| I also tried to use custom types for IDs, but in my opinion, it
| has more cons than pros. I have to write custom serializers and
| model binders for them, explicitly convert them to other types,
| write separate validators, etc. In the end I ended up with even
| more bugs than before.
| lolinder wrote:
| I found serialization of typesafe IDs trivial in Kotlin with
| Jackson. We have a single generic supertype that has the
| correct annotation to tell Jackson to use the `value` field as
| the serialized value, and defines helpful methods like equals.
| Each new ID type is defined as a single line of code, simply
| inheriting from this base ID type and providing the actual
| underlying type (int, string, uuid).
|
| Validators likewise are a non-issue. Jackson handles that
| automatically during the underlying type conversion, and Java
| won't allow you to construct a UUID that is invalid.
|
| I see explicit conversion between types as a pro, not a con. If
| I'm going to take one ID and try to use it as an ID for a
| different type of entity, I'd better have a very good reason.
|
| Again, I did this with a specific set of tools, but the
| functionality I used should be available in pretty much any
| language and serialization framework.
| yrimaxi wrote:
| Pretty obvious stuff. Of course GUIDs are more unwieldy to read,
| etc., compared to simple auto-incrementing integers.
|
| I don't see a reason to prefix an id with something like `task-`.
| I would rather leave it to the display logic.
___________________________________________________________________
(page generated 2021-01-06 23:01 UTC)