[HN Gopher] GUIDs Are Not the Only Answer
       ___________________________________________________________________
        
       GUIDs Are Not the Only Answer
        
       Author : ublaze
       Score  : 47 points
       Date   : 2021-01-05 20:56 UTC (1 days ago)
        
 (HTM) web link (www.softwareatscale.dev)
 (TXT) w3m dump (www.softwareatscale.dev)
        
       | gnulinux wrote:
       | Why not just `f"task-{uuid.uuid4()}"`?
        
         | rualca wrote:
         | I agree, the author is mixing up presentation with the data
         | type itself, and then pinning the blame on the poor
         | presentation on the way he chose to present the data.
         | 
         | If the author's problem is not being able to grep logs easily
         | with UUIDs because they miss context info then the solution to
         | his problem is to output logs that are searchable by providing
         | additional context.
         | 
         | His arguments regarding searchability are also specious. Take
         | for example
         | 
         | > For example, a task ID would look like: `job-123-task-1`.
         | This also helps in ad-hoc database queries to find relevant
         | rows without complex JOINs.
         | 
         | Well, if he's already running SQL queries then running an exact
         | match on two attributes is far easier and elegant and less
         | error prone than running a string pattern search on a derived
         | attribute.
        
         | haimez wrote:
         | Read the article, but also: sortability.
        
       | withinboredom wrote:
       | GUID v6 is pretty nice when you need monotonously increasing
       | numbers that are globally unique.
       | 
       | On another note, I worked somewhere that prefixed GUIDs with the
       | environment the app was running on. All of production, staging,
       | even dev machines all used the same connection string.
       | 
       | There was even a stored procedure to copy user accounts, etc of
       | prod for your machine. It was hands-down the best debugging
       | experience when a customer had an issue.
        
         | asidiali wrote:
         | That sounds very scary, but also, extremely freeing.
         | 
         | No more permissions to deal with! But also...no more
         | permissions to deal with...
        
       | vsareto wrote:
       | >Poorly formatted log statements/errors can become harder to
       | debug >UUIDs often need context to aid debugging.
       | 
       | The GUID vs Int problem is incidental to the real problem: poorly
       | formatting logs. An integer or other key type without context is
       | no more helpful.
       | 
       | >At the very least, identifiers should not be allowed to float
       | freely as strings or integers in order to prevent a class of
       | inconsistency bugs.
       | 
       | I'm not familiar with the ergonomics of GUIDs across all
       | languages, but C#/MSSQL makes them pretty easy to handle when
       | they have been chosen as keys.
       | 
       | So the answer, as far as ergonomics go, is not settled depending
       | on your stack.
        
         | rualca wrote:
         | > The GUID vs Int problem is incidental to the real problem:
         | poorly formatting logs.
         | 
         | This.
         | 
         | The author's arguments boil down to "UUIDs are bad because the
         | way I generate my logs is an unusable mess".
         | 
         | The ints vs UUIDs problem is also specious because a UUID is
         | nothing more than a 16 byte int that's generated in a clever
         | way. It might typically be presented in a particular string
         | representation but the text string is not the value of the
         | UUID.
        
           | dehrmann wrote:
           | Yeah, IDs with a sufficiently large dataset will be
           | unintelligible no matter what you do.
        
       | ivan_ah wrote:
       | The Python package `shortuuid` makes working with UUIDs a little
       | easier by encoding them as strings:
       | https://github.com/skorokithakis/shortuuid#usage (uses base-57
       | encoding, with alphabet consisting of A-Za-z0-9 with potentially
       | confusing symbols skipped)
       | 
       | The string representation is what you show to users, but under
       | the hood it's still a UUID and compatible/interoperable with any
       | other system that needs UUID-shaped identifiers.
       | 
       | The coolest part is youcan even truncate the string encodings to
       | get shorter IDs, which correspond to UUIDs with lots of leading
       | zeros.
        
         | rualca wrote:
         | > The Python package `shortuuid` makes working with UUIDs a
         | little easier by encoding them as strings:
         | 
         | This. In my opinion the only issue with UUIDs is that their
         | standard textual representation can be very verbose and
         | cumbersome in some scenarios.
         | 
         | A single ID takes over around 40 string characters in a line.
         | That's half the width of a default terminal's width.
         | 
         | Thus the solution is obviously to use other textual
         | representations. Instead of base16 then let's ramp up the base
         | to shorten up the text dump, and while we are at it let's pick
         | a readable format.
        
       | gorgoiler wrote:
       | I've always wondered: what is the history behind hyphens in
       | UUIDs?
        
         | mikequinlan wrote:
         | https://en.wikipedia.org/wiki/Universally_unique_identifier#...
        
       | winrid wrote:
       | This is one thing I like about type systems where you can declare
       | a primitive type as MyImportantThing. This ensures the string or
       | what have you is explicitly defined as MyImportantThing. Rust
       | does this pretty well. C/C++ AFAIK will let you pass in the raw
       | string, and so will Java if you "extend String".
        
         | yrimaxi wrote:
         | What does it mean to "extend String" in Java? You can't, of
         | course, literally do that.
        
           | winrid wrote:
           | Sorry bad example. Forgot it's final. :)
        
       | bob1029 wrote:
       | The best answer is the humble integer. The only reasonable
       | arguments I have ever seen against using integer keys universally
       | are as follows:
       | 
       | #1 Integer keys have finite range.
       | 
       | #2 Integer keys betray the identity of other sensitive resources
       | when exposed as a public identity.
       | 
       | #3 Integer keys are "difficult" to sequence in the face of
       | multiple networked participants.
       | 
       | My resolutions and counter-arguments are as follows:
       | 
       | For many systems, #1 is not a concern, because the number of
       | expected entities is well-bounded by a 64 bit integer. For
       | others, #1 can be resolved by usage of more complex types such as
       | BigInteger (C#). If utilized carefully, these can be treated just
       | like normal integers, and quickly converted to/from byte arrays
       | of appropriate length to satisfy the required range. In virtually
       | all SQL implementations, blob columns containing these values can
       | be indexed with the exact same semantics as with a 64-bit integer
       | column. Whether this performs better or worse than GUID keys
       | probably depends on if you can provoke a >120 bit BigInteger
       | representation. This is quite unlikely, even for Google.
       | 
       | #2 is trivially solved by simply applying encryption to sensitive
       | keys as they traverse the boundary between your system and the
       | outside world. AES256 would do the trick here. You could also
       | generate entirely separate keys of any appropriate type for
       | public consumption (i.e. maybe some YT-style identifier format).
       | 
       | #3 is solved by anticipating the maximum possible # of nodes in
       | your system, and then producing a key space in which identities
       | are sharded out by a simple constant factor of that max quantity.
       | This would certainly produce concern regarding all of the skipped
       | identities (assuming you start with a small number of hosts on
       | day 1), but the proposed resolution above for #1 (BigInteger)
       | alleviates these concerns with a practically infinite range of
       | keys. Skipping 10k identities is a non-event when you have all of
       | infinity to pull from.
       | 
       | There are also other considerations with this. GUID keys are a
       | pain to communicate. Integers, even of massive range, are easy
       | for most humans to communicate verbally when appropriate digit
       | grouping and other reasonable measures are undertaken.
       | 
       | Also consider a situation in which you decide to use 1 global
       | integer range to key every single entity in your system. This
       | allows for interesting database structures in which foreign keys
       | are all referring to the same keyspace, so the specific type of a
       | thing is no longer a hard constraint in a relational sense. Some
       | would probably take substantial offense to this proposal, but I
       | have found in many cases this allows for powerful optimizations.
       | Anything can be used irresponsibly.
        
         | Thiez wrote:
         | > Integers, even of massive range, are easy for most humans to
         | communicate verbally when appropriate digit grouping and other
         | reasonable measures are undertaken.
         | 
         | Presumably this is after not taking your own advice of
         | performing AES256 encryption on the key before sending it to
         | the user?
         | 
         | Your three easy steps together seem a lot more complex than
         | just using guids.
        
           | bob1029 wrote:
           | Not all communications around keys are of a sensitive nature.
           | Virtually daily, we will communicate keys internally in order
           | to clarify details regarding some activity. Applying AES to
           | keys internally would only serve to hinder our operations &
           | support efforts.
        
             | mdtusz wrote:
             | How often are you verbally communicating primary keys/id's
             | internally within an organization?
             | 
             | I've never understood arguments against guid's, except for
             | table performance with databases that might take a
             | performance hit because of either inserts, or because data
             | won't necessarily be partitioned in the "correct" order
             | (e.g. MySQL writes to buffer pools).
        
         | rualca wrote:
         | > The best answer is the humble integer. (...)
         | 
         | ...and then you proceed by suggesting ID code generation
         | methods that replicate UUIDs.
         | 
         | It sounds like you are trying to criticize UUIDs while being
         | totally unaware of what UUIDs are.
         | 
         | I mean, UUIDs are nothing more than integers that are generated
         | following one of half a dozen different methods so that they
         | can be probably unique without relying on a central authority.
         | 
         | One of the UUID versions is a pure random number. Another is a
         | MAC address concatenated with a timestamp complemented with a
         | local counter. There are literally two UUID methods that
         | consist of hashing data.
         | 
         | Your supposed solutions to the alleged problem posed by UUIDs
         | is a reinvention of what UUIDs have always been. But UUIDs have
         | been though through and are standard and ubiquitous.
        
       | hermanradtke wrote:
       | I have stopped using UUID and GUID in favor of
       | https://github.com/ulid/spec
        
         | WorldMaker wrote:
         | The article mentions KSUID which has a very similar spec, but
         | with a wider random segment.
         | 
         | For comparison: KSUID is 160-bit wide versus ULID sticks to
         | 128-bit wide (which is somewhat more compatible with UUID-like
         | database column types, for instance, so long as they don't try
         | things like UUID version checks). KSUID uses Base62 encoding
         | versus ULID uses a much simpler to (lexicographicly) sort
         | Base32 encoding.
         | 
         | (I've found ULID useful in some of my own projects.)
        
       | BillinghamJ wrote:
       | I would strongly say strings are actually the _only_ type you
       | should use for IDs. Prevents the vast majority of buggy client
       | behaviour and gives you good flexibility to change how you do
       | things over time.
       | 
       | ---
       | 
       | My company ended up with a simple KSUID implementation of our own
       | - https://www.cuvva.com/product-updates/showing-off-our-fancy-...
       | (having originally used UUIDs and Mongo ObjectIDs)
       | 
       | For us, a big part of it was usability with cursor selection etc
       | - in addition to it being immediately obvious what the ID was
       | for.
       | 
       | Once we finally had that rolled out everywhere, we ended up
       | collecting up every other ID we'd ever used and mapped it to its
       | KSUID resource equivalent, so now all our IDs work standalone
       | without type/context info, even across environments (and
       | thankfully we'd never had any collisions on the old IDs)
       | 
       | ---
       | 
       | Going back to the typing - the most difficult part of migrating
       | our IDs actually was converting them all to string types. With
       | Postgres this is a little slow but ultimately fine, but with
       | Mongo you have to actually remove and reinsert every document -
       | you cannot (or at least could not) update IDs in place.
        
         | magicalhippo wrote:
         | > I would strongly say strings are actually the _only_ type you
         | should use for IDs.
         | 
         | Just got that in the face at work. In a large custom
         | integration we read orders, they have unique order numbers,
         | nice 5 digit things. So in the database, they became integer
         | primary keys, with lots of child tables.
         | 
         | Fast forward 5 years, customer switches ERP system and they ask
         | "hey, the order numbers, you do support 10+ digits right?"
         | 
         | Changing the database is relatively easy, changing the code is
         | a chore, but worst part will be going over all the queries and
         | their parameters, especially in the reports.
         | 
         | So yeah, lesson learned.
        
           | fbi-director wrote:
           | I have an honest question. Why would anybody, ever, make an
           | order number that only has 5 digits. Even if it's just a
           | home-hobby project, the cost of changing to 7-10 digits is so
           | small a d negligible that I can't see any reason for choosing
           | anything lower. Like the 640k that was once "enough for the
           | long term future" in DOS. I understand that hindsight is
           | 2020, but can't wrap my head around not starting bigger when
           | there's no extra cost (nearly).
           | 
           | Could you shed some light on that?
        
             | jstarfish wrote:
             | Back in the day, we were taught to use database constraints
             | to validate user inputs.
             | 
             | Memory, storage and compute were also more limited, so
             | there _was_ an extra cost to over-spec.
        
             | dehrmann wrote:
             | Since 10 digits puts us in billions, you're really asking
             | "why would someone do 'CREATE TABLE ... id INT
             | AUTO_INCREMENT' when they could use a BIGINT?" _These days_
             | , there's rarely a reason not to use a BIGINT, but I also
             | have a little trouble faulting someone for thinking 2B
             | would be enough when they're currently at 10k.
        
             | magicalhippo wrote:
             | > Could you shed some light on that?
             | 
             | First off, I had a brainlapse, their order numbers were 6
             | digits. I'm not entirely sure what they max was in their
             | system, it was just that's where they were at in the
             | series.
             | 
             | And they didn't have that many orders per year, less than
             | 10k, as one order could be for say five containers of
             | goods. So it was not like they'd exceed 9 digits in the
             | foreseeable future.
             | 
             | Or so we thought...
        
         | dehrmann wrote:
         | Strings have their own issues. One place I worked had a bug
         | where users could take over accounts because of
         | missing/inconsistent unicode canonicalization. Case can be a
         | problem, as can special characters.
         | 
         | There's something to be said for strings, though. Prefixed IDs
         | that mark the type can be nice to work with when it's an
         | otherwise opaque ID, but they're a pain to handle internally.
         | 
         | If you're storing third-party IDs, you probably want
         | strings...unless their clever JSON API returns 1.3E6 as a
         | number. Or the string "null;" that's always an adventure.
        
       | mamcx wrote:
       | I tried many things for making sync work across devices. I tried
       | GUIDs, and partitioning ranges of ints, and several versions of
       | it.
       | 
       | But what worked amazing?
       | 
       | Use NATURAL keys (or their hash) + version field. That is all you
       | need most of the cases. It make sync far easier, easier to trace
       | stuff (thanks to version), immune to problems of timestamps (some
       | computers have their cloks wrong). In short:
       | struct Order {             code: String, //natural key
       | version:usize         }              struct Location {
       | code: Hash //hash of city + country             city:String,
       | country:String,             version:usize         }
       | 
       | Natural keys are global if well defined. In some places where it
       | is not obvious, hashing the whole row and put a nice encode is
       | the same.
       | 
       | This also will reveal when something TRULY need a guid or
       | similar. For example, for invoices in my country the law demand
       | partition of ranges with certain characteristics (ie: INV-1-XXX
       | in machine 1, INV-2-XXX in machine 2).
       | 
       | Add another id:i64 become redundant most of the time. If your
       | Order.code is duplicated or whatever it will be the same problem
       | with or without an extra id:i64, so is better to deal with the
       | problems of the ACTUAL data when is need and not mask it with
       | other stuff.
       | 
       | The downside is that the key become repeated in JOINS (like in
       | InvoiceLine) but honestly all rdbms handle triggers, and it
       | actually become very nice to see the Order.code in the child
       | relations (far easier to correlate).
        
         | BillinghamJ wrote:
         | Isn't the obvious difficulty here that such natural keys tend
         | to change? Eg names and boundaries of cities & countries change
         | - relatively - all the time. You don't want your IDs to be
         | changing alongside
        
           | mamcx wrote:
           | Why not?
           | 
           | I work for ERP/eCommerce, so this issue happens. But for
           | real, I need to store historical facts anyway ie: In a
           | invoice/order I must store the ship info at the time of the
           | transaction.
           | 
           | A artificial ID is useless in a lot of cases where "and if
           | the key change, what?" because if that is an issue, is mostly
           | because the data must retain history, and then, you need to
           | store that anyway.
           | 
           | For that, I log the data in a history/log table.
        
       | nesarkvechnep wrote:
       | "At the very least, identifiers should not be allowed to float
       | freely as strings or integers in order to prevent a class of
       | inconsistency bugs."
       | 
       | Tell that to almost every Typescript developer who uses `number`
       | for identifiers.
        
         | crooked-v wrote:
         | Typescript doesn't have functionality for non-equivalent types
         | with identical primitives or interfaces (e.g. you can't have
         | string-equivalent type A [?] string-equivalent type B), though
         | there's ongoing discussion around adding it
         | (https://github.com/Microsoft/TypeScript/issues/202).
        
         | 0xdeafcafe wrote:
         | Arctic take.
        
           | hermanradtke wrote:
           | That is why we use things like io-ts. t.Int is a branded type
           | that ensures number is whole and greater than 0.
        
       | ff333ttee wrote:
       | I also tried to use custom types for IDs, but in my opinion, it
       | has more cons than pros. I have to write custom serializers and
       | model binders for them, explicitly convert them to other types,
       | write separate validators etc... At the end I finished with even
       | more bugs than before.
        
         | lolinder wrote:
         | I found serialization of typesafe IDs trivial in Kotlin with
         | Jackson. We have a single generic supertype that has the
         | correct annotation to tell Jackson to use the `value` field as
         | the serialized value, and defines helpful methods like equals.
         | Each new ID type is defined as a single line of code, simply
         | inheriting from this base ID type and providing the actual
         | underlying type (int, string, uuid).
         | 
         | Validators likewise are a non-issue. Jackson handles that
         | automatically during the underlying type conversion, and Java
         | won't allow you to construct a UUID that is invalid.
         | 
         | I see explicit conversion between types as a pro, not a con. If
         | I'm going to take one ID and try to use it as an ID for a
         | different type of entity, I'd better have a very good reason.
         | 
         | Again, I did this with a specific set of tools, but the
         | functionality I used should be available in pretty much any
         | language and serialization framework.
        
       | yrimaxi wrote:
       | Pretty obvious stuff. Of course GUIDs are more unweildy to read
       | etc. compared to simple auto-incrementing integers.
       | 
       | I don't see a reason to prefix an id with something like `task-`.
       | I would rather leave it to the display logic.
        
       ___________________________________________________________________
       (page generated 2021-01-06 23:01 UTC)