[HN Gopher] You'll regret using natural keys
       ___________________________________________________________________
        
       You'll regret using natural keys
        
       Author : thunderbong
       Score  : 673 points
       Date   : 2024-06-05 01:25 UTC (21 hours ago)
        
 (HTM) web link (blog.ploeh.dk)
 (TXT) w3m dump (blog.ploeh.dk)
        
       | duxup wrote:
       | I just have some form of unique id field / synthetic key,
       | everywhere.
       | 
       | Even if just for documenting issues it makes life easier. "Table
       | whatever id 12345 is the record in question."
       | 
       | I've just seen data / relationships change too much too often in
       | new and interesting ways to believe in using a natural key.
        
       | ltbarcly3 wrote:
       | Sometimes you will regret it, other times you won't.
       | 
       | Notice that even if you believe that you should 'never use
       | natural keys', that is NOT the same thing as 'always add a
       | generated synthetic key to every table as the primary key'. You
       | should _NOT_ always add a generated key to _every_ table, even if
       | you use a trash ORM that really wants you to do this.
        
         | aaronbrethorst wrote:
         | That's quite the absolute statement to make without even one
         | example.
        
           | hansvm wrote:
           | Maybe? Their statement was that you should "not always add,"
           | as opposed to the much stronger "always not add."
        
             | selcuka wrote:
             | At least the latter provided some examples.
        
               | Twisell wrote:
               | Here's a real world example if you run a script once
               | every 5 minutes that launch sub-task you might be tempted
               | to add a non natural auto-increment number that identify
               | each occurence to create a link between the script and
               | the subtasks.
               | 
               | However it will be way more painless to use a timestamp
               | of the script starting point a natural key.
               | 
               | This way when shit happen you have a relevant and stable
               | way to identify each occurrence. And timestamp is easily
               | indexable and ordered.
        
               | aaronbrethorst wrote:
               | Fast-forward a couple years: now I have scripts that
               | launch sub-tasks more than once a second.
        
               | yellowapple wrote:
               | So use timestamps with sub-second precision, which
               | virtually every SQL database supports (even ones like
               | SQLite that don't have built-in date/time types).
        
               | selcuka wrote:
               | What about multiple scripts running in parallel?
        
               | bruce511 wrote:
               | If you want a timestamp, by all means add a timestamp
               | field. And of course add a (non primary) key on the
               | field.
               | 
               | I would add a different surrogate field though for the
               | primary key. Because I've -always- discovered edge cases
               | which break my original model.
               | 
               | In this situation I can think of 2. Firstly if the
               | process triggers twice at the same time you have a
               | problem (which happened to me in a case where a test
               | suite fired up thousands of instances on a machine with
               | many cores). Secondly twice a year with daylight savings
               | (and when the owner of the system decided to change from
               | local time to utc).
               | 
               | Is there an actual natural kh out there that will never
               | fail me? Probably. But I'm tired of looking for it.
        
           | _flux wrote:
           | How about:
           | 
           | - Internet scanner that would use host IP as the primary key
           | 
           | - A table connecting two other tables using their primary
           | keys as the composite primary key (when there can only be
           | zero to one such connections)
        
           | ltbarcly3 wrote:
           | I don't think you read it carefully.
        
             | aaronbrethorst wrote:
             | I did. I just hate encountering databases designed by
             | people who take your advice to heart.
        
       | Spivak wrote:
       | Natural keys are great when they're someone else's synthetic
       | keys. My Discord bots make liberal use of this.
       | model: DiscordMessage          key = discord_message_id
       | key = discord_user_id          key = discord_channel_id
       | key = message_type             model: UserData          key =
       | discord_user_id          key = field_name             model:
       | GuildFeature          key = discord_guild_id          key =
       | well_known_feature_name             model: DiscordEvent
       | key = discord_event_id          key = discord_guild_id
       | 
       | This is how I was always taught to use them so I'm kinda confused
       | what the author is getting at. If the thing you're using as a key
       | can change and that doesn't _define_ the row then you don 't have
       | a key. If you're at a conference where everyone has a badge
       | number then that's a great natural key when scanning people into
       | your workshop. If you're at Disney and everyone has a magicband
       | with ids then you can use that as a natural key all over the
       | place when to you that's a visitor.
        
         | fn-mote wrote:
         | I think this is an excellent example of one of the pitfalls
         | that the author is getting at.
         | 
         | (Disclaimer: not a Discord bot programmer.)
         | 
         | Suppose discord_user_id is an identifier like @Spivak#2024 that
         | uniquely identifies the user. When Discord forces all users to
         | change to unique ID's and eliminates the disambiguating numbers
         | (or allows users to change them), then where does that leave
         | you?
         | 
         | (For the unfamiliar: something like this happened. Not sure if
         | those are really the "discord user id" used in bot service
         | though, but suppose that they are for the sake of the
         | argument.)
        
           | iaabtpbtpnn wrote:
           | It probably leaves you in the same place you'd be with a
           | synthetic key of your own: unable to track users across this
           | key change without additional Discord data.
        
             | foota wrote:
             | Yes, but if you then reference that as a foreign key in
             | some other table or system, then whatever migration you
             | need to do will encompass those systems as well, whereas if
             | you use a synthetic key from the start you wouldn't need to
             | change them.
        
           | jjmarr wrote:
           | Discord's real user ID is a very lengthy number not visible
           | to end-users nor used for other purposes.
           | 
           | https://support.discord.com/hc/en-
           | us/articles/206346498-Wher...
        
             | tylersmith wrote:
             | Which only reenforces the point that using what is exposed
             | as the ID is not safe as a natural key.
        
               | Spivak wrote:
               | You're reading this wrong. The real ID (the Snowflake) is
               | exposed in the API, and the UI if you turn dev tools on.
               | That's the one I use. You don't use @username#1209 but
               | 1247755691426542433.
        
           | beeboobaa3 wrote:
           | Discord user IDs are their numerical unique, unchangeable,
           | id.
           | 
           | https://support.discord.com/hc/en-
           | us/articles/206346498-Wher...
        
         | rsynnott wrote:
         | Does Discord guarantee, _absolutely guarantee_, that they will
         | never change their ids? This might be a safe practice if they
         | explicitly claim that, and you trust them absolutely. Not
         | otherwise.
        
       | ray_v wrote:
       | Feels like this could have used a few more solid examples up-
       | front. I think another good example would be PlayStation Network
       | using the natural key of "gamer tags" as the primary key to
       | identify players would be a good example. Since this effectively
       | locks players into having to keep a gamertag in order to uniquely
       | identify them in the service -- instead of having a synthetic key
       | that carries no meaning or data other than "uniqueness"
        
         | robomc wrote:
         | yeah the way he described it it's like... well who would ever
         | do that. but foreign keying to emails or usernames is much
         | easier to "accidentally" do and is a classic source of long-
         | term headaches.
        
           | jaggederest wrote:
           | As a great example: My steam account name is the email I was
           | using in 2003. I have largely not used it since then. The
           | email on the account has been updated, but the account name?
           | Stuck.
        
             | jimbobthrowawy wrote:
             | I was about to ask how you got past steamguard without
             | email access. Makes sense if you were able to navigate off
             | it.
        
               | jaggederest wrote:
               | Email was changed about 5 years before steamguard release
               | :)
        
         | mths wrote:
         | Last I checked, Steam still has me logging in with my two
         | decade old hotmail address as my account name. At least it's
         | not something that shows publicly, I think.
        
           | petepete wrote:
           | Mine too, but with a British ISP that hasn't existed for
           | fifteen years.
        
             | t90fan wrote:
             | ntl?
             | 
             | me too!
        
               | petepete wrote:
               | That's the one! myinitials@ntlworld.com
               | 
               | And NTL merged with Virgin in 2006, making it 18 years.
               | Oh dear.
        
           | nottorp wrote:
           | I wouldn't swear it, but I think when I changed my Steam
           | email the process didn't require access to the old email,
           | just knowing the user and pass.
        
           | zamadatix wrote:
           | I'm able to log out and back into Steam using my plain
           | username rather than email. The box actually says"enter
           | account name" and wouldn't log me in if I tried to enter my
           | email instead. I'm not sure if there are conditions to this
           | e.g. my account was only created in 2009 and I'm not sure if
           | I've ever toggled anything that "flipped it over" along the
           | way either.
           | 
           | On the API side I know Steam has something like 3 different
           | forms of auto generated alphanumerical account ID and I
           | imagine that's what everything is really keyed off of on the
           | back end.
        
         | bruce511 wrote:
         | This is one of those cases where examples of natural keys
         | failing are so ubiquitous that they are almost redundant.
         | 
         | For the group who have forged a career with natural keys, and
         | never regretted it, more power to you. Great.
         | 
         | However to the rest of us, myself included, where ill-
         | considered natural keys have caused endless opinions and
         | suffering, my commiserations.
         | 
         | If I could send back one piece of advice to junior-me it would
         | be to avoid natural primary keys. (Ideally with the corollary
         | to avoid sequences, but that's another thread for another day.)
        
         | saghm wrote:
         | On the other hand, Discord used to allow arbitrary usernames
         | and would add a suffix for when you needed to disambiguate
         | (e.g. if you used the username JohnDoe, your "id" would be
         | something like JohnDoe#12345) but over the past year or so
         | forced everyone to pick a fully unique username. In this case,
         | the decision seemed to potentially be financially motivated,
         | given that people with subscriptions were given priority for
         | claiming usernames, and historically there really hasn't been
         | much reason to pay for it in the first place.
        
         | bux93 wrote:
         | A phone company I was a customer of used my phone number as the
         | customer Id. Which doesn't work great for people, or companies,
         | with more than one phone number. They could also only provision
         | one SIM per phone number, and not use a phone number tied to a
         | SIM as a virtual number, and also the primary (customer Id)
         | phone number had to be tied to a SIM.
         | 
         | I ported my phone number away from them, but what if I ever go
         | back? Will my old data be there? Including my old address? What
         | if my phone number gets recycled, and some-one else gets that
         | phone number and ports it to them? So many questions.
        
       | davidhyde wrote:
       | A natural key is what I would normally call unique index. Say
       | first, surname and date of birth in an employee table as a bad
       | (poor design) example. As opposed to a surrogate key like
       | personId being an auto incrementing ID which is the norm since it
       | can be easier to use when joining tables. I wish articles like
       | these would explain terminology up front. I found it irritating
       | to wade into the anecdotal trivia and not know what I was reading
       | about and I'm very familiar with how databases work!
        
         | bongodongobob wrote:
         | One reason to not do that is because you could then figure out
         | other users IDs by their hiring date. This ended up being a
         | problem for me once with the auxiliary systems that relied on
         | that ID, namely when it was used in links in other
         | applications.
        
           | tadfisher wrote:
           | Another reason is sharding, in which case any non-random (to
           | be more precise, non-uniformly distributed) bits are going to
           | skew the partitions.
        
           | eddd-ddde wrote:
           | Why is knowing someone's ID an issue?
           | 
           | They are meant for identification, if anything it should be a
           | benefit that they are easily guessable.
        
             | selcuka wrote:
             | > Why is knowing someone's ID an issue?
             | 
             | It increases the attack surface as an authorization
             | vulnerability will allow an attacker to enumerate and
             | access all records. Yes, it is security through obscurity,
             | but a random (e.g. UUID) scheme makes it harder.
        
               | kubanczyk wrote:
               | ~128 bits worth of obscurity is considered real security
               | for the time being. Assuming a cryptographically secure
               | PRNG.
               | 
               | Thats like guessing a password 18 ASCII chars long.
        
               | bongodongobob wrote:
               | Exactly, it's not that _my_ systems use security through
               | obscurity, it 's the other ones mine ties into.
               | 
               | This was years ago and you don't see it as much anymore,
               | but think autogenerated links to shitty CRM, ticketing,
               | and project management software where the link is the
               | query aka - Blahsoftware.local/info/bunchofgarbage?=useri
               | d+garbage+view+sensitiveinfo.html type stuff.
        
       | daft_pink wrote:
       | Anyone that's spent large amounts of time using vlookup in excel
       | can totally relate. Mashing some fields together as a key works
       | great until it doesn't.
        
       | projektfu wrote:
       | The devil's advocate cases:
       | 
       | 1. When duplication occurs and goes unnoticed because the natural
       | key isn't being used, and then perhaps the problem is "corrected"
       | by an administrator in a way that doesn't make sense.
       | 
       | For example, someone sets up an account with an street address A,
       | then forgets they had it and sets up an account with street
       | address B. They call and complain that they can't find an old
       | order, or whatever, and the duplication is discovered. The
       | administrator later clobbers one of the addresses but both are in
       | use. A natural key (behaviorally speaking) may have presented the
       | duplication, assuming address isn't part of the natural key. This
       | can be satisfied by uniqueness constraints, etc.
       | 
       | 2. When you are browsing the database, you see a bunch of
       | synthetic keys and have to perform various joins to be able to
       | see the relevant data. The synthetic keys make joins easier to
       | write, but make may make ad-hoc queries more time-consuming and
       | difficult.
        
         | roenxi wrote:
         | A synthetic key is adding an additional dataum - so any problem
         | that could be solved with the original data can be solved by
         | the new data too (worst case, ignore the primary key). That is
         | one of the best reasons to add a proper index - there is almost
         | literally no downside. We're talking maybe a few bytes/record
         | and arguably a trivial amount of administrative overhead.
         | 
         | Bearing that in mind:
         | 
         | 1. Add a unique constraint on the natural key.
         | 
         | 2. If the accuracy of joining by natural key is acceptable,
         | join based on it. If that doesn't work, then a table design
         | without a synthetic key wouldn't be fit for purpose either.
        
           | kubanczyk wrote:
           | In other words, don't mistake a primary key with a key or a
           | constraint.
        
       | azlev wrote:
       | I think somewhat different than the article in some ways.
       | 
       | The first example is why `UPDATE CASCADE` was implemented. So
       | it's possible to use natural keys as identity without the fear of
       | children table. At least in most databases it works.
       | 
       | The drawback of enumeration is real, so if you expose this key
       | you'll need some authentication/ authorization mecanism.
       | 
       | Another good thing in natural keys is that you can eliminate part
       | of the joins. You don't have to join the father table because the
       | key in child table is known.
       | 
       | I think the biggest challenge is how to map logins to people.
       | It's very common to interpret both as the same, but they are not.
        
         | arthens wrote:
         | > The first example is why `UPDATE CASCADE` was implemented. So
         | it's possible to use natural keys as identity without the fear
         | of children table. At least in most databases it works.
         | 
         | This is only true in a system using a single database, not
         | replicating data in external services, and not offering APIs.
         | 
         | And while it might work today when the service/website is still
         | fairly small and self contained, the requirements might change
         | at any time. So the title (You will [future] regret it) still
         | applies to this solution
        
       | RecycledEle wrote:
       | For those who do not know, natural keys are when you have a
       | primary key in a database table that is derived from the data
       | itself.
       | 
       | An example is using a car's chassis number as the key for the
       | record describing that vehicle.
        
         | rhelz wrote:
         | Perhaps a better example would be a fingerprint or a retinal
         | scan instead of social security numbers as natural keys for a
         | person.
        
       | parpfish wrote:
       | There another reason not mentioned -- if your key is something
       | like a UUID, it's very easy to define the logic for joining and
       | filtering based on that key.
       | 
       | If you were using some sort of string like an email address or
       | username, you have to think about case sensitivity and trimming
       | white space and all sorts of preprocessing and then make sure you
       | do it consistently EVERYWHERE
        
         | Cyph0n wrote:
         | Ideally, you should aim to sanitize/normalize strings on the
         | write side rather than resanitizing on every read.
        
           | SoftTalker wrote:
           | Yes but any kind of bug could later introduce tainted values.
           | A foreign key contstraint might save you, but not always.
        
           | parpfish wrote:
           | That helps part of it, but there are still places for
           | problems to pop up.
           | 
           | - you need to do an adhoc query to look something up so you
           | have to type in the key in a where clause
           | 
           | - Or you used different sanitation methods in two different
           | databases and you need to join things now
           | 
           | - you try to join the table that had the sanitized email key
           | to a different table that just so happens to have email but
           | it wasn't sanitized because it was an optional field and not
           | the key
        
             | phatskat wrote:
             | And then you get the other stuff like one-off migrations
             | that are done by someone unaware of the business
             | constraints, or a contractor who completely misses the ORM
             | method, etc.
             | 
             | These of course should be caught by checks and balances,
             | and you can't count on anyone knowing what they need to
             | know ten years after an implementation is done
        
             | sgarland wrote:
             | > you need to do an adhoc query to look something up so you
             | have to type in the key in a where clause
             | 
             | Having a surrogate key here doesn't help. A WHERE predicate
             | can be rewritten as an INNER JOIN quite easily. Or you
             | could use a subquery, or a CTE. Many options.
             | 
             | The other problems discussed are an engineering culture
             | problem. Either you value correctness and consistency or
             | you don't.
        
               | parpfish wrote:
               | not sure how converting a WHERE to a JOIN helps when
               | you're doing adhoc queries like this to look up records:
               | 
               | `SELECT * FROM users WHERE email LIKE
               | 'FirstName@domain.net'`
        
           | sgarland wrote:
           | You could also have CHECK constraints on those columns to
           | ensure that nothing is written incorrectly, and/or a pre-
           | write trigger that casts them to lowercase.
           | 
           | Or you could make the email column case-insensitive, since
           | it's generally accepted to be CI anyway.
        
         | shkkmo wrote:
         | Side note about using email addresses.
         | 
         | It is pretty common to assume that email addresses aren't case
         | sensitive since many email providers treat them that way. Email
         | addresses absolutely are case sensitive so you need to preserve
         | case when storing them and be careful about case when using
         | them.
         | 
         | This makes email addresses particularly unsuited to being used
         | as a natural key. Most people treat them as case insensitive
         | and you need to work around that, but you can't safely just
         | treat them as case insensitive yourself.
        
       | keepamovin wrote:
       | The way I think of it is a key, in a database, is a _record_ key.
       | It is intended to identify a particular logical record, not the
       | logical _thing_ which is referred to by the record.
       | 
       | An an footnote, I'll add this doesn't mean that you need to have
       | complexities like record versioning, record history, or anything.
       | But couched in those conceptual terms where those things are
       | possible, is a happy and safe space to be. In this space, a
       | database records entries, as if they were each a single paper
       | form with boxes where you fill out, in pencil, to erase if you
       | like or not, the particulars of the thing you're recording. This
       | form comes pre-stamped with a number: the _record key_.
       | 
       | In this cozy little world, you can be imperfect, and mistakenly
       | (or deliberately, as your use case requires), file multiple slips
       | that refer to the same _logical thing_ , and yet all have
       | different file ("record") numbers - or keys.
        
       | jklowden wrote:
       | You think your surrogate key will save you? It will not. The
       | world has an external reality that needs to be reflected in your
       | database. If the unique identifier for your object -- VIN, CUSIP,
       | whatever -- if it changes, the world will henceforth refer to it
       | by both. You will need to track both. Adding a synthetic key only
       | means you have to track all three. Plus you have to generate a
       | meaningless number, which is actually a choke point in your data
       | throughput.
       | 
       | The natural key forces you to think about what makes the row
       | unique. What _identifies_ it. Sometimes, it makes you go back to
       | the SME and ask them what they mean. Sometimes it makes you
       | reconsider time: it's unique now, but does it change over time,
       | and does the database need to capture that? In short, what are
       | the boundaries of the Closed World Assumption? You need to know
       | that too, to answer any  "not exists" question.
       | 
       | To use our professor's car's example, we actually do not know the
       | database design. It could well be that the original identifier
       | remained the primary key, and the "new id" is entered as an
       | alias. The ID is unique in the Car table, identifying the
       | vehicle, and is not in the CarAlias table, where the aliases are
       | unique.
       | 
       | Oh, you say, but what if the bad old ID gets reused? Good
       | question. Better question: how will the surrogate key protect
       | you? It will not. The reused ID will be used to query the system.
       | Without some distinguishing feature, perhaps date, it will serve
       | up duplicates. The problem has to be handled, and the surrogate
       | key is no defense.
       | 
       | Model your data on the real world. Do not depend on spherical
       | horses.
        
         | kayodelycaon wrote:
         | Of course they won't save you from external data. The whole
         | point is for your system to have a way to identify rows
         | internally so you can deal with external systems getting wonky
         | without corrupting your own data.
         | 
         | All of your concerns are easily solved by a unique index.
         | 
         | Using external keys as a forcing function to prevent people
         | from representing data wrong is not great.
        
         | crabmusket wrote:
         | The surrogate key uniquely identifies _a row in your database_
         | , which is an entity just as real and significant as the car or
         | the employee or what-have-you. Don't confuse the two!
         | 
         | I agree with you that having a surrogate key isn't going to
         | save you from the reasons why natural keys can be difficult.
         | The complexity has to go somewhere. But _not_ having a unique
         | identifier for each row is going to make things _extra_
         | difficult.
        
           | hobs wrote:
           | The main thing is that the synthetic key should never leave
           | the database and never be displayed in the app - if you want
           | to have another key that represents the human oriented key do
           | it, but it should be another field, an indexed field even,
           | but one that has a lot less monotonic sequential properties
           | that are inherent to synthetic database identifiers.
           | 
           | You want to change that human key? Sure. You want to to
           | complain that the keys are not sequential? Sure. You want to
           | actually make them weird strings that are harmful to my
           | brain? Why not? You want to update primary keys in the
           | database? No. Absolutely not.
        
             | The_Colonel wrote:
             | I assume you hint at the security aspect of monotonic keys?
             | 
             | I've found this issue a bit overblown. It's basically
             | security by obscurity, which is a nice bonus, but not
             | something your security model can be based on.
             | 
             | I mean, it is a good practice to expose some kind of non-
             | sequential key (e.g. UUIDv7), but it doesn't seem to me
             | like a dealbreaker.
        
               | crabmusket wrote:
               | I love it when my competitors use sequential integer IDs.
        
               | throwaway8486k wrote:
               | Security by obscurity makes it more difficult for bad
               | actors as an additional layer that they need to break. I
               | don't see a reason why that's a bad thing and doesn't
               | take a lot of effort to implement in this case.
               | 
               | I have been in a startup where competitors used our
               | sequential keys to scrape a list of customers. Lesson
               | learned the hard way with actual business consequences!
               | 
               | Sequential keys also leak information (German tank
               | problem)
               | 
               | Your competitors can estimate number of customers, your
               | growth rate and other stats that you often don't want
               | them to know.
               | 
               | https://en.m.wikipedia.org/wiki/German_tank_problem
        
               | The_Colonel wrote:
               | I'm not saying it's a bad practice, the opposite
               | actually.
               | 
               | > I don't see a reason why that's a bad thing and doesn't
               | take a lot of effort to implement in this case.
               | 
               | True, if you start your application from scratch. Like if
               | I started designing a new app today, I'd just choose the
               | UUIDv7 for the primary key.
               | 
               | It's not an easy thing to add into an existing
               | application, though. I see applications leaking their
               | internal IDs all the time, but usually it's not worth the
               | effort to fix that, because it's a comparatively minor
               | problem.
        
               | throwaway8486k wrote:
               | > I'm not saying it's a bad practice, the opposite
               | actually
               | 
               | I'm sorry that I didn't make that more clear. I saw that
               | you mentioned it as a best practice and are aware of the
               | advantages. It's just that there are so many others that
               | don't have the balanced view as you seem to have.
               | 
               | I have been involved in many discussions at my work place
               | where "security by obscurity" is used as a way to shut
               | down discussions. They changed their minds about
               | sequential keys after the incident I mentioned, but it
               | still has the power to "win" other discussions. Sure, we
               | need to have rate limiting on ip-addresses, auth and
               | other mechanisms, but they are not perfect and bugs
               | happen all the time. An "unguessable" id is an additional
               | security layer
               | 
               | > It's not an easy thing to add into an existing
               | application, though
               | 
               | I agree, but there are ways to reduce the attack surface.
               | You could add an extra "public id" field that can be used
               | for lookup in addition to the existing id. In this way
               | you can have a gradual migration where you go through
               | each endpoint and migrate them individually without
               | changing the foreign keys and other relations in the
               | database (they would still use the sequential key). Maybe
               | you end up not having time to migrate them all, but at
               | least you can reduce the attack surface on the most
               | sensitive endpoints.
               | 
               | If you have low volume endpoints you could perhaps even
               | simply add a mapping layer where you do a simple db
               | lookup to replace the public key with the internal
               | without changing the existing queries. You could even
               | cache this easily to reduce the db load. (both ids are
               | permanently fixed and can be cached forever).
        
               | yellowapple wrote:
               | > I have been in a startup where competitors used our
               | sequential keys to scrape a list of customers.
               | 
               | If your system allows customers to see each other (or
               | worse: unauthenticated users to see customers) in this
               | fashion in the first place then whether you're using a
               | sequential integer v. a random UUID is the least of your
               | problems.
        
               | mgkimsal wrote:
               | The 'customers' could be free tier users - a social media
               | type system where everyone has a public profile -
               | intended for the public - would still be scrapable by
               | /profile/1, profile/2, etc. Doesn't necessarily require
               | 'authentication' for the exposing of sequential integers
               | to have a bad outcome.
        
               | throwaway8486k wrote:
               | You're right. The urls were public to be shared (think of
               | marketing material / ecommerce), so there was not a
               | security incident.
               | 
               | But it did give our competitor free highly qualified
               | leads that they could use to poach customers. This
               | product was new to our customers, and we had spent a lot
               | of time selling and convincing them that it was useful.
        
               | inopinatus wrote:
               | UUIDv7 may not be generated sequentially, but it is still
               | temporally ordered and discloses a timestamp, which may
               | be an undesirable leakage for some
               | applications/environments. When obfuscation matters that
               | much, use a UUIDv4 and eat the index performance hit.
               | 
               | Some might suggest, "encrypt your object identifiers on
               | the way in/out", but there's a ton of pitfalls involved
               | since for most applications they are now rolling their
               | own crypto, and it also makes identifiers much longer.
        
               | The_Colonel wrote:
               | Yes, you can go to great depths, but they each have
               | trade-offs - in performance, increased complexity etc.
               | and it's necessary to make a judgment for each particular
               | app instead of applying the most overengineered solution
               | everywhere.
        
               | hobs wrote:
               | Sort of, I have way more experience with clients saying
               | that an invoice is missing (which turned out to be a
               | rolled back transaction) and then I get to explain how
               | transactions work, and then if the customer is smart
               | enough they'll say something like "then why the hell is
               | the invoice number part of that?"
        
           | stouset wrote:
           | Precisely. And if I have a surrogate key to identify "a car",
           | _I_ get to define what makes a car unique for my purposes.
           | Maybe that really is the VIN. Maybe today it's the VIN, but
           | tomorrow it's whatever vehicle is currently being driven by
           | its owner. Maybe it's something else.
           | 
           | Some day I may need to track multiple VINs for a vehicle
           | (maybe it's got parts from multiple VINs and I want to track
           | that). I can still always decompose that table and have an
           | n-to-1 relationship between cars and VINs _without migrating
           | the rest of my data model_.
        
           | hackit2 wrote:
           | A entity can exist over more than one row in your database,
           | but it is useful to uniquely identify each row as the lowest
           | common denominator.
        
           | mkleczek wrote:
           | > The surrogate key uniquely identifies a row in your
           | database, which is an entity just as real and significant as
           | the car or the employee or what-have-you. Don't confuse the
           | two!
           | 
           | But the DBMS already maintains a row identifier (called rowid
           | or ctid or whatever depending on the DBMS). Why do you need
           | an explicit one?
        
             | unchar1 wrote:
             | It may be useful if you have data that originates from
             | another source or if something outside of our system
             | references your entity. In that case you need to keep some
             | form of an externalRef, so it's usually easier to just use
             | an id that you can control, for referencing both internally
             | and externally.
        
             | magicalhippo wrote:
             | In the DB we use[1] the internal row id is not stable:
             | 
             |  _The value returned by the function is not necessarily
             | constant between queries as various operations performed on
             | the database may result in changes to the row identifiers
             | of a table._
             | 
             |  _So, users should refrain from using the ROWID function in
             | ordinary situations; retrieval by primary key value should
             | be used instead._
             | 
             | [1]: https://infocenter.sybase.com/help/index.jsp?topic=/co
             | m.syba...
        
             | simonw wrote:
             | Be careful with those. SQLite has a rowid concept, but it's
             | not guaranteed to be stable - running a VACUUM against a
             | table can reassign the rowids for every row!
             | 
             | https://sqlite.org/rowidtable.html says:
             | 
             | > If the rowid is not aliased by INTEGER PRIMARY KEY then
             | it is not persistent and might change. In particular the
             | VACUUM command will change rowids for tables that do not
             | declare an INTEGER PRIMARY KEY. Therefore, applications
             | should not normally access the rowid directly, but instead
             | use an INTEGER PRIMARY KEY.
        
             | sgarland wrote:
             | Because every DB can and will shift those as needed. They
             | reference the physical location on disk for a given tuple.
             | They are not meant for general consumption.
        
         | polemic wrote:
         | In the wise words of Patsy, "it's only a model".
         | 
         | The real world is resistent to clean abstractions and
         | abstractions are distressingly subject to change. What made
         | your row unique today is quite likely to become non-unique in
         | the days/months/years to come.
         | 
         | Always use surrogate keys. Your future self will thank you.
        
           | hehdhdjehehegwv wrote:
           | And always kids, write code for one person, and one person
           | only: your future self.
        
             | M95D wrote:
             | Yeah, fsck your co-workers and your replacement when you
             | quit.
        
               | hehdhdjehehegwv wrote:
               | If you're good to your future self you're also good to
               | them.
        
         | tadfisher wrote:
         | In that case, using this natural datum as an ID is a
         | contradiction, because we are now removing the uniqueness
         | constraint. It's fine to model the real world, but the real
         | world also includes your data, and you may want to identify
         | unique _records_ as it is not a universal truth that data can
         | be corrected on entry.
        
         | huygens6363 wrote:
         | There is a model of a thing and there is a row that stores a
         | representation of that model of a thing.
         | 
         | They both are things. Ignoring the last one might be tempting,
         | but it's not practical.
         | 
         | Interestingly your own way of thought is applied, but now a
         | level deeper again.
         | 
         | How do you model a row? What makes it unique? A surrogate ID is
         | the only sensible unique identifier for such a thing as there
         | is no "natural key" that would make sense for instances of
         | something so general as "Row".
         | 
         | What you were saying amounts to "don't model the thing holding
         | the model", but experience shows the thing holding the model is
         | itself an (often unwilling) active part of systems.
         | 
         | Someone here gave the example of wrongly entered PK's by
         | administrative personnel handing live customers. That's IMO a
         | good example of why you need an extra layer on top of your
         | Actual Model(c). I can think of more.
        
           | crabmusket wrote:
           | > They both are things.
           | 
           | Corollary: your app is part of the real world.
        
             | TeMPOraL wrote:
             | Furthermore: bugs in your app are part of the real world,
             | so woe to those who used keys from your app as
             | natural/external keys in their app.
        
         | jtriangle wrote:
         | >Model your data on the real world. Do not depend on spherical
         | horses.
         | 
         | This took me years to realize, and once I did things became
         | much, much simpler.
        
         | CrimsonRain wrote:
         | Sorry. This is a very bad advice. I just had to fight tooth and
         | nail to make my lead turn around from this disastrous decision.
         | Using a lot of external IDs as our own row primary keys and
         | then they get propagated to all other tables as foreign keys
         | and what not. One day the foreign key chances or God forbid,
         | the formatting changes in external systems, now we need to fix
         | our whole database and all codes instead of a small isolated
         | place.
         | 
         | Generate your own unique keys for everything; add a few more
         | unique constraints if needed. A bit more work but never a
         | regret.
        
           | aidos wrote:
           | Yeesh. I once made the mistake of using an external ID as a
           | primary key. What a day it was when they were changed on me.
        
             | spoiler wrote:
             | Can you share more about this? Wouldn't you run into the
             | same problems if you used a surrogate pk? Without the
             | nat/external pk and fk, you run the risk of having validity
             | issues.
             | 
             | Conversely, if the ID changes, isn't the friction what you
             | want?
             | 
             | I feel like optimising for unlikely edge cases instead of
             | integrity because of a single incident is too reactionary.
             | 
             | Writing a query or script to update a string, even for
             | millions of rows, isn't that big of a deal?
             | 
             | Obviously there _are_ cases where nat pks/fks are bad, but
             | not all of them.
        
               | mixmastamyk wrote:
               | Instead of migrating one column, now you have that and
               | every table with a foreign key to the original. Requiring
               | transactions and possibly locking etc.
        
             | naasking wrote:
             | > I once made the mistake of using an external ID as a
             | primary key. What a day it was when they were changed on
             | me.
             | 
             | I've kept with this advice for the most part, but I'm
             | tempted in some cases to use the external id when there's
             | some guarantee of stability and universality. Like 2 and 3
             | digit ISO country codes.
        
               | Piskvorrr wrote:
               | Not that I'd get about 5 different ISO country code
               | changes (with some flipflopping) just by sitting in this
               | very same spot for a couple of decades. "Stability" in
               | country codes, bah humbug.
        
               | naasking wrote:
               | Your geographical location's sorting into a country might
               | not be stable, but I'm referring to the ISO codes that
               | name the country, which should be relatively stable.
        
         | devjab wrote:
         | In terms of the Danish CPR that is mentioned here, the way we
         | actually solved the challenge in the national architecture
         | strategy was to define your social security number(s) as what
         | we call an address on your person. I'm not sure why it was
         | called an address, but it's basically a UUID and the
         | information. Maybe it's because it was first used to store
         | addresses?
         | 
         | Anyway. Unless your system has not yet implemented the national
         | strategies (which by now are approaching 24 years) then
         | changing a Danish Social Security won't actually matter because
         | it's just an added address to your person "object". So
         | basically you'll now simply have two, and only which is marked
         | as the active. Similar to how you've got an array of addresses
         | in which you have lived but only one main address active.
         | 
         | It did indeed cause a lot of issues historically because it was
         | used as a natural key... though with it being based on dates,
         | it was never really a good candidate for a key since you
         | couldn't exactly make it into a number unless you had some way
         | to deal with all the ones beginning with 0 being shorter than
         | the rest. Anyway... it was used as a key, and it was stupid.
         | 
         | Anyway I both agree and disagree with you. Because we've
         | successfully modelled the real world with virtual keys, but you
         | access it by giving any number of natural keys. Like... you can
         | find the digital "me" by giving a system my CPR number, but you
         | can also find me by giving a system my current address.
         | Technically you could find me by giving a name, but good luck
         | if you're dealing with a common name. There is a whole range of
         | natural keys you can use to identify a "digital" person, but
         | all of it leads from a natural key into a web of connected
         | virtual keys.
         | 
         | All of it is behind some serious gatekeeping security if you're
         | worried. Or at least it's supposed to be.
        
           | mrweasel wrote:
           | It's been years since I worked on systems with CPR numbers,
           | but I seem to recall that there is also a policy in place,
           | stating that you are not allowed to use the CPR number as a
           | "primary key". Many companies did anyway, because they never
           | actually bothered to read the guidelines and regulations.
           | 
           | All the systems I've seen always had the CPR as a lookup for
           | a UUID, which as you say is the the "address" of the actual
           | person object.
        
         | paulddraper wrote:
         | Terrible advice.
         | 
         | Surrogate keys are keys are a layer of indirection.
         | 
         | They don't fix all problems, but they fix some problems.
         | 
         | Not least of which is performance. Often natural keys are
         | character strings, whereas surrogate keys can be fixed size
         | integers, saving index sizes on your FKs.
        
           | mypalmike wrote:
           | Conversely, certain queries can be much faster by using
           | natural keys when the FK is all that you need in the result
           | rather than additional fields in the primary table. In this
           | case, the primary table doesn't need to be queried at all.
           | This doesn't generally overcome the benefits of synthetic
           | keys, but it is an optimization sometimes put into practice.
        
             | crabmusket wrote:
             | Aren't you literally describing an index?
        
               | mypalmike wrote:
               | No. Maybe an example helps. You have a users table with
               | username as natural PK. You have an access table with
               | timestamps of users hitting a service, with FK of
               | username. If you query the access table for a range of
               | timestamps and just want the usernames, they're right
               | there in the results of the access table query. If you
               | had instead used a synthetic user_id key, the db would
               | have to do an additional lookup in the users table via a
               | join for each user_id to get the usernames.
        
               | crabmusket wrote:
               | Oh of course, that makes sense. I'm not sure how I
               | misunderstood that!
        
           | mrkeen wrote:
           | > surrogate keys can be fixed size integers
           | 
           | This launches into the _other_ debate about PKs: using UUIDs
           | rather than sequential keys.
        
             | sgarland wrote:
             | This is less of a debate, and more of an indicator of who
             | has had to work with a DB at scale using UUIDv4 everywhere.
             | 
             | Don't blow up your B+trees.
        
               | mrkeen wrote:
               | I guess.
               | 
               | Either your system is happy enough to route every new
               | entity through "one DB at scale" so it can let your "one
               | DB at scale" be in charge of an auto-incrementing long,
               | or it isn't.
        
               | sgarland wrote:
               | A common method is to have a small app (which can quite
               | easily be HA) that hands out sequential chunks to each
               | shard, interleaving and/or gapping as necessary.
        
         | SPBS wrote:
         | > You think your surrogate key will save you? It will not. The
         | world has an external reality that needs to be reflected in
         | your database.
         | 
         | Yes, it will. It is precisely because of a messy external
         | reality that you need an unchanging internal ID that is
         | unaffected by changes to the external ID. If the software
         | designer in the article had followed your advice, changing the
         | chassis number would have likely resulted in broken car
         | ownership records.
         | 
         | Whoever is reading this in the future, please don't follow the
         | parent comment's advice. Use surrogate/synthetic keys for your
         | primary key.
        
         | yashap wrote:
         | This is bad advice.
         | 
         | Say I have a user table, and the email is unique and required,
         | and we don't let users update their email, and we don't have
         | user deletion. If I'm going natural PK, I make email the
         | primary key.
         | 
         | But ... then we add the ability for users to update their
         | email. But it should still be the same user! This is trivial if
         | we have a surrogate primary key, a nightmare if we made email
         | the natural primary key.
         | 
         | Or building on that example, maybe at first we always require
         | an email from our users. But later we also allow phone auth,
         | and you just need an email OR a phone number. And later we add
         | user name auth, SSO, etc. Again, all good with surrogate
         | primary keys, a nightmare with natural primary keys.
         | 
         | There are countless examples like this. You brought up cars,
         | same thing with licence plates, for example. Or even Social
         | Security Numbers/Social Insurance Numbers - in Canada SINs are
         | generally permanent, but temporary residents can have their SIN
         | change if they later become permanent residents, but they're
         | still the same person.
         | 
         | You want your entities to have stable identity, even if things
         | you at one time thought gave them identity change. Surrogate
         | primary keys do that, natural primary keys do not. Don't use
         | natural primary keys, use surrogate primary keys with unique
         | constraints/indexes.
         | 
         | I challenge you to come up with a single plausible example
         | where you're screwing yourself by choosing surrogate PK +
         | unique constraints/indexes. Meanwhile there are endless
         | examples where you're screwing yourself by choosing natural PK.
        
           | maweki wrote:
           | > I challenge you to come up with a single plausible example
           | 
           | So I come from academia, but generally if you use a natural
           | key as PK in a foreign key constraint it may be possible to
           | express additional consistency criteria as CHECK-constraints
           | in the referencing table.
           | 
           | So this is a bad example, but say you have Name and Birthdate
           | as your PK, and you have a second table where you have
           | certain special offers sold to your customers and there is
           | this special offer just for Virgos and Pisces, you could
           | enforce that the birth date matches this special offer. Some
           | modern systems also technically allow FK on alternate keys,
           | so you could still do it that way, but database theory often
           | ignores that.
           | 
           | But second, while I agree that surrogate keys are often a
           | good idea, I find your argument, that you must design for
           | every conceivable change, not convincing.
        
             | cjfd wrote:
             | "for every conceivable change".
             | 
             | That is not what he is arguing at all. He is showing that
             | there are very many highly plausible changes that are
             | problematic with natural keys. And he totally correct about
             | that. Frankly, the fact that a post arguing for natural
             | keys makes it the top of an HN comment thread is extremely
             | weird. The original article is correct that natural keys
             | are bad.
        
             | zelphirkalt wrote:
             | Why does anything need to be a primary key anywhere in
             | order to enforce some constraint? At least from ORMs I know
             | I can set for example any group of attributes unique. Other
             | constraints can be implemented in some general method that
             | is called when persisting in the actual database. Even if
             | no ORM, you can write a wrapper around your persisting
             | procedure.
        
               | sgarland wrote:
               | It doesn't, you're right. However, indexes aren't free,
               | so if your data is such that a natural PK (composite or
               | otherwise) makes sense, you'll save RAM. Also, for
               | clustering RDBMS like MySQL, data is stored around the
               | PK, so you can get some locality boosts depending on
               | query patterns.
        
               | a57721 wrote:
               | > At least from ORMs I know I can set for example any
               | group of attributes unique
               | 
               | Just in case, ORMs send DDL statements with uniqueness
               | constraints to the DBMS, they don't do any magic here.
        
             | lstamour wrote:
             | > Some modern systems also technically allow FK on
             | alternate keys
             | 
             | As far as I can tell, all modern systems allow it, as it is
             | part of the SQL standard that foreign keys can be either
             | primary keys or unique indexes. Here's a brief quotation
             | from a copy of ISO/IEC 9075-2:1999 (not the latest version)
             | that I randomly found online:
             | 
             | > If the <referenced table and columns> specifies a
             | <reference column list>, then the set of <column name>s
             | contained in that <reference column list> shall be equal to
             | the set of <column name>s contained in the <unique column
             | list> of a unique constraint of the referenced table.
             | 
             | So it mentions unique constraints first. Then afterward it
             | says:
             | 
             | > If the <referenced table and columns> does not specify a
             | <reference column list>, then the table descriptor of the
             | referenced table shall include a unique constraint that
             | specifies PRIMARY KEY.
             | 
             | If I'm reading this right, it means that in the base case,
             | where you specify the column to reference, it can be any
             | unique constraint, where a primary key is just another
             | possible unique constraint (as all primary keys are by
             | definition unique). And only if you don't specify the
             | fields to reference does it then fall back to the primary
             | key instead of a named unique constraint.
             | 
             | I'm not disagreeing with you entirely - it's true that
             | often there's an assumption in database theory that primary
             | keys are natural and foreign keys are primary keys. But
             | this isn't a hard requirement in practice or in theory, and
             | it partly depends on the foreign key's purpose, why you
             | need it in the first place. This StackOverflow answer also
             | explains it well:
             | https://softwareengineering.stackexchange.com/a/254566
             | 
             | I should add that there is also a set of database design
             | wisdom that suggests you should never use database
             | constraints such as foreign keys, only app/api constraints,
             | but that's a whole different tangent.
        
               | sgarland wrote:
               | > I should add that there is also a set of database
               | design wisdom that suggests you should never use database
               | constraints such as foreign keys, only app/api
               | constraints, but that's a whole different tangent.
               | 
               | That's less a DB design thought and more of a "devs with
               | little formal training in RDBMS who only want to use it
               | as a dumb store" thought.
               | 
               | Use the DB to its strengths. CHECK constraints add an
               | infinitesimal amount of overhead for writes, and
               | guarantee that there will never be invalid data written.
               | A bad code change could allow, however briefly, for that
               | to occur.
        
             | spoiler wrote:
             | Another example is where you use a service that provides
             | you with a stable id. It makes little sense to add a
             | surrogate id and a fk on that surrogate id. It violates
             | data quality and integrity just for a hypothetical
             | situation.
             | 
             | Data integrity/quality matters. Adding friction to prevent
             | accidents also matters. I don't want something accidentally
             | and trivially updating a field that's used to reference
             | thing externally.
             | 
             | Something about the nat key is about to change? Fine, we
             | can write migrations. Even if it affects millions of rows,
             | it's not a big deal.
             | 
             | I understand people have been burnt by bad design decisions
             | involving nat keys, but they're not some devil's key
             | everyone here dogmatically makes them out to be. You can
             | mess up using anything.
        
               | maweki wrote:
               | > a service that provides you with a stable id
               | 
               | I think there's the important point. Is your key
               | _actually natural_ or is it someone else 's surrogate key
               | anyway? Going back to the vehicle identification number:
               | that's already a surrogate key. You just did not assign
               | it yourself.
        
               | sgarland wrote:
               | A VIN is not a surrogate key. A surrogate key must, by
               | definition, have no semantic meaning, and not be composed
               | of parts from multiple domains (among other
               | requirements).
               | 
               | A VIN encodes the following:
               | 
               | * Country of origin
               | 
               | * Manufacturer
               | 
               | * Vehicle type
               | 
               | * Vehicle model
               | 
               | * Engine type
               | 
               | * Transmission type
               | 
               | * Model year
               | 
               | * Manufacturing plant
               | 
               | * Serial number
        
               | shkkmo wrote:
               | > It makes little sense to add a surrogate id and a fk on
               | that surrogate id. It violates data quality and integrity
               | just for a hypothetical situation.
               | 
               | I would still almost always use an internal artificial
               | key on top of the external id. If you want to enforce
               | data integrity, you can still enforce uniqueness on the
               | external id. "Stability" of an external identifier is
               | almost always an assumption and one that I've seen fail
               | enough times to want that internal id by default.
        
           | knallfrosch wrote:
           | > then we add the ability for users to update their email.
           | 
           | At this point, you should verify the new email. At least
           | until it is verified, you must track the old email. At this
           | point, you realize you can now introduce a synthetic key and
           | you're fine.
           | 
           | Let's say you have a duplicate customer entry and the
           | customer demands their accounts be merged. Now you can't
           | identify the user by their key alone, since by definition,
           | they can't be the same (yet.)
        
             | epcoa wrote:
             | > At this point, you realize you can now introduce a
             | synthetic key and you're fine.
             | 
             | Except for having to update all foreign references. Some of
             | those may be external further complicating issues.
             | 
             | Emails are often among the worst keys because they are not
             | terribly stable and they are reusable often enough to burn
             | you.
        
               | jrs235 wrote:
               | Also are emails case sensitive or not? In some systems
               | (that you don't control mind you) they are and others
               | they are not...
        
               | sgarland wrote:
               | Per RFC5321, the local part (before @) _may_ be case-
               | sensitive, but in practice, it almost never is, and
               | relying on case sensitivity is a recipe for disaster.
               | 
               | The domain must always be case-insensitive.
        
               | jrs235 wrote:
               | My point being is if you're using email as a key you
               | "have to" treat it as case sensitive even though for most
               | it's not. And yes, I agree it will be a recipe for
               | disaster.
        
           | brigandish wrote:
           | You require three fields (or four): email at registration, a
           | date for that entry (together these create a natural key),
           | and current email (this one not part of the key and
           | editable).
           | 
           | We're almost all the way to a Tag URI[0], so you could
           | combine it with the user's name or username or any other
           | identifier that fits the spec[1] (you could even use the
           | website's own name) and you have a (definitely two thirds,
           | probably 100%) natural key.
           | 
           | It's stable over time and unique, easy to mint, and has a
           | standard behind it. The user also gets to change their
           | contact details without any problem related to the key.
           | 
           | [0] https://taguri.org/
           | 
           | [1] http://www.faqs.org/rfcs/rfc4151.html
        
             | TeMPOraL wrote:
             | Except you're encoding PII in the ID, which makes them
             | plainly visible to people who should not have access to
             | user data, and hard or impossible to change. Sure, I could
             | e.g. change my e-mail and the contact data would be
             | updated, but _you still have the old e-mail associated with
             | my account_ via ID. I 'm not sure this would fly under
             | GDPR.
        
               | brigandish wrote:
               | Erm, don't show the ID to people who don't need it.
               | 
               | Aside from that, it's not a violation of GDPR to keep
               | personal information (that they consented to you having)
               | in order to process business for that person. Using an
               | email address as a unique identifier is not a violation,
               | using it to spam them would be. If they're willing to
               | give you their current email why not an old one?
        
               | lolinder wrote:
               | > Erm, don't show the ID to people who don't need it.
               | 
               | How do you communicate with other people in your company
               | about a customer without sending around PII if the
               | customer's ID is PII?
               | 
               | Maybe we could create a field that uniquely identifies
               | the customer that isn't PII. Then that could be used to
               | uniquely identify a customer in places where we don't
               | want to expose their PII. But then... why not just use
               | this unique ID as the key?
        
               | brigandish wrote:
               | Do you often send the auto-incremented int (that would be
               | the default substitute to this) when communicating with
               | others? Then why would you send this?
               | 
               | It's so strange an argument. Right now you have my
               | username but not my email address, yet you can still
               | query the website database and get certain data that
               | you're allowed to see. There are so many ways to query a
               | particular user's data, and they would all depend on what
               | you're trying to do, needing the specific key would mean
               | you should have access to it anyway _and_ it could be
               | given on per case basis anyway.
        
               | lolinder wrote:
               | > Do you often send the auto-incremented int (that would
               | be the default substitute to this) when communicating
               | with others?
               | 
               | It's not an int, but yes, we have a unique synthetic
               | identifier that serves as the database PK and as a means
               | of communicating about a customer in insecure channels
               | without exposing PII. "Customer ID ### is having an issue
               | with such-and-such."
               | 
               | To turn your second part back around: why a natural key?
               | What is the function of minting a natural key if humans
               | are meant to use something else?
        
               | brigandish wrote:
               | > To turn your second part back around: why a natural
               | key? What is the function of minting a natural key if
               | humans are meant to use something else?
               | 
               | Because non-natural keys are unnecessary in the presence
               | of a natural key, and unnecessary things bring in
               | complexity.
               | 
               | > "Customer ID ### is having an issue with such-and-
               | such."
               | 
               | Then you need access to the customer's ID, but the devil
               | here is in the detail you didn't add, the such-and-such.
               | 
               | > communicating about a customer in insecure channels
               | 
               | Use secure channels...
        
               | lolinder wrote:
               | > Use secure channels...
               | 
               | When it comes to PII at my company, secure channels means
               | "encrypted email only". No Slack, no Jira, no chat in
               | video calls.
               | 
               | That's just not feasible for 100% of communications.
        
               | shkkmo wrote:
               | > Because non-natural keys are unnecessary in the
               | presence of a natural key, and unnecessary things bring
               | in complexity.
               | 
               | None of the things you've presented are actually
               | "natural" keys, they are pieces of information that
               | you've made assumptions about to shoehorn them into being
               | usable as a "natural key".
               | 
               | > Use secure channels...
               | 
               | No channel is perfectly secure. As channels become more
               | secure, they become harder to use and add complexity. The
               | more places you store customer data, the more risk you
               | create. The attempt to force semantic data to serve as
               | "natural key" has now added risk and complexity to your
               | entire communication infrastructure.
        
               | shkkmo wrote:
               | > Do you often send the auto-incremented int (that would
               | be the default substitute to this) when communicating
               | with others?
               | 
               | Frequently yes. It is extremely common to communicate
               | about specific records using the ID for that record. The
               | fact that this sort of behavior is extremely common is
               | pretty clearly indicated by the question itself.
               | 
               | > There are so many ways to query a particular user's
               | data, and they would all depend on what you're trying to
               | do, needing the specific key would mean you should have
               | access to it anyway
               | 
               | A responsible organization at scale with limit and log
               | access to customer data. I should be able to determine if
               | two people are talking about the same customer record
               | without needing access to that record's PII.
               | 
               | It is much better to have an artificial key that is
               | linked to this data. There is no upside to the natural
               | key and many, many downsides.
        
             | mewpmewp2 wrote:
             | But now then if you want to expose a detail page for that
             | user the id for identifying that page has to include all
             | this potentially personal information about them?
             | 
             | e.g. instead of mysocialmedia.com/users/2374927
             | 
             | you would be showing
             | 
             | mysocialmedia.com/users/email@example.com-2024-06-05-mysoci
             | almedia.com
             | 
             | Then exposing a lot of information that you may have not
             | wanted to expose.
        
               | sgarland wrote:
               | You don't have to use the PK as the URL slug. Even if you
               | want to route that way, you can have an internal ID and
               | external ID. This is one way to use something random like
               | a UUIDv4 for display without incurring the costs (at
               | least, some of them) of having it as a PK.
        
               | mewpmewp2 wrote:
               | And then if you want to list other entities to that user
               | you will have to start mapping the external id and
               | foreign relationships every time to external users?
               | 
               | And also if you are doing exception logging, for
               | ids/primary keys there's higher odds of them being logged
               | out, including your own logs and also external platforms.
               | 
               | It feels like having primary key set up like this just
               | will complicate everything unnecessarily for the future
               | including many edge cases that you don't foresee.
               | 
               | Just have the main ID not have any meaning.
               | 
               | It shouldn't contain information about the date, it
               | shouldn't be auto increment, it should really be just
               | random.
        
               | sgarland wrote:
               | The solution I outlined is the one GitLab and PlanetScale
               | both use internally, so it has been tested at scale and
               | works well, for both Postgres (the former) and MySQL (the
               | latter).
               | 
               | > It shouldn't contain information about the date, it
               | shouldn't be auto increment, it should really be just
               | random.
               | 
               | That's a great way to tank performance. You want your PK
               | to be k-sortable.
        
               | brigandish wrote:
               | > And then if you want to list other entities to that
               | user you will have to start mapping the external id and
               | foreign relationships every time to external users?
               | 
               | If we're talking about relational database engines,
               | that's what they do, relate things. One join statement is
               | much the same as another.
        
           | arnorhs wrote:
           | Another benefit of having stable identies / surrogate primary
           | keys is that any relations (FKs) will be much simpler.
           | 
           | Sure, like the post poster you replied to is pointing out,
           | you _can_ use natural keys, and then also relying on dates or
           | other parts of the data - but creating a relation for that
           | can end up being extremely cumbersome.
           | 
           | - Indexes generally become larger - relationships become
           | harder to define and maintain - Harder for other developers
           | to get up to speed on a project
        
           | ourmandave wrote:
           | _If I'm going natural PK, I make email the primary key._
           | 
           | Welcome to the Mr. Cooper mortgage provider website.
           | 
           | Your logon is your email and you can't change it.
           | 
           | If you used your cable provider email you're stuck with them
           | for the life of your 30 year mortgage.
        
             | bradleyankrom wrote:
             | Oh, also, we've been breached and your information is
             | available for purchase on the dark web. Fun!
        
           | nottorp wrote:
           | Funny you should say that...
           | 
           | I've been very slooowly degoogling myself, and that includes
           | changing all logins that have a gmail address to a different
           | non google email.
           | 
           | I'd say only like 1/3 of the sites I made logins for have the
           | option of changing the email.
        
           | twobitshifter wrote:
           | ON UPDATE CASCADE is not the nightmare that you are making it
           | out to be. (My impression from the article is that this is a
           | single SQL database being discussed.)
        
             | mewpmewp2 wrote:
             | > My impression from the article is that this is a single
             | SQL database being discussed.
             | 
             | Even if it's initially single, it's bad to assume that it
             | will be so forever and that you are not going to use third
             | party providers in the future.
             | 
             | How well does ON UPDATE CASCADE work if there's millions of
             | existing relations to that entity?
        
               | twobitshifter wrote:
               | YANGNI for 99% of projects and databases. When you get to
               | global sharded nosql etc. you need to use UUIDs for
               | anything and incrementing IDs falls over too.
        
               | mewpmewp2 wrote:
               | I'm using UUIds by default for everything. Main point
               | being that I don't have to worry about future
               | restrictions.
               | 
               | And incrementing IDs are also problematic yes, since they
               | hide business information data within them.
               | 
               | And I do think that I need it for much more than 1% of
               | projects and DBs.
        
               | sgarland wrote:
               | You'll have to worry about performance tanking instead.
               | If you're using UUIDv7 then less so, but it's still (at
               | best) 16 bytes, which is double that of even a BIGINT.
               | 
               | Anyone who says UUIDs aren't a problem hasn't dealt with
               | them at scale (or doesn't know what they're looking at,
               | and just upsizes the hardware).
        
               | chuckadams wrote:
               | Most databases with a UUID type store them as 128-bit
               | integers, typically the same as a BIGINT. It's not like
               | 378562875682765 is the bit representation of a bigint
               | either. And if you're not using uuidv7 or some other kind
               | of cluster-friendly id, you'd best be using a hash index,
               | and if you're doing neither, you probably don't care
               | about their size or performance anyway. You don't pick
               | UUIDs blindly, but on balance, they solve a lot more
               | problems than they cause.
        
               | sgarland wrote:
               | Postgres' UUID type is 16 bytes. MySQL can store them as
               | BINARY(16) once encoded. Conversely, a BIGINT for either
               | is 8 bytes. Not sure about SQL Server or Oracle.
               | 
               | > You don't pick UUIDs blindly, but on balance, they
               | solve a lot more problems than they cause.
               | 
               | IME, this is precisely the problem - devs choose them
               | blindly, because then you don't have to think about
               | proper modeling, you can arbitrarily create a key in your
               | app and be nearly guaranteed of its uniqueness, etc.
        
               | sgarland wrote:
               | > incrementing IDs falls over too
               | 
               | This is a myth. Planetscale [0] uses integers. They are
               | assuredly at scale.
               | 
               | As for auto-incrementing, there are plenty of solutions
               | for distributed systems.
               | 
               | [0]:
               | https://github.com/planetscale/discussion/discussions/366
        
             | sgarland wrote:
             | I was going to say; this is a perfect use case for a
             | cascading FK.
        
             | jval43 wrote:
             | It's never a single database in the real world. As soon as
             | you integrate something or have an API to something the
             | keys are out there. Unless you add a translation layer, but
             | then you could just as well use surrogate keys directly.
        
               | mst wrote:
               | Using a surrogate UUID for communicating with the outside
               | world is often very useful.
               | 
               | This is true for an internal PK that's an auto-inc id as
               | well as for natural keys, though.
               | 
               | Using a natural PK -inside- your own database can still
               | be a lot more pleasant to work with, even if you don't
               | let it escape.
        
               | shkkmo wrote:
               | > Using a natural PK -inside- your own database can still
               | be a lot more pleasant to work with
               | 
               | Until you need to do anything like described above. The
               | advantage of artificial keys is that they have no
               | semantic content. Anything with semantic content carries
               | the risk that the role that semantic content plays in
               | your system can change and cause problems. Having a non-
               | semantic identifier protects you from that.
               | 
               | This is not to say that you should never use a semantic
               | identifier as a key. However, you should always have a
               | non-semantic artificial key as the identifier and use the
               | semantic identifiers only when necessary.
        
             | yashap wrote:
             | Sure, if all data is in a single DB. But in the real world
             | you've generally got some/all of:
             | 
             | - 1 or more data warehouses
             | 
             | - Other services storing said data (e.g. the user id will
             | live in many databases in a service oriented architecture)
             | 
             | - External API integrators who have their own data stores
             | totally out of your control that also have copies of parts
             | of your data
             | 
             | - Job queues. It's common to have jobs scheduled for the
             | future on some other system (Redis, etc.) that say "do X
             | for user with id Y in Z days". If the "id" changes these
             | fail
             | 
             | - Caches. Say I cache data by user email instead of a
             | surrogate key, user changes their email, and another user
             | signs up with the old email. Cache hits for the wrong user!
             | 
             | - etc.
             | 
             | Changing the primary key becomes an absolute nightmare
             | project once data lives in multiple places like this, and
             | in my experience it mostly does.
             | 
             | Having truly stable identity for your entities just solves
             | so many future problems, and it's SO easy to do. In almost
             | all cases, natural PKs are really all downside, virtually
             | zero upside, except slightly less storage.
        
           | bioneuralnet wrote:
           | My first job in the late 2000's was at a small university
           | with a home-grown ERP system originally written in the 80s
           | (Informix-4GL). Student records, employee records,
           | financials, asset tracking - everything. It used natural
           | compound keys.
           | 
           | Even worse than the verbose, repetitive, and error-prone
           | conditions/joins was the few times when something big in the
           | schema changed, requiring a new column be added to the
           | compound key. We'd have to trawl through the codebase and add
           | the new column to every query condition/join that used the
           | compound key. It sucked.
        
         | aswerty wrote:
         | Surrogate keys do mirror reality though. As I once read in a
         | Terry Pratchett book; if you replace the handle of an axe and
         | then replace the head, is it still the same axe?
         | 
         | For me, the answer is yes - since we imbue the axe with an
         | identity outside of it's integral parts.
         | 
         | That is what a surrogate key is. An identity. Which is an
         | abstract concept that exists in the real world.
         | 
         | And to pile on. The top comment is bad advice! Surrogate keys
         | provide sanity - god save you if you have to work in a database
         | solely using natural keys.
        
           | vintermann wrote:
           | Yes. Ultimately, "are these the same" and "are these
           | different" are philosophical questions. Or to be more
           | precise, _teleological_ questions.
           | 
           | Because those questions have no meaning except with an "for
           | our purposes here" added. And it's up to us to decide what we
           | care about, if anything.
           | 
           | If we care about what other people want with the data though,
           | or suspect that our own wants might not be set in stone, then
           | we should also care to model identity independently in the
           | system (that is, use surrogate keys).
        
           | fuzztester wrote:
           | >As I once read in a Terry Pratchett book; if you replace the
           | handle of an axe and then replace the head, is it still the
           | same axe?
           | 
           | Yes and no. It is the Axe of Theseus ;)
           | 
           | https://en.m.wikipedia.org/wiki/Ship_of_Theseus
        
             | aswerty wrote:
             | Thanks - I think somewhere in the back of my mind I was
             | also aware of the Ship. But Pratchett is too good. The
             | quote:
             | 
             | > This, milord, is my family's axe. We have owned it for
             | almost nine hundred years, see. Of course, sometimes it
             | needed a new blade. And sometimes it has required a new
             | handle, new designs on the metalwork, a little refreshing
             | of the ornamentation . . . but is this not the... axe of my
             | family?
        
             | mgkimsal wrote:
             | Or Trigger's Broom:
             | 
             | https://www.youtube.com/watch?v=BUl6PooveJE
        
           | akira2501 wrote:
           | > since we imbue the axe with an identity outside of it's
           | integral parts.
           | 
           | Typically for the purposes of ownership. So it's really part
           | of a a hierarchical identity scheme.
        
         | groestl wrote:
         | Adding to the list of comments damning this post to ensure none
         | of my future colleagues follow this advice.
         | 
         | > You think your surrogate key will save you? It will not.
         | 
         | It definitely will, say my 20+ years of experience.
        
         | globular-toast wrote:
         | How do you deal with the Ship of Theseus/Trigger's broom?
         | There's literally nothing that defines said object apart from
         | its _history_.
        
           | zer00eyz wrote:
           | Your database starts to look like git...
           | 
           | Think about a jet engine. Let's say you figure out that a
           | part is defective and will cause a failure. You want to
           | identify every plane that has one of those parts in it. What
           | if you find that some of them had a bad oil pump that
           | shortened the life of some bearings. You want to know every
           | engine that had one of those pumps, so you can replace other
           | parts.
           | 
           | I dont know if they do this with jets but there are quite a
           | few places that take thee concept much further than this.
        
             | smaudet wrote:
             | > Your database starts to look like git...
             | 
             | No, your table archives start to look like git. Which is
             | not a bad thing, version control on data in a database is
             | very difficult, if not impractical to realize, I would take
             | any history that I can get...
             | 
             | "Synthetic" keys are just that, keys for tables that only
             | have meaning within the context of that data. You can of
             | course relate your records to as many other tables as you
             | wish, but you identify records via those keys, nothing
             | more.
        
         | RedShift1 wrote:
         | Every time I used a natural key I have to come to regret it.
        
         | th3byrdm4n wrote:
         | When I integrate systems, I use that system's natural key (love
         | it when it's a unique ID, but in the systems I work in - it
         | almost never is).
         | 
         | That said, I use that natural key as the "link" to my
         | internally managed, normalized database.
         | 
         | There's nothing that says I cannot add unique identifiers that
         | would replicate the natural key. In fact, that's good design.
        
         | hyperman1 wrote:
         | In my experience, this won't end well. Some examples:
         | 
         | Belgium has the RNR/INSZ identifying each person. But it can
         | change for a lot of reasons: It gets reused if someone dies. It
         | encodes birth date, sex, asylum state, so if something changes
         | (which happens about every day), you need to adapt your unique
         | key.
         | 
         | Belgium also has a number identifying medical organizations.
         | Until they ran out of numbers. Then someone decided to change
         | the checksum algorithm, so the same number with a different
         | checksum meant a different organizations. And of course they
         | encode things in the number and reuse them, so the number isn't
         | stable.
         | 
         | An internal COBOL system had a number as unique key, and this
         | being COBOL, they also ran out of numbers. This being COBOL, it
         | was more easy to put characters in the fixed width record than
         | expand it. And this being French COBOL, that character can have
         | accents. So everyone using the 'number' as unique key now had
         | to change their datatype to text and add unicode normalization.
         | Not fun.
         | 
         | In my experience: don't use anything as an ID that you didn't
         | generate yourself. Make it an integer or UUID. Do not put any
         | meaning in the ID. Then add a table (internal ID, registering
         | entity, start date, end date or null, their key as text). You
         | 'll still sometimes have duplicates and external ID updates as
         | the real world is messy, but at least you have a chance to fix
         | them. The overhead of that 1 lookup is negligable on any scale.
        
         | zer00eyz wrote:
         | > Model your data on the real world. Do not depend on spherical
         | horses.
         | 
         | Yes, and normalize it.
         | https://en.wikipedia.org/wiki/Database_normalization
         | 
         | > Adding a synthetic key only means you have to track all
         | three. Plus you have to generate a meaningless number, which is
         | actually a choke point in your data throughput.
         | 
         | This is true up to a point. You can add more data to the system
         | to continue to generate natural, composite keys. However at
         | some point you move from a database to an event stream, or you
         | have to track events that aren't really needed for what your
         | doing...
         | 
         | Denormalization then takes precedence and a generated key makes
         | sense again. https://en.wikipedia.org/wiki/Denormalization
         | 
         | > how will the surrogate key protect you?
         | 
         | It isnt about protection, it's about not collecting the natural
         | data to identify the event that caused the issue. Its
         | denormalization by omission in effect.
        
         | somerandomqaguy wrote:
         | >The natural key forces you to think about what makes the row
         | unique. What identifies it. Sometimes, it makes you go back to
         | the SME and ask them what they mean. Sometimes it makes you
         | reconsider time: it's unique now, but does it change over time,
         | and does the database need to capture that? In short, what are
         | the boundaries of the Closed World Assumption? You need to know
         | that too, to answer any "not exists" question.
         | 
         | Not really because your natural ID has to also account for the
         | problem of garbage data and account for the SME's not actually
         | being experts. And I can give a real world example of this
         | happening; the Canadian Long Gun registry.
         | 
         | For anyone that doesn't know, prior to the early mid 90's or so
         | Canadian gun laws only required registration of pistols. I
         | might be incorrectly remembering but IIRC the records were not
         | handled at a national level either but I could be wrong. Around
         | that time new laws were introduced that among other things
         | required registration.
         | 
         | Most guns by then had serial numbers so all you had to do was
         | tie a gun to a serial number with some characteristics and
         | voila, you've got a natural identifier, right? That's what the
         | experts say.
         | 
         | Well as it turns out, reality isn't quite so kind. A lot of
         | firearms makers such Cooey from around the 1940's to the 1970's
         | produced guns without a serial number. In other cases the
         | serial number was present from the factory but were damaged, or
         | the part that had thee number had been replaced without
         | replacing the number or had been replaced with a wrong number.
         | In rare cases the serial number from the factory was wrong
         | because of a mistake when the worker manually stamped in the
         | number.
         | 
         | So already the idea to uniquely identify using some sort of
         | simple classifier was already flawed. They attempted to solve
         | the issue of guns without serial numbers but those stickers
         | were cheaply made and readily fell off, and owners that were
         | already peeved about the program to not bother with trying to
         | paperwork correct.
         | 
         | Which segways to the next problem. There was an extremely high
         | rate of errors in registration forms being submitted. The most
         | famous example I'm aware of was someone registering a Black and
         | Decker soldering gun as a firearm, something he had done in
         | protest. As humorous as it was, it the revelation that a
         | soldering gun had been classified as a firearm unveiled another
         | fundamental problem.
         | 
         | The error rate was so high, and the pressure to show progress
         | so great, that the someone in leadership (I can't remember if
         | it was the government or RCMP) directed the data entry clerks
         | to just plug the data in with no validation as is. Didn't
         | matter if the data was wrong, or made no sense, or contradicted
         | other entries already in the database. The intent being to just
         | get all the data in as is so that they could fix it later. So
         | all that wrong information? Got pushed straight into the
         | database.
         | 
         | Like I said, this was real world mess that occurred from 1995
         | until 2012 when a new government dropped the requirement for
         | non restricted firearms to be registered with little fanfare
         | and only squeaking protests.
         | 
         | It's not to say that you shouldn't think about a 'natural key'
         | persay. But the problem assuming that there is a 'natural key'
         | requires that you or your subject matter expert is actually an
         | expert that can identify a good enough model for that to exist.
         | 
         | But what happens when your SME is just plain wrong and it in
         | turn introduces fundamental flaws in your model? Or outside
         | influences forces garbage data in? How is a database designed
         | to model only the real world supposed to cope with that?
        
         | ratherbefuddled wrote:
         | > how will the surrogate key protect you? It will not.
         | 
         | Yes it will. Your changes will be confined to only the table(s)
         | where the natural key is present, not spread across every table
         | where there's a foreign key.
         | 
         | Of course you will still have to deal with the reality that the
         | natural key is now not unique, and model reality, but your
         | implementation work in doing so is far simpler.
         | 
         | In more years than I care to count I've regretted someone using
         | natural keys as a primary key for a table many times, and
         | surrogates never.
        
         | wruza wrote:
         | In short, you advise us to foresee the future, explore unknown
         | unknowns and expect high-precision true answers from the
         | outside. Good advice, not for this universe. You can only get
         | false negatives in this one.
         | 
         | A synthetic key means "we think exists". There exists a
         | contract, a medical record, a person, in a real world, in our
         | opinion. We record these existence identities into our model by
         | using an always-unique key. Then there's a date, an appointment
         | #, a name, etc. You can refer to an entity by its identity, or
         | search by its attributes. If you use searches in place of
         | identity references, you get non-singletons eventually and your
         | singleton-based logic breaks.
        
           | datadrivenangel wrote:
           | This is why it's hard to be a DBA... Everyone thinks you're a
           | Cassandra user the way other developers ignore your
           | prophecies.
        
         | DrScientist wrote:
         | > The natural key forces you to think about what makes the row
         | unique. What identifies it.
         | 
         | When designing a table - you should always be clear about what
         | the natural key is and make sure it has uniqueness constraint.
         | Mindlessly having a surrogate key without thinking about what
         | the natural key is, is an anti-pattern. So totally agree here.
         | 
         | That doesn't stop you also having a surrogate key though.
         | 
         | Another aspect of natural versus surrogate keys is joins as the
         | key often ends up in both tables.
         | 
         | Using natural keys can mean in some circumstances you can avoid
         | a join to get the information you want - as it's replicated
         | between tables.
         | 
         | There is also the question of whether you surface the surrogate
         | key in the application layer or not - some of the problems of
         | surrogate keys can be avoided by keeping them contained within
         | the app server logic and not part of the UI.
         | 
         | So via the UI - you'll search by car registration number, not
         | surrogate key, but in terms of your database schema you don't
         | join the tables on registration number - you use a surrogate
         | key to make registration numbers easier to update.
        
         | thefz wrote:
         | Been vehemently against surrogates my whole life, glad to find
         | a kindred soul.
        
         | imtringued wrote:
         | >Plus you have to generate a meaningless number, which is
         | actually a choke point in your data throughput.
         | 
         | No it isn't. Working with natural keys in general involves
         | using compound primary keys since it is unlikely that any lone
         | field is suitable as primary key. Comparing an integer is
         | quick. Comparing three string fields in a join is not.
        
         | abujazar wrote:
         | Natural keys can change, and synthetic keys never have to. That
         | alone is reason enough to use synthetic keys.
         | 
         | Performance is another major argument for synthetic keys, as
         | they are can be made sequential, which is rarely the case for
         | natural keys.
        
         | IlliOnato wrote:
         | Sorry, by your logic an ISSN would be a good key for a database
         | of scientific journals. It's exactly what ISSN is invented for!
         | Right? Right?
         | 
         | Been there, done that. Journals that changed their names (and
         | identities) but not ISSN. That changed the ISSN but not the
         | name/identity. Journal mergers which instead of obtaining a new
         | ISSN kept one of the old ones. "Predatory journals" that
         | "borrow" an ISSN (you may not consider them real journals, but
         | you've got to track them anyway, even if only to keep them from
         | being added to the "main" database). The list may go on and on.
         | 
         | And don't even start me on using even more natural ID, the
         | journal name, perhaps in combination with some other pieces of
         | data, like year the publication started, country of origin,
         | language, etc... Any scheme based on this will need to have
         | caveats after caveats.
         | 
         | (A fun fact: there were journals that ceased publication but
         | later on "returned from the dead". Such resurrected journals
         | are supposed to be new journals and to get a new ISSN.
         | Sometimes this rule is followed...)
         | 
         | At the end, a "meaningless number" you assign yourself is the
         | only ID that reliably works (in combination with fields
         | representing relationships between journals).
         | 
         | The problem with keys that "have meaning" is that they appear
         | to carry information about your entity. And in vast majority of
         | cases this is correct information! So it's almost impossible to
         | resist "extracting" this information from a key without doing
         | actual database lookup at least mentally, and often in your
         | software too. Hidden assumptions like this lead to bugs that
         | are really hard to eliminate. A meaningless number on the other
         | hand does not tempt one :-)
        
         | sgarland wrote:
         | Your post brings up a critical difference I've noticed when
         | working with devs (I'm a DBRE): those who actually do rigorous
         | data modeling, and those who view it as an annoyance impeding
         | their project.
         | 
         | Spend time modeling your schema. Ask yourself, "does every
         | attribute in this table directly relate to the primary key? And
         | is every attribute reliant upon the primary key?" Those two
         | alone will get you most of the way through normalization.
        
         | indoordin0saur wrote:
         | I agree with both the original author's thoughts and yours.
         | Surrogate keys are generally necessary but also overrated.
         | Natural keys will always remain important. I think it's
         | important to keep natural keys as reliable as possible whenever
         | you control the system. Creating a new surrogate key every time
         | the data moves creates it's own headache as you have to keep
         | track of more and more keys referring to the same concept.
        
       | mythas wrote:
       | Another example that happens surprisingly often in healthcare. A
       | registration clerk will incorrectly enter a personal health
       | number (PHN) into the system. Then the actuall holder of that PHN
       | shows up. If this were the PK then the system just wouldn't
       | handle this case and the reg clerk would have a huge mess to sort
       | out on the spot. A surrogate key on the Person table allows this
       | registration to be made where 2 people have the same PHN in the
       | system. Then cleanup can be handled after the fact to track down
       | the first person, determine their correct PHN and update the
       | record.
        
         | jen20 wrote:
         | This sounds like it should probably be a workflow instead.
         | Modelling the intent here is actually important.
        
           | mythas wrote:
           | Care to expand on this thought? Curious what you have in
           | mind!
        
             | ttyprintk wrote:
             | (not parent)
             | 
             | Think about these different intentions, each deserving an
             | audit trail with different attached metadata:
             | 
             | - brand new MRN for a newborn. The system should be able to
             | provide info on the mother, at all points of care for the
             | newborn.
             | 
             | - unknown person arrives in the emergency department
             | unconscious. You'll see imaging named John Doe in this
             | case. The system should be able to retrieve info if and
             | when identity is established.
             | 
             | - the machine-readable bracelet given at admission needs to
             | be replaced. This is really two different cases: you intend
             | to create a duplicate bracelet, or to correct an error in
             | which all current info in the system needs to abruptly
             | update.
        
         | ethbr1 wrote:
         | One of the most surprising things I learned about the US
         | healthcare system is how often everything is mutating or being
         | retroactively updated (assume it applies to other countries
         | too).
         | 
         | Things you'd think would be constant after Step 1 often aren't,
         | and processes are tolerant of their being corrected / re-
         | entered after Step 52 has already been completed.
        
         | romdev wrote:
         | Finding and eliminating duplicates is a very common software
         | problem that is rarely solved in a reusable, user-friendly way
         | that preserves history while eliminating redundant data. In
         | fact, in 40 years of working with computers I can't think of a
         | single UI that I'd want to emulate.
        
           | Terr_ wrote:
           | Ultimately the key is the business-modeling that captures
           | "duplicates" as Things That Happen.
           | 
           | That's the precondition for any sane UI, and sometimes it's
           | not even obvious because the "duplication" has been
           | transformed, reified into its own concept.
        
       | Kon-Peki wrote:
       | > how about a personal identification number? In Denmark we have
       | the CPR number, and I understand that the US Social Security
       | Number is vaguely analogous.
       | 
       | The US SSN is not guaranteed to be unique, the SSN assigned to a
       | person could change, there is no guarantee that a person with an
       | SSN assigned to them is a US citizen, and there is no guarantee
       | that a US citizen has an SSN - they must be requested, and you
       | don't need one unless you do something that requires having one.
       | There are also things called ITINs and ATINs that look like an
       | SSN but are not, yet can be used in place of an SSN in a huge
       | range of SSN-required situations!
       | 
       | (Please don't use the SSN as a database key!)
        
         | selcuka wrote:
         | > The US SSN is not guaranteed to be unique
         | 
         | The cases you listed do _not_ mean SSNs are not unique, unless
         | there are people who share the same SSN. You can still define a
         | unique index for the SSN column. A column can be both nullable
         | and unique as each null is different in SQL.
        
           | ars wrote:
           | > unless there are people who share the same SSN
           | 
           | That happens all the time. First of all people steal SSN's
           | and use them (and you are not the police, so it's not your
           | responsibility to do anything about that). Second people make
           | up fake SSN's because they don't want to give you their SSN.
           | 
           | People also make typo's, and you can end up with the same
           | SSN.
           | 
           | An SSN is not unique in the real world.
        
           | Kon-Peki wrote:
           | EDIT - SSA claims that numbers are not recycled. But there
           | are known cases where the same number has been assigned to
           | multiple people.
           | 
           | Note that in less than 100 years, more than half of all
           | possible SSNs have already been used...
        
             | bloopernova wrote:
             | So we need SSNv6? ;)
        
           | dqv wrote:
           | > It's not as uncommon as you might think. In fact, some 40
           | million SSNs are associated with multiple people, according
           | to a 2010 study by ID Analytics.
           | 
           | https://www.pcworld.com/article/424392/a-tale-of-two-
           | women-s...
        
             | pseudosavant wrote:
             | Funny enough, I used to work at IDA.
             | 
             | I'll add that there is a huge difference between the SSN
             | database that the Social Security Administration maintains,
             | and the list of SSNs that have been associated with a
             | person. Especially because it is very common to change a
             | single digit of your SSN when performing credit fraud -
             | because they've already burned their real one. Some people
             | will have dozens of SSNs attached to them.
             | 
             | IDA was very good at determining who a person is through
             | the graphs that represent identities in our world (names,
             | DOBs, phone numbers, addresses, SSNs, etc.)
        
           | cmiles74 wrote:
           | People also type them in wrong, so there's another thing:
           | where is the data coming from.
        
             | Tangurena2 wrote:
             | > _The most misused SSN of all time was (078-05-1120). In
             | 1938, wallet manufacturer the E. H. Ferree company in
             | Lockport, New York decided to promote its product by
             | showing how a Social Security card would fit into its
             | wallets. A sample card, used for display purposes, was
             | inserted in each wallet. Company Vice President and
             | Treasurer Douglas Patterson thought it would be a clever
             | idea to use the actual SSN of his secretary, Mrs. Hilda
             | Schrader Whitcher._
             | 
             | > _The wallet was sold by Woolworth stores and other
             | department stores all over the country. Even though the
             | card was only half the size of a real card, was printed all
             | in red, and had the word "specimen" written across the
             | face, many purchasers of the wallet adopted the SSN as
             | their own. In the peak year of 1943, 5,755 people were
             | using Hilda's number. SSA acted to eliminate the problem by
             | voiding the number and publicizing that it was incorrect to
             | use it. (Mrs. Whitcher was given a new number.) However,
             | the number continued to be used for many years. In all,
             | over 40,000 people reported this as their SSN. As late as
             | 1977, 12 people were found to still be using the SSN
             | "issued by Woolworth."_
             | 
             | https://www.ssa.gov/history/ssn/misused.html
             | 
             | As late as the 1970s, the first 3 digits of your SSN told
             | what office issued your number to you. And the next 2
             | digits told workers at that office what filing cabinet held
             | your application.
             | 
             | https://www.ssa.gov/policy/docs/ssb/v45n11/v45n11p29.pdf
        
         | otherme123 wrote:
         | In Spain we have the DNI number, that a lot of people asume is
         | unique, even database designers that use is as a natural key.
         | 
         | Turns out the DNI can have, and actually have, a lot of
         | duplicates. The police has a page explaining it
         | (https://citapreviadnipasaporte.es/dni/dni-duplicados-espana/),
         | and how it's not a primary key in their databases, but a number
         | entered manually from a pool of possible numbers. And number
         | re-using is a possibility. They estimate the number of
         | duplicates in 200,000 for a population of 50,000,000.
         | 
         | The point is that if you asume DNIs are unique and use them as
         | PK your database is exposed to the bad design of the DNI
         | database. There are some stores that use the DNI as the
         | "unique" identifier for fidelity cards.
        
           | prmoustache wrote:
           | I've seen banks or insurers use DNI as user login.
        
             | benhurmarcel wrote:
             | All of them do I think. But from experience, they can
             | change it for your account.
        
             | SoftTalker wrote:
             | The user login (hopefully) is not the internal primary key
             | for the user. It should be unique at a given point in time,
             | obviously, but certainly there are reasons it might need to
             | change.
        
             | Piskvorrr wrote:
             | And I've been bitten by this design numerous times. The
             | most common bug: literally the birthday paradox, i.e.
             | "these two people have been assigned the same number, and
             | now we need to distinguish them in our database".
        
           | benhurmarcel wrote:
           | Also your DNI can change. Typically foreigners get a NIE
           | (used for the same thing but a different format), and get a
           | new DNI if they ever get Spanish nationality.
        
           | steinsgatezero wrote:
           | I remember having to deal with this and other identifiers
           | like NIF and NIE while working on the academic titles
           | homologation platform for the MCIN. I didn't understand why
           | it wasn't used as, not necessarily PK, but an identifier
           | logging in. Thanks for letting me know that the hellish time
           | spent integrating Cl@ve wasn't in vain.
        
         | delusional wrote:
         | We have the same problem in Denmark, most people just don't
         | realize it. At my dayjob we get at least one person every year
         | who changes gender and consequently gets a new SSN (the final
         | digit is supposed to signify gender). Most people don't store
         | SSNs so they never realize, but it does happen fairly
         | frequently.
        
           | winternewt wrote:
           | If Denmark is anything like Sweden there's also:
           | 
           | - SSN from different ID space gets assigned to immigrants;
           | when they become citizens they are assigned a new permanent
           | SSN.
           | 
           | - SSN:s have a long and a short form; the short form which
           | cuts off century information can be the same for someone who
           | is 5 years old and someone who is 105 years old.
           | 
           | - When an unconscious patient comes in to the E.R. you don't
           | know their SSN, so a temporary one is assigned for use in
           | patient records. Such temporary SSN:s are not coordinated
           | nation-wide so multiple patients may have the same SSN. In
           | some hospitals they don't even have a local standard for
           | ID:s. The staff just makes something up on the spot. It
           | happens that the SSN they made up collides with a valid SSN
           | for another person.
        
             | kassner wrote:
             | Also: the personnummer carries a date that often means
             | birth date, but there are cases where it's not, but I've
             | seen a few system that just assumes it's the same.
             | 
             | > SSN from different ID space gets assigned to immigrants;
             | when they become citizens they are assigned a new permanent
             | SSN
             | 
             | Having gone through that, my personummer didn't change.
             | Maybe that doesn't happen anymore?
        
               | tapland wrote:
               | It does, but if you already have a Personnummer or get a
               | residence permit right away so that you are eligible for
               | one you don't get the temporary Samordningsnummer.
        
               | kalleboo wrote:
               | And the samordningsnummer isn't only for immigrants -
               | it's also for Swedes born abroad who have never been
               | folkbokford.
        
               | tapland wrote:
               | They're still still an immigrant. The numbers are for
               | residents (at some point in time) and citizenship isn't
               | reflected by it.
        
               | kalleboo wrote:
               | Interesting, I've never heard of anyone referring to a
               | citizen as an immigrant, but looking up the legal
               | definition, they certainly are. TIL.
        
             | mrweasel wrote:
             | Immigrants and refugees will get a "replacement CPR
             | number", which is outside the normal space/range of CPR
             | numbers. Once registered as living in Denmark, they'll get
             | a real CPR number.
             | 
             | Danish SSNs doesn't have a long or short form, you used the
             | seventh digit to do a table lookup to see if the person is
             | born in 18XY, 19XY or 20XY. The date of birth is always
             | ddmmyy, there is no long form. So if the seventh digit is
             | 9, and the 5-6 digit is between 00 and 36, then you're born
             | in between 2000 and 2036, if the 5-6 digits are 37 to 99,
             | then your born in the 1900. But you need the published
             | table to figure that out.
             | 
             | Last point, there is backup system for unconscious
             | patients, but it should be the same across all medical
             | records as these are somewhat standardized.
        
             | delusional wrote:
             | > - SSN:s have a long and a short form; the short form
             | which cuts off century information can be the same for
             | someone who is 5 years old and someone who is 105 years
             | old.
             | 
             | We don't do that. instead, we shove that extra bit of
             | information into the digit following the last two digits
             | through a table: https://da.wikipedia.org/wiki/CPR-
             | nummer#Under_eller_over_10...
        
           | rsynnott wrote:
           | In Ireland, until the 90s, if a woman got married, she gave
           | up her PPS number (essentially a social security number) and
           | took her husband's, with a 'W' appended (this fact tells you
           | a lot about pre-90s Ireland...) If she subsequently divorced
           | or the husband died, she got a _new_ PPS number.
           | 
           | While this was abolished in the 90s, _some people still have
           | these 'W' numbers_, ticking timebombs for anyone relying on
           | them as a key.
        
         | mikedelfino wrote:
         | How to uniquely identify an American citizen?
        
           | hprotagonist wrote:
           | ha ha only serious; ask them for their papers and if they say
           | "who the fuck are you?", they're americans!
           | 
           | a slightly more serious answer: you largely don't. The US
           | doesn't have a national ID, proof of birth is not even close
           | to standardized, etc.
        
           | Longhanks wrote:
           | How about not doing that?
        
             | mikedelfino wrote:
             | Then I guess it could lead to duplicate records. Now,
             | whether this is a problem or not depends on the business
             | and how much one cares about data integrity. I work with
             | higher education, so uniquely identifying students and
             | keeping all their records organized is somewhat important.
             | Granted, I don't work in the US, and here we have a unique,
             | national number. So this is covered, except for foreign
             | students. I was curious how this problem is solved in other
             | countries.
        
               | Kon-Peki wrote:
               | The US government doesn't assume that it can uniquely
               | identify a citizen with 100% certainty. When things get
               | tricky, we rely on the judicial system to weigh the
               | evidence and make a decision. Which could later get
               | changed.
               | 
               | If you want to design your system to assume that you can
               | do it, that's your problem. Literally, it is now _your_
               | problem, and nobody is going to step in to help.
        
           | j16sdiz wrote:
           | You can't.
           | 
           | That's why everybody use SSN.
        
           | vharuck wrote:
           | The population health database I work with uses generated IDs
           | for people and relies on health systems to link records
           | before submission (they put a lot of effort into data used
           | for billing). But we do check for problems by looking at
           | duplicates of SSN with date of birth. We don't consider
           | names, because they could be transcribed differently. Even
           | though the SSN+DOB pair should be unique, there have been
           | cases where a widow provides her deceased husband's SSN.
           | Likely because she used it for Social Security benefits and
           | forgot her own long ago.
        
           | polygotdomain wrote:
           | To give a slightly more technical answer, at a large
           | insurance company I used to work for, the legal department
           | had provided definitions on what conditions we would consider
           | various components of PII a match. So between SSN, DoB, First
           | Name, Last Name, and a couple others, there were potential
           | combinations that our system would say, "yes this is the same
           | person". Note that we didn't necessarily need exact matches
           | on things like names, but "close enough" matches were
           | sometimes sufficient.
        
           | gwbas1c wrote:
           | You can get kind-of close with driver's license numbers. They
           | will change when US citizens move among states, (and we do
           | that quite frequently.)
           | 
           | (People who don't drive have to get state-issued IDs that
           | have the same number. At least, they have to if they want to
           | buy do anything that requires proof of identification.)
        
           | Tangurena2 wrote:
           | You cannot.
           | 
           | I used to work for one state's Department of Motor Vehicles.
           | 
           | Part of why REAL ID is such a charley foxtrot is because
           | there exists no such identifier and since every state counts
           | approximately as a sort-of-almost sovereign nation (this is
           | basically what the 10th Amendment guarantees), there is and
           | can be no possible nationwide citizen identifier.
           | 
           | There are about 3500 different counties in the US. All of
           | which may or may not issue their own birth certificates.
           | States are supposed to do that now.
        
         | pradn wrote:
         | Pernicious assumptions!
         | 
         | Google Cloud projects have three attributes: user-friendly
         | names, system numbers, and system names. System names are
         | alphanumeric. They can be chosen by the user, derived from the
         | friendly name if there's no collision.
         | 
         | But! There's some system names from the olden days that are
         | actually all numbers - so not actually alpha-and-numeric.
         | Thankfully we don't run into those often.
        
         | ajuc wrote:
         | Yeah, these things are almost never as simple as they are
         | supposed to be.
         | 
         | Poland has PESEL numbers since 70s. It was supposed to be
         | unique, only apply to Polish citizens, never change, and have a
         | checksum digit. Every Polish citizen gets one at 18 when they
         | get their national ID document, and you can request it earlier
         | if you want to.
         | 
         | Turns out there are duplicated PESEL numbers. A LOT of non-
         | Polish citizens have them assigned (mostly Ukrainian refuges
         | but not only). The checksums are sometimes wrong. And some
         | people have several PESEL numbers.
         | 
         | If you used PESEL as database key you're fucked.
         | 
         | The system works perfectly, but it interfaces with external
         | world through computer-human-paper-human-computer interface.
         | And at some point the mistake propagates so far that it becomes
         | the truth assumptions be damned.
        
       | rhelz wrote:
       | Natural keys are (quite literally) essential to defining
       | entities. Quine had a great slogan for this: "No entity without
       | identity!" Natural keys are how you determine the identity of an
       | object, which is to say, if you have two different referring
       | expressions, how can you tell whether they are referring to _the
       | same_ object, or different objects?
       | 
       | Suppose you take the advice of this article, and use, say, social
       | security numbers to identify people. (Let's ignore the fact that
       | this only works for the U.S.) Suppose one fine day, somebody
       | tries to enter in a record about a new person--but some typo has
       | happened somewhere, and the new person's SSN clashes with
       | somebody's SSN who is already in the system.
       | 
       | Sure, the database will notify you that something went wrong. But
       | how do you know which social security number is correct, and
       | which is incorrect? You will have to find a set of fields which
       | uniquely identifies each person _ANYWAYS_. I.e. you 'll have to
       | differentiate them by name, birthday, place of birth, etc etc.
       | 
       | Now even worse!!! What if somebody attempts to add the same
       | person TWICE to your database, but mistypes their social security
       | number. Now your database can't even tell you that something has
       | gone wrong. It will happily record duplicate or contradictory
       | information about the same person--and in order to resolve the
       | mess, again, you have to find out what uniquely identifies the
       | people _ANYWAYS_.
       | 
       | Now, even worser than worse--what if you have taken the advice of
       | this article, and you _haven 't_ bothered to identify a set of
       | fields which are genuinely unique to each person. You just have a
       | social security number and a name. How are you going to even
       | going to correct the fact that John Smith is in your system
       | twice, when you have ten John Smiths? How can you possibly tell
       | which two John Smiths are the same John Smith?
       | 
       | Yeah, it take some time and careful thinking to properly come up
       | with natural keys for your entities. But unless you do, you
       | haven't actually specified your entities at all. This is one area
       | where long years of experience really pay off. Expert data
       | modelers spend years and decades honing their craft, observing
       | the work of others, etc etc. Eventually they acquire the wisdom
       | needed to be able to know what kind of information is really
       | needed to nail down what kind of entity the database needs to
       | know about.
        
         | yuriks wrote:
         | > Suppose you take the advice of this article, and use, say,
         | social security numbers to identify people.
         | 
         | You seen to have misunderstood the point of the article: the
         | author is recommending NOT using the SSN (a natural key) for
         | primary keys, and instead to use an artificial, automatically
         | generated key, so that the SSN is decoupled from the record and
         | can potentially be updated.
        
           | rhelz wrote:
           | But SSN's _are_ artificial, automatically generated keys.
           | They are not natural keys, they cannot be natural keys for
           | persons, because not every person has one.
           | 
           | A natural key is a set of attributes which an entity has even
           | if it is _not_ in your database.
        
       | iblaine wrote:
       | It's funny that indexing used to be unavailable when processing
       | big data on Hadoop. Now that we have options to do so we now
       | care. As tech evolves we adapt.
        
       | wavemode wrote:
       | Another massive annoyance with natural keys - privacy. If your
       | table's primary key contains personal information, that PII now
       | infects every other table that holds a foreign key to that table.
        
         | hehdhdjehehegwv wrote:
         | Extremely good point.
        
         | whyever wrote:
         | AFAIK, even using a UUID is still considered PII if it uniquely
         | identifies the user.
        
         | gregw2 wrote:
         | Thank you for your comment. It spurred me to think about this
         | issue more thoroughly in a way I hadn't, even though this
         | comment may disagree -- to some extent -- with your
         | perspective.
         | 
         | I have not seen clear guidelines about whether an
         | organization's surrogate keys for persons are considered PII.
         | (And this ambiguity has frustrated me for some time as I am
         | unclear whether to take an aggressive or conservative view on
         | labelling PII where I work.) When I have read the guidelines,
         | it seems ambiguous but on balance I think it disagrees with
         | your implied claim (that PII is not present in a table that
         | uses a surrogate foreign key to a person/user.)
         | 
         | The definition of PII per NIST
         | https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpubli...
         | is:
         | 
         | "Any information about an individual maintained by an agency,
         | including (1) any information that can be used to distinguish
         | or trace an individual's identity, such as name, social
         | security number, date and place of birth, mother's maiden name,
         | or biometric records; and (2) any other information that is
         | linked or linkable to an individual, such as medical,
         | educational, financial, and employment information."
         | 
         | A surrogate key associated with a person is arguably "any other
         | information that is linked or linkable to an individual"
         | meaning that all tables containing the surrogate key remain PII
         | and remain "infected". It's true the surrogate key only allows
         | linkability within the context of the data ecosystem in which
         | it resides, but such distinctions (of "internal" to the system
         | vs "external from the system") are not made in the language of
         | the definition. Additionally, from a pure risk and PII
         | disclosure impact, usually the whole database gets dumped, not
         | just the "non-person" tables. If you have financial/medical
         | transactions in table "A" and a personal numeric ID linking
         | data to a person table "B", both tables contain PII, right?
         | 
         | From a privacy standpoint, if you can SQL JOIN the data to
         | trace the person involved either within your system or even
         | data reasonably obtainable outside your system, (or if an
         | attacker can), it's PII.
         | 
         | Intranet IP addresses are called "linked PII" in section 3.2.2
         | of the above NIST guidelines for example, and NIST does define
         | some related terms that would seem to apply to surrogate keys
         | like: * Distinguishable Information: Information that can be
         | used to identify an individual. * Linkable Information:
         | Information about or related to an individual for which there
         | is a possibility of logical association with other information
         | about the individual. * Linked Information: Information about
         | or related to an individual that is logically associated with
         | other information about the individual.
         | 
         | As a data engineer, the above interpretation means PII is in a
         | zillion tables and labeling a table with a boolean indicator
         | yes/no isn't that helpful. But as a policy person, NIST seems
         | to be recommending gauging PII more at the system (not table)
         | level and with a PII Confidentiality Impact Level of
         | low/medium/high that takes into account the context and overall
         | risk and that seems sensible. From a data cataloging standpoint
         | (gauging what's "infected" to use your term), I think it's
         | probably helpful to identify particular transactional tables or
         | personal table as having "high" PII disclosure impact vs "low";
         | the presence of "infection" from a virus ("PII") is mostly
         | irrelevant in a sufficiently large system where viruses/some
         | PII is inevitable but what matters is the severity/impact. A
         | zillion tables will be "low" (e.g. if most tables have audit
         | column saying which user last changed a particular record) but
         | certain transactional or user tables may be "high" and should
         | be recognized as such and the focus of any risk discussions
         | with the business or legal or breach notifications to customers
         | or what have you.
         | 
         | Going back to your original point, I don't think the choice of
         | natural vs surrogate key impacts the PII risk of the system or
         | even its individual tables. I would slightly concede that a
         | surrogate key (which in general I am in favor of) would make it
         | easier to reduce a particular individual's PII from a system by
         | concentrating it in one or a small number of tables with names,
         | etc. which might be helpful for enabling GDPR right to be
         | forgotten or something. But the degree of that PII elimination
         | from a system by blanking out or archiving a particular
         | user/person record is not necessarily reducing the PII for them
         | to 0 at least definitionally unless the transactional records
         | themselves are also removed as the AOL 2006 search data scandal
         | demonstrated (where a woman identified solely by a surrogate
         | key was able to be identified from her search term transactions
         | alone.) (Legally there would appear to be carveouts around
         | transactional deletion for some financial transaction records
         | and backups, but IANAL...)
        
           | wavemode wrote:
           | Your perspective is very thorough and interesting, but I can
           | simplify the matter a lot - yes, from my personal experience
           | it absolutely does matter for privacy regulation whether you
           | use natural or surrogate keys.
           | 
           | I've worked on systems in the biotech space where certain
           | tables and certain columns in our dataset were considered PHI
           | (personal health information), and others weren't. An auto-
           | incrementing user ID does not qualify as PHI under HIPAA.
           | Whereas a person's name or email address most certainly does
           | (since, the very fact that a particular person had received
           | lab testing at a particular lab, is itself considered PHI.)
           | 
           | Those PHI bits of data were scrubbed from the dataset that
           | most regular employees had access to. Casually allowing all
           | employees to have access to PHI would not have been compliant
           | with the law.
           | 
           | The restrictions on what can even be logged (when there's a
           | system error, for example) was very much controlled. "User
           | 12345 experienced <some exception>" was perfectly fine (and
           | greatly aided operational investigation, and customer
           | support.)
        
             | gregw2 wrote:
             | Thanks for the reply. I will concede or defer to you in
             | regards to PHI and HIPAA... it seems the philosophy behind
             | HIPAA/PHI is very different than PII or GDPR. HIPPA is
             | prescriptive. PII/GDPR are principle-based. HIPAA, it
             | seems, has some text that it's not PHI if the risk is "very
             | small" based on the opinion of someone with statistical
             | expertise documents that it could be de-identified OR if
             | the person avoids an explicit list of 18 things that it
             | cares about (see items (A) through (R) on page 96 of https:
             | //www.hhs.gov/sites/default/files/ocr/privacy/hipaa/ad...
             | ).
             | 
             | One of the first 17 things might be a surrogate key
             | ("account number") in one's system but if you look through
             | the others, the rest are things like name, SSN, biometrics,
             | IP#s, etc, which are definitely not surrogate keys.
             | 
             | The "OR" language makes the statistical expertise (and
             | "principles" of privacy) irrelevant if you avoid the 18
             | things; that avoidance forms a "safe harbor" of sorts so
             | you don't have to do any heavy thinking/lifting.
             | 
             | The 18th ("(R)") element of what is considered PHI does
             | seem to refer to surrogate keys but in a manner which
             | creates a clear carve-out/safe-harbor for them not being
             | PHI. That 18th form of PHI is "Any other unique identifying
             | number, characteristic, or code, except as permitted by
             | paragraph (c) of this section;"
             | 
             | But that paragraph (c) section seems indicate identifiers
             | such as surrogate integer/guid keys kept within a system,
             | as long as they are 1) not derived from an individual's
             | information (ie like integer or UUID surrogate keys) and 2)
             | which are maintained solely in the system are not
             | considered as element-18-"R"-PHI:
             | 
             | "(c) Implementation specifications: Re-identification. A
             | covered entity may assign a code or other means of record
             | identification to allow information de-identified under
             | this section to be re-identified by the covered entity,
             | provided that: (1) Derivation. The code or other means of
             | record identification is not derived from or related to
             | information about the individual and is not otherwise
             | capable of being translated so as to identify the
             | individual; and (2) Security. The covered entity does not
             | use or disclose the code or other means of record
             | identification for any other purpose, and does not disclose
             | the mechanism for re- identification."
             | 
             | By my reading, a surrogate key maintained within a system
             | is thus clearly not PHI under HIPAA. I never looked at the
             | details of HIPAA until today since it hasn't applied much
             | to my data and have been focused more on PII/GDPR. I
             | appreciate you describing the context of your remarks.
        
               | wavemode wrote:
               | The other difference between HIPAA and GDPR is compliance
               | rules. Under GDPR it is not a problem if all your
               | employees who have access to your database have access to
               | PII (since, from the perspective of GDPR, the customer
               | has given consent to share their information with your
               | organization as a whole). Sharing with third parties
               | outside your organization is where you get into trouble.
               | 
               | But under HIPAA, even your own employees need to have a
               | specific, documented justification for accessing customer
               | PHI. If they can do their job just as well by only
               | accessing a non-PHI dataset, you're required by law to
               | only allow them access to a scrubbed, non-PHI dataset.
        
               | whyever wrote:
               | I don't think GDPR is that different in that regard, you
               | also have to minimize access to PII within the
               | organization. See Article 25:
               | 
               | > The controller shall implement appropriate technical
               | and organisational measures for ensuring that, by
               | default, only personal data which are necessary for each
               | specific purpose of the processing are processed. That
               | obligation applies to the amount of personal data
               | collected, the extent of their processing, the period of
               | their storage and their accessibility. In particular,
               | such measures shall ensure that by default personal data
               | are not made accessible without the individual's
               | intervention to an indefinite number of natural persons.
        
               | randomdata wrote:
               | > SSN ... which are definitely not surrogate keys.
               | 
               | Surely SSN is a surrogate key? They are not naturally
               | derived. The early ones were serial (i.e. an auto-
               | incrementing field) and more recent ones are randomly
               | generated (i.e. a UUID).
        
               | shkkmo wrote:
               | SSN is absolutely not a surrogate key. If you received a
               | piece of information from an external source, it is data,
               | not a surrogate key.
               | 
               | If you use data as a key, then that is a natural key, if
               | you invent a value to use as an identifier, that is an
               | artificial or surrogate key.
               | 
               | If an API provides you an ID for a record, that is data.
               | If you use it as a key, that is also a natural key in
               | your system.
        
               | randomdata wrote:
               | _> If you received a piece of information from an
               | external source, it is data, not a surrogate key._
               | 
               | It may not be _your_ surrogate key, but it is someone 's!
        
               | shkkmo wrote:
               | Possibly, you don't actually know since it is external
               | data.
        
               | randomdata wrote:
               | We do know because the database owner has openly talked
               | about how the keys are derived. That still doesn't make
               | it a good key for your database, but I can assure you
               | that the world doesn't revolve around you. It is
               | someone's surrogate key - therefore it is a surrogate
               | key.
        
               | hathawsh wrote:
               | Conceptually, any information created or consumed outside
               | your organization is not valid as part of a surrogate
               | key, so SSNs are not a surrogate key. Furthermore, any
               | time you reveal the primary key, that key can become the
               | thing that people come to depend upon to find that
               | database row, which leads to the possibility that
               | someday, someone will have an important need for some
               | primary keys to change, even if the primary key was
               | supposed to be a surrogate key. The longer a database
               | lives, the more likely that surrogate keys morph into
               | natural keys. That's what happened to SSNs.
        
         | twodave wrote:
         | Agreed, and also they tend to be more guessable. Make a page
         | available with an email address as the id and just watch the
         | hackers use it to discover users of your service and attempt to
         | log in as them.
        
           | sgarland wrote:
           | That is an AuthZ problem, orthogonal to DB key debates.
        
       | wodenokoto wrote:
       | Just to add on to the CPR number. As mentioned it contains a
       | birthdate and a gender identification as well as a checksum.
       | 
       | Since many refugees don't know (or can't prove) their birthdate,
       | they are given first of January. And enough first of Januaries
       | are handed out that for some years there aren't enough valid
       | numbers, so numbers that fail the checksum are handed out too.
        
         | knallfrosch wrote:
         | Shows how the assumption that everyone has, or reveals, a known
         | or at least approximate birth date cascades through the system.
        
       | ccppurcell wrote:
       | Here in Czech republic, everyone has an id number (rodne cislo)
       | but as a foreigner, I have multiple id numbers. Further
       | complicating matters, I was given a rc with the wrong gender (the
       | gender is encoded in the number) and my nationality was at some
       | point incorrectly listed as "Ireland". I do wonder if the Czech
       | government thinks I am just two completely different people.
        
         | Culonavirus wrote:
         | seems like a problem, you should get that czeched out
        
       | jpnc wrote:
       | The title could probably be extended to 'You'll regret using
       | natural keys as primary keys' and it would be right in some ways.
       | Personally, I've come to the conclusion that it's probably best
       | to use both surrogate and natural keys. Surrogates (IDENTITYs,
       | UUIDs what-have-you) as PK from a technical perspective and
       | natural key as 'PK' from a business/data modeling perspective.
        
       | __MatrixMan__ wrote:
       | Natural keys are often the best you can do if you're trying to be
       | partition tolerant.
       | 
       | With surrogate keys you end up with duplicate entities whenever a
       | partition heals (supposing the entity appeared to both partitions
       | while they were still separate).
        
       | karmakaze wrote:
       | A simpler and more compelling reason in my experience is that
       | relating tables is so much easier with synthetic keys. You can
       | decide how large your synthetic keys are based on uniqueness
       | requirements, natural keys have their own size and format that
       | would then be copied to make relations. And if uniqueness depends
       | on several values to be unique then dealing with compound foreign
       | keys isn't fun or efficient. At this point, there would have to
       | be a very compelling reason to want to use natural keys to offset
       | this and I've rarely seen them. A case that may come up is for a
       | miscellaneous table to use the synthetic key from another table
       | with another natural value that uniquely identifies a row. I
       | wouldn't choose this route if you ever think you want to relate
       | anything to rows in this table, they should be leaves related
       | only to the synthetic key being used.
        
       | kijin wrote:
       | "Natural keys" just means that someone else can change them under
       | your feet.
       | 
       | People change names, their phone numbers, passport numbers.
       | Governments change their numbering schemes all the time.
       | Corporations merge and split. Heck, even governments merge and
       | split, surprisingly often, at the municipal level. All without
       | your knowledge or consent. And you're left wondering why you
       | suddenly have duplicate key errors in your database.
       | 
       | The only key that you can trust is one that you, and only you,
       | control. That's the point of the surrogate key. Whether it's
       | BIGINT or UUID v7 is beside the point.
        
       | tedunangst wrote:
       | Every blog post about database schema should use as its example
       | the most obvious example, a blog. People are pretty familiar with
       | the domain, and it even has many of the same challenges. You
       | probably have a postid field in there somewhere, but lots of blog
       | software also uses url as a natural key. Which is why sometimes
       | the old title appears in the url. Can't change that. Except you
       | can. Some software will do redirects, by creating stub records
       | for old URLs that map to new ones. So that's not an intractable
       | problem. Etc.
        
       | pphysch wrote:
       | The article should be renamed "natural primary keys". Natural
       | primary keys have fatal flaws, but natural "secondary" keys are
       | all sugar. Best of both worlds.
        
       | MikePlacid wrote:
       | > I already predicted that changing the number would prove to be
       | either impossible, or have all sorts of cascading effects,
       | ultimately terminating in official records no longer recognizing
       | that the car is mine.
       | 
       | Oh my, I can feel that pain. Here's what happened last week,
       | caused by a change in... SSN? no - in our home address.
       | 
       | My wife earned some unemployment benefits three years ago, which
       | were put on a plastic card issued by The Bank. She finally found
       | time to access the funds (she's a middle school teacher), but
       | when she went to The Bank, they said they didn't have the money
       | anymore--they'd sent it back to California. So, she called
       | California. They were like, "No problem, we'll send you a check.
       | Oh, you have a new address? Let's change it. Wait, what is
       | happening... oh, now your account is locked, it says: potential
       | fraud"... They needed a supervisor to unlock it, which took 20
       | minutes. The supervisor unlocked the account, but because it was
       | marked as potential fraud, they couldn't mail a check anymore.
       | Instead, they linked the account back to The Bank (20 more
       | minutes), and she had to go there in person with her ID to get it
       | checked.
       | 
       | So, she went to The Bank. But you can't just walk in The Bank and
       | show your ID to get it checked; you need an appointment. And to
       | get an appointment, you need an account with The Bank. But her
       | California benefits account? Oh, it is marked as potential fraud
       | - it didn't count. So, they spent 20 minutes to open a new
       | account for her. She got an appointment for later that day, in 4
       | hours, went back to The Bank, and had her ID checked (yes, the
       | second time in one day - they need to check your ID to open an
       | account too).
       | 
       | Did she get the money then? Of course not--the account is still
       | flagged as potential fraud. No cash possible. Call California
       | again. California agreed to mail the check to the new address,
       | probably by some oversight.
       | 
       | So my question is: do these "potential fraud" flags in databases
       | ever die a natural death?
       | 
       | With some hope, sincerely, a Husband of a Potential Fraudster.
        
         | knallfrosch wrote:
         | > Mailing a check to transfer money
         | 
         | Real 1800 vibes here.
        
       | the_gipsy wrote:
       | In Spain the ID numbers are assigned at birth, carry no
       | information, and cannot be changed. Each police comissary that
       | registers birth gets a unique set of IDs to assign (per year or
       | whatever).
       | 
       | However. Mistakes still happen. A colleague had the same ID as
       | someone else. He said he tried to change it, but it was
       | impossible because it was such an impossible concept to any
       | public servant involved. In the end he gave up and just lived
       | with the fact.
        
         | fcatalan wrote:
         | Not really assigned at birth, but at the time you first request
         | the card. It's a very very very crappy database ID number:
         | 
         | -There's a significant amount of duplicates, enough that if
         | your database is big enough you will find one sooner or later.
         | 
         | -Format is variable: There are older 7 digit numbers, modern 8
         | digit ones, some people pad the 7-digit with a 0, some don't,
         | some consider the CRC letter part of the key, some don't, some
         | just append the letter, some hyphen it. If you deal with manual
         | data entry anywhere, you will suffer.
         | 
         | -Foreign people exist: You start using their passport number,
         | the format of your key is now completely arbitrary, can't
         | validate it. If you were using 8 digit NIFs as keys now some
         | countries use 8 digit passports, increasing your chances of
         | duplicates.
         | 
         | -Foreign people stay: They get a resident card and they use
         | that. Format is almost like NIF but not really so you need to
         | account for that. Someone you registered initially with a
         | passport now has the card and registers for something else with
         | it. Have fun cleaning their duplicated identity from the whole
         | system.
         | 
         | -Foreign people become Spanish: Now they get a NIF. If you have
         | dealt with them for a long enough time, have fun again fixing
         | their records for the second time.
        
         | benhurmarcel wrote:
         | Foreigners change from a NIE to a new DNI if they get Spanish
         | nationality. So their ID number changes then.
        
         | shkkmo wrote:
         | Even if this is true now (which it isn't entirely), that is no
         | guarantee that it will be true in the future. Any time you use
         | an external value as a key, you run the risk of a policy
         | change. An internally generated artificial key is the only safe
         | way to ensure immutability and lack of semantic content.
        
       | EVa5I7bHFq9mnYK wrote:
       | How is name and city a key for a restaurant? There are a dozen of
       | mcdonalds in my city.
        
         | layman51 wrote:
         | That combination of columns being a candidate key depends on
         | the world or domain that your dataset is modeling. In this
         | example, I think it was a list of the top 50 restaurants in the
         | world and so it would make a fine key. But if that database was
         | ever expanded then definitely you would run into collision
         | issues.
        
       | perlgeek wrote:
       | Natural keys have a tendency to change over time in what is
       | considered unique.
       | 
       | For example, you have a company, and every employee has a unique
       | employee number generated by HR... until the company merges with
       | another, that also has unique employee numbers, and suddenly the
       | identifier becomes the tuple (organization, employee number) that
       | becomes unique. If you've used the employee number as a foreign
       | key in other tables, you have to change those too.
       | 
       | This is a somewhat contrived example, but I've had enough real,
       | annoying examples happen to me in my career that I avoid natural
       | keys.
        
         | knallfrosch wrote:
         | Ah, how I remember changing my name. People think it only
         | happens now and only to 0.1% of people who decide to change
         | their gender in a leftist Western society, but somehow they
         | forget that people have been changing family names for
         | centuries. And of course, my work email address also changed.
         | Now THAT is fun.
        
       | raverbashing wrote:
       | > Many were the times, earlier in my career, when I decided to
       | use a 'natural key' as a key in my own database. As far as I
       | recall, I've regretted it every single time.
       | 
       | Correct. Infamously correct
       | 
       | Every "natural key" will have some way of fumbling things down
       | the line.
       | 
       | Nothing is unique, not even joining your natural key with other
       | deduplication info. Just save yourself the trouble.
        
       | DeathArrow wrote:
       | Synthetic key is the easiest choice. Because the primary key in
       | one table is going to be foreign key in another table. If I have
       | an employee table I can use their email address as the key but
       | then I will have to use the email in any other table that
       | references employee table, such as salary table, days off table
       | and so on.
       | 
       | Most of the times using an auto incrementing integer or UUID is
       | just fine.
       | 
       | In NOSQL there's usually a synthetic ID used to uniquely identify
       | a document, I've never seen people using natural keys or compound
       | keys.
        
       | webprofusion wrote:
       | As a junior dev decades ago I was once asked to change a
       | perfectly good database (with integer IDs) to use natural keys,
       | so to look up (or SQL join to) a particular office for instance
       | your query would filter on Country, City and Street with the
       | expectation that you get one result.
       | 
       | This was the dumbest technical decision I have ever been asked to
       | be a part of.
        
       | gpvos wrote:
       | In the example given, the most logical natural key to me would be
       | (year, rank). Why otherwise would they have the year in there?
        
         | mathgorges wrote:
         | That was my first thought too. But like TFA says, that still
         | wouldn't prtect you from clerical errors.
        
       | moi2388 wrote:
       | I've just always considered natural keys public keys, and added a
       | surrogate key as private key.
       | 
       | From the outside you won't see it, but internally it saves a lot
       | of headaches.
       | 
       | Space, speed, migrations, I have just never seen an actual
       | downside to using a surrogate.
       | 
       | And if you are _absolutely_ sure, add some unique constraints.
       | Easier to change when you _inevitably_ have to.
       | 
       | Mainly; I've learned that my initial assumptions are never
       | correct.
        
       | samtheprogram wrote:
       | I technically use a natural key, if you could call it that, when
       | using Supabase. Having a `public.users` table with a primary key
       | that's the same as the key for the internal user auth table (from
       | Supabase).
       | 
       | Arguably not a natural key, or at least a contrived example, but:
       | any downsides?
        
       | mobiuscog wrote:
       | It's also important to put as much thought into any synthetic
       | key.
       | 
       | It's _easy_ to just use an auto-generated sequence... but then
       | you start having to export /import or otherwise merge data, and
       | other manipulations that often use the primary key and find there
       | are collisions everywhere. There can also be problems when
       | needing to support multiple databases, or update versions.
       | 
       | UUIDs (or equivalent synthetic keys that are independent of the
       | database itself) are often the best answer for this reason.
       | 
       | Having been bitten by sequences so many times in the past, I find
       | them to often be more trouble than natural keys in the first
       | place - just a lazy approach.
        
         | jonathan-re wrote:
         | Well, UUIDs bring their own challenges. Dropping that one here:
         | 
         | Be Careful with UUID or GUID as Primary Keys
         | https://news.ycombinator.com/item?id=14523523
        
         | globular-toast wrote:
         | It's pretty easy to migrate to UUIDs later if required, as long
         | as you don't leak your keys.
         | 
         | My work made their own uuid-like scheme, similar to UUIDv1
         | which incorporates several elements like machine ID and
         | timestamp. The mistake was twofold: first they exposed them (so
         | people started using them) but, worse, they made them easily
         | reversible, so people started decoding the information in them.
         | People would naturally see one and think "oh this is a record
         | from place X". Of course that might not be true following
         | subsequent data corrections, but the key can't change.
        
         | akvadrako wrote:
         | UUIDs are only good when you don't care about ergonomics or
         | performance.
         | 
         | Snowflake IDs are much more reasonable. 41 bits timestamp (ms)
         | + 10 bit machine id + 12 bit serial.
         | 
         | But if you care about ergonomics and privacy the most then
         | short and random IDs are really the best.
        
       | otikik wrote:
       | One of my first jobs involved an account management software
       | which created an ID for each user based on first name and
       | surname. So John Smith would get SMIJOH. It was the user natural
       | key and it could not be changed once created.
       | 
       | One woman got married and changed her name; se became really
       | upset when she found out that we couldn't change her login.
        
       | canary-hill wrote:
       | What has worked best for us is to have a sequential integer as
       | the primary key as well as UUIDv4 as a surrogate key for every
       | row. The applications would expose the objects only using uuid
       | and never the primary key. The primary key always remain internal
       | at the database level and never gets referenced or used at the
       | application layer. In this way the security and privacy of
       | objects are maintained and having sequential integral primary key
       | for the object ensures the database design remains robust.
       | Essentially, separating out the key for the database and the
       | application. Is there any potential issue with this approach?
        
       | sigtstp wrote:
       | If anyone works in bioinformatics, please, please, for the love
       | of god, generate your own unique IDs. Database identifiers are
       | not generally unique (the same id might get reused for e.g.
       | protein variants). Even sequences are problematic: I've found the
       | same seq with different ids (can't remember the db now), and they
       | can change (sequencing or human error might've occured and they
       | get updated).
        
       | teaearlgraycold wrote:
       | What I'm doing right now is putting synthetic (auto incremented)
       | ids on every table, but also adding lots of compound unique
       | indices. We have a lot of long background processes that might
       | need to be restarted in the middle. The natural key definition
       | allows for easy idempotency in subsequent runs (the jobs only
       | upsert with. collisions handled by a simple timestamp update).
       | 
       | Most entities are tied to a version record (more or less a God
       | object). But _some_ span across versions. I'm able to have the
       | entities that aren't tied to a version float around because they
       | reference the natural key of the versioned entities, minus the
       | version's synthetic key.
        
       | strnisa wrote:
       | I've become a fan of unique, relatively short and "human-
       | readable" IDs, such at the ones used by Stripe, e.g.
       | `cus_MJA953cFzEuO1z` for an ID of a customer. Here's a Stripe dev
       | article on the topic: https://dev.to/stripe/designing-apis-for-
       | humans-object-ids-3...
       | 
       | If you use JavaScript/TypeScript, you can make them like this:
       | function makeSlug(length: number): string {             const
       | validChars =
       | "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
       | const randomBytes = crypto.randomBytes(length);             let
       | result = "";             for (let i = 0; i < length; i++) {
       | result += validChars[randomBytes[i] % validChars.length];
       | }             return result;         }              function
       | makeId(tableName: string): string {             const idSpec =
       | TABLE_NAME_TO_ID_SPEC[tableName];             const prefix =
       | idSpec.prefix;             const slugLength = idSpec.length -
       | prefix.length - 1;             return
       | `${prefix}_${makeSlug(slugLength)}`;         }
        
         | vanderZwan wrote:
         | Nice!
         | 
         | For the record: the valid chars string is 62 characters, so
         | naively using a modulo on a random byte will _technically_
         | introduce a bias (since dividing 256 values by 62 leaves a
         | remainder). I don 't expect it to really matter here, but since
         | you're putting in the effort of using crypto.randomBytes I
         | figured you might appreciate the nitpick ;).
         | 
         | Melissa E. O'Neill has a nice article explaining what the
         | problem is, and includes a large number of ways to remove the
         | bias as well:
         | 
         | https://www.pcg-random.org/posts/bounded-rands.html
         | 
         | (in this case adding two more characters to the validChars
         | string would be the easiest and most efficient fix, but I'm not
         | sure if that is a possibility here)
        
           | naasking wrote:
           | > For the record: the valid chars string is 62 characters, so
           | naively using a modulo on a random byte will technically
           | introduce a bias
           | 
           | Indeed, there's no reason you couldn't just add "_" and "-"
           | or "." as well to complete the set. Your identifier will
           | still be URL-safe. I've been using this type of encoding for
           | years [1] for these kinds of ids to use in URLs, and
           | encoding/decoding is super-fast with some bit shifts. And
           | unlike Base64, you don't need padding characters.
           | 
           | [1] https://sourceforge.net/p/sasa/code/ci/default/tree/Sasa.
           | Web...
        
             | notfed wrote:
             | - and _ tend to break text selection.
        
               | naasking wrote:
               | I'm not sure what you mean by "break". If you mean that
               | touching or double-clicking on a block of text only
               | extends up to the nearest symbols, that's true. But if
               | your text selection UX is not terrible then it should be
               | simple to extend that further.
               | 
               | That said, iOS and Android text selection have gotten
               | worse recently, IMO.
        
         | whyever wrote:
         | Please don't use % to generate integers from a range, it's not
         | uniform, which can be disastrous if you rely on your numbers
         | not being predictable. You can use crypto.randomInt instead.
        
           | CodesInChaos wrote:
           | That is definitely an improvement, but I find that concern a
           | bit exaggerated.
           | 
           | If all you need is unpredictability, a minor bias is sub-
           | optimal but not disastrous. The 256 % 62 bias used here
           | should reduce the min-entropy per character by 5%. And you
           | can easily minimize the bias by using larger integer than a
           | single 8-bit byte. There few algorithms where minor biases
           | cause a disaster, like DSA.
        
         | vincentdm wrote:
         | I don't like the prefix idea: besides the duplication of
         | information, it also becomes a liability if you ever rename
         | things.
         | 
         | Imagine you prefix all customer IDs with `cus_`, but at some
         | point decide to rename Customer to Organization in your
         | codebase (e.g. because it turns out some of the entities you
         | are storing are not actually customers). Now you have some
         | legacy prefix that cannot be changed, is permanently out of
         | sync with the code and will confuse every new developer.
        
           | codeulike wrote:
           | I wouldn't worry about that - I still think its worth it.
           | I've had systems which during development we thought the
           | Contact page was going to be called 'Contact' in the UI but
           | at the end it got re-labelled to 'Individual' but in all the
           | code it was still called Contact and the IDs all started with
           | a C - but you know what? It was still useful to look at an
           | ID, see the C and know that it was an Individual.
        
           | shawabawa3 wrote:
           | > Now you have some legacy prefix that cannot be changed
           | 
           | Yes you can.
           | 
           | You can support the old cus_<ID> prefix as well as the new
           | org_<ID> prefix, but always return org_<ID> from now on
        
           | wongarsu wrote:
           | Reddit prefixes their IDs with t1_ for comments, t2_ for
           | accounts, etc. That sidesteps the renaming issue.
           | 
           | Though I believe they mostly do it because their IDs are
           | sequential, so without prefix you wouldn't easily notice if
           | you use the wrong kind of id. They also only apply prefixes
           | at the api boundary when they base36 encode the IDs, the
           | database stores integers
        
           | chuckadams wrote:
           | I have that exact issue with a couple different identifiers,
           | and it's not a big deal. Usually it goes along with some data
           | model change you already have to write compatibility code
           | for, the new and old names tend to be related, and the old
           | name tends to stick around other parts of the code anyway.
           | Opaque IDs don't reduce the confusion there, documentation in
           | appropriate places does.
        
         | codeulike wrote:
         | Except the Stripe ones are case sensitive which can be annoying
         | with some databases
        
           | datavirtue wrote:
           | You know what's annoying? People storing GUIDs in a case-
           | insensitive database.
        
             | codeulike wrote:
             | Classic RFC9562 GUIDs were often represented as hex like
             | {550e8400-e29b-41d4-a716-446655440000} and so were case
             | insensitive
        
         | garblegarble wrote:
         | This approach will randomly generate profanity. If the ID is
         | visible to users it can cause some to get upset (and best-case
         | simply looks unprofessional). On a purely technical level if
         | visible in URLs it can cause links to be blocked/altered by
         | e-mail filters / filtering proxies.
         | 
         | It's generally a good idea to drop vowels for this reason.
        
           | account42 wrote:
           | fvck cvnt b1tch
        
           | Knufferlbert wrote:
           | We once had a rather angry Irish customer calling our support
           | complaining that we called him a pikey (slur for gypsy).
           | After some back and forth it turns out we just gave him an
           | apikey.
           | 
           | We never had a similar issue with our random
           | numbers/letters/reset passwords or anything like that which
           | don't have any kind of "dont return profanity" protections.
           | Though I agree, someone getting a randomly generated customer
           | portal url or something containing fuck or similar would look
           | bad. Our cloudfront or something (or was it main public
           | facing s3 bucket? can't remember) starts with "gay" and was
           | never picked up on.
        
             | biztos wrote:
             | Totally off topic, but "gypsy" is itself a slur for Romani
             | people in much of the world.
             | 
             | https://en.wikipedia.org/wiki/Romani_people
        
               | isaacremuant wrote:
               | I'd say those articles show Wikipedia's political bias
               | and a tendency to overly politically correct instead of
               | portraying reality.
               | 
               | Many in real life do refer to themselves as gitano or
               | Gypsy and would ask for others to refer to them as such.
               | 
               | Of course it's very easy to find an article saying
               | otherwise and then using that as the end of discussion
               | for Wikipedia editors.
        
               | coooolbear wrote:
               | By 'find an article' you mean find ~10 real citations
               | including the resolution of an authority a long time ago
               | and to tell the reader it is not clear or definitive?
               | 
               | Better to be careful and let any individuals or
               | communities tell you what they want. I have Roma
               | connections in my family and at one point the word we'd
               | use is 'gypsy'. But, because I'm not Roma myself, if I
               | came across some other group I wouldn't assume I'm just
               | allowed to say it to them.
        
               | vsuperpower2020 wrote:
               | I don't care what they want. A lot of people are tired of
               | playing these language games.
        
               | bigstrat2003 wrote:
               | Also, "do what people want" is fine for your interactions
               | with an individual. But it's not a viable general rule
               | for language, where we need one single approach. I think
               | saying gypsy unless someone personally tells you they
               | would rather you don't call them a gypsy is perfectly
               | reasonable.
        
               | mst wrote:
               | It's in the annoying category where it _can_ be used as a
               | slur but also gets used as not-a-slur, including but not
               | limited to by the people it describes.
               | 
               | Locally (north west england) people generally use
               | "traveller" as a description ... but there are definitely
               | people who use _that_ as a slur.
               | 
               | Language be like that sometimes.
        
           | ThePowerOfFuet wrote:
           | I hear Scunthorpe is lovely this time of year.
        
             | brazzy wrote:
             | I hear someone got buttbuttinated there recently.
        
               | mst wrote:
               | A problem we've been having since the medireview age.
        
               | bigstrat2003 wrote:
               | At first I thought that this was the result of my cloud-
               | to-butt extension, and I was trying to figure out what
               | "cloudcloudinated" meant.
        
           | fregonics wrote:
           | True! For instance the example itself. "cus" is profanity in
           | Portuguese. If you were to localize the application, this
           | would be a factor.
        
         | theophaniel wrote:
         | I feel like https://github.com/jetify-com/typeid is the
         | solution to this
        
         | mettamage wrote:
         | Why is such a thing called a slug?
        
           | inopinatus wrote:
           | It's an old typesetting term that found its way into content
           | management systems.
           | https://archive.nytimes.com/www.nytimes.com/times-
           | insider/20...
        
           | corobo wrote:
           | Ooh I had this wonder a while back and jotted it down just in
           | case anyone else ever wondered about it:
           | 
           | Comes from the ye olde paper-based blogs they call
           | newspapers. When an article is being put together it's given
           | a short name, sort of like a project name. This name would
           | remain the same throughout the article's life - from reporter
           | through to editor - it left its trail through the process.
           | Like a slug.
        
             | meindnoch wrote:
             | That's not what Wikipedia says about its etymology though.
             | 
             | >The origin of the term slug derives from the days of hot-
             | metal printing, when printers set type by hand in a small
             | form called a stick. Later huge Linotype machines turned
             | molten lead into casts of letters, lines, sentences and
             | paragraphs. A line of lead in both eras was known as a
             | slug.
        
               | corobo wrote:
               | Ooh interesting. It looks like I have either
               | misinterpreted or found a source that misinterpreted (it
               | was a few years back, unsure if I came to the conclusion
               | or found it). I'll have to update my notes, cheers!
               | 
               | Apologies for the wetbrain hallucination, HN!
        
               | 8372049 wrote:
               | I'm not sure you're hallucinating. The dictionary I
               | checked lists the printing and journalism terms
               | separately. It's quite possible they have diverging
               | etymologies, meaning both can be correct:
               | 
               |  _5. Print. a. a thick strip of type metal less than
               | type-high. b. such a strip containing a type-high number
               | or other character for temporary use. c. a line of type
               | in one piece, as produced by a Linotype. 8. Journalism.
               | a. a short phrase or title used to indicate the story
               | content of a piece of copy. b. the line of type carrying
               | this information._
        
               | hprotagonist wrote:
               | or the journalism term itself diverged from the
               | typographic one.
        
               | corobo wrote:
               | Aye this is what it seems to be having double checked the
               | reply's claim
               | 
               | Got to the Wikipedia page
               | https://en.wikipedia.org/wiki/Slug_(publishing) which
               | could possibly support the slimy conclusion of "it's a
               | trail through the process" but that article has an
               | etymology section that refers to the metal slug
               | 
               | I guess it could mean both depending on whether you're
               | looking for the meaning of the word or the meaning of the
               | concept but I didn't find any other slimy grub references
               | (via an admittedly limited double check)
        
               | hprotagonist wrote:
               | another fun etymological rabbit hole for you: stereotype
               | and cliche both probably originated as typographer
               | jargon.
        
         | bambax wrote:
         | Nice, but better to remove ambiguous / hard to read letters:
         | ijlo, IJLO, and 01 (and maybe 7 as well?)
        
         | inopinatus wrote:
         | I'm not convinced that all Stripe IDs are wholly random
         | strings. I just decoded the base62 part of four genuine "acct_"
         | objects, which at 16 characters are just shy of representing a
         | 12 byte value (log2 62^16 =~ 95.3), and they all have a leading
         | byte of "00000011" and two of them even have a leading 32 bits
         | that is very suspiciously close to an epoch timestamp of a
         | couple of years ago.
         | 
         | There is a similarly suspicious pattern in some of their longer
         | identifiers, invoices for example at 24 characters (ca.143
         | bits) all seem to have a first byte of "00000010".
         | 
         | Even in the article you've linked to, look closely at the IDs:
         | pi_3LKQhvGUcADgqoEM3bh6pslE
         | pm_1LaXpKGUcADgqoEMl0Cx0Ygg          cus_1KrJdMGUcADgqoEM
         | card_1LaRQ7GUcADgqoEMV11wEUxU
         | 
         | Notice the consistently leading low-integer values (a 3, then
         | three 1s)? and how it's almost always followed by a K or an L?
         | That isn't random. In typical base62 encoding of an octet
         | string, that means the first five or six bits are zero and the
         | next few bits have integer adjacency as well. It also looks
         | like part of the customer ID (substring here, "GUcADgqoEM",
         | which is close to a 64-bit value) is embedded inside all of the
         | other IDs and then followed by 8 base62 characters, which might
         | correspond to 48 bits of actual randomness (this is still
         | plenty, of course).
         | 
         | Based on these values it seems there's a metadata preamble in
         | the upper bits of the supposedly "random" value, and it's quite
         | possible that some have an embedded timestamp, possibly
         | timeshifted, and a customer reference, as well as a random
         | part, and who knows maybe there's a check digit as well.
         | 
         | It's possible - albeit this is not analytical but more of a
         | guess - that the customer ID includes an epoch-ish timestamp
         | followed by randomness or (worst case, left field) is actually
         | a sequence ID that's been encrypted with a 64-bit block cipher
         | and 32 bits of timestamp as the salt, or something similar (pro
         | tip: don't try that at home).
         | 
         | My view is that either Stripe's engineering blog is being
         | disingenuous with the truth of their ID format, or they're
         | using a _really_ broken random value generator. If the latter,
         | I hope it 's only in scope of their test/example data.
        
           | dividuum wrote:
           | I took a look at a bunch of stripe customer ids I have stored
           | and at least mine look very random on first glance. I assume
           | their blog post uses demo keys or something similar.
        
           | wslwsl wrote:
           | In the comments it's mentioned, that the IDs contain a shard
           | key for faster lookups.
           | 
           | https://dev.to/stripe/designing-apis-for-humans-object-
           | ids-3...
        
         | strnisa wrote:
         | Thanks for all the improvement suggestions! Taking them into
         | account, the `makeSlug` function becomes:
         | function makeSlug(length: number): string {             const
         | alphabet = "0123456789abcdefghjkmnpqrstvwxyz";             let
         | result = "";             for (let i = 0; i < length; i++) {
         | result += alphabet[crypto.randomInt(alphabet.length)];
         | }             return result;         }
        
           | biztos wrote:
           | I think that 0 and 1 are likely to cause problems when
           | customers end up reading their "user ID" back to your
           | employees in Customer Support Country over the phone.
           | 
           | "It's one-three-oh-dee-ee-el. Yes, I'm sure, EL as in
           | elephant."
        
             | jagged-chisel wrote:
             | One can convert those to the correct characters. Since "oh"
             | and "el" are excluded from the alphabetic range, they
             | become "zero" and "one" - deciding whether that is done in
             | software for the help desk, or in the brain of the help
             | desk staff is left as an exercise to management.
        
             | smaudet wrote:
             | I wonder if Unicode could be used to alter the characters
             | such that these mistakes would be less possible, e.g. using
             | 0.
        
               | fhars wrote:
               | That is "AT", isn't it?
               | 
               | (No, it isn't, if you look closely enough.)
        
               | smaudet wrote:
               | Yeah...what I'd _really_ like to do would be to give a
               | character a  "natural" background color, e.g.
               | 
               | Then its simple for support to say "red is one, green is
               | ell". But you can't just add a color to a character,
               | because copy paste/rich formatting don't work everywhere,
               | or even transfer well...
               | 
               | Alternatively, if you use 0 and 1. it matters less if the
               | user says "at" or "zero", and more that they didn't say
               | "oh" or "one".
        
               | Piskvorrr wrote:
               | It's obviously a . Or perhaps a # . Maybe an A3/4, on odd
               | Wednesdays?
               | 
               | (Unicode has its strengths. Making up replacement
               | characters isn't one.)
        
               | codr7 wrote:
               | I happen to know that the biggest ski resort reservation
               | system in Scandinavia contains a function called
               | MaybeOnATuesday(), but to my knowledge it's never called.
        
               | WorldMaker wrote:
               | If you are trying for URL safe, Unicode is problematic
               | because of Punycode conversions and differing browser
               | behavior with Unicode URLs. (Some browsers always show
               | Unicode as Unicode in URLs. Some browsers always show
               | Unicode as Punycode in URLs. Some browsers switch between
               | the two based on a huge number of variables such as gTLD,
               | user preference, phase of the moon, etc.)
        
             | gvx wrote:
             | Anything that needs to be read over the phone should
             | probably be written out using something like the NATO
             | phonetic alphabet, split into smaller chunks if needed:
             | "The code? It's kilo eight niner; one three mike; delta
             | echo lima."
        
               | 8372049 wrote:
               | Having come from a military background where using that
               | is second nature, I'm constantly surprised how rarely I
               | meet civilians who understand it effortlessly. When
               | picking up a package I say "the code is Oscar Foxtrot
               | three-fife" and you see the person processing for a long
               | time to extract the first letter of the word. I've
               | started saying "OF, that's Oscar Foxtrot, 3-5" to help
               | them out.
               | 
               | In other words, asking a customer/consumer to be able to
               | recite something in phonetics is not realistic in most
               | cases.
               | 
               | Fortunately the code already takes this into
               | consideration and removes ambiguous characters.
        
               | hprotagonist wrote:
               | digikey phone personnel all speak NATO. it's wonderful.
        
               | hobs wrote:
               | First thing I drilled into Apple phone support folks.
        
               | mannykannot wrote:
               | My experience in the USA is that if I don't include the
               | phrase "as in" (as in "X as in Xray") most people still
               | will not realize what I am doing (the alternative "for"
               | can be confused with the digit.)
               | 
               | I also ask them to check my readback of key information
               | they have given me and vice-versa; usually that works
               | well.
        
             | 8372049 wrote:
             | i, L and o are left out of the alphabet in the snippet, so
             | there's not really any ambiguity.
        
             | jgalt212 wrote:
             | Or just drop the L and O, like CUSIP does.
        
             | kstenerud wrote:
             | I developed safe32 for this reason.
             | 
             | https://github.com/kstenerud/safe-
             | encoding/blob/master/safe3...
             | 
             | Notably, confusable characters are interchangeable when
             | being ingested (although a machine encoder MUST always
             | produce canonical output).
             | https://github.com/kstenerud/safe-
             | encoding/blob/master/safe3...
             | 
             | So a user can confuse 1 for l, 0 for o, I for l, u for v,
             | uppercase, lowercase etc, or the agent can say any of those
             | over the phone, and it won't matter.
        
               | vanderZwan wrote:
               | Oh that looks well-thought out and is probably the sanest
               | way to solve this particular problem!
               | 
               | It's obvious why the safe64/safe80/safe85 cannot do this,
               | but is there a reason why the safe16 version doesn't have
               | the same features?
        
               | kstenerud wrote:
               | Oh whoops that's an oversight! I'll fix that up tonight.
        
               | what wrote:
               | Isn't this just crockford encoding?
        
               | kstenerud wrote:
               | Every base-32 style encoding is Crockford at the core.
               | The difference is in the alphabet, and also whether it
               | requires padding or not (safe32 does not).
               | 
               | Crockford also incorporates error correction, which is
               | unnecessary in modern systems since the underlying
               | protocols do that already.
        
               | ss64 wrote:
               | Another approach is using base31 with all vowels removed
               | https://ss64.com/ps/syntax-base31.html
        
               | kstenerud wrote:
               | You still have the problem of 1 vs l. And also it doesn't
               | support user-error (reading 0 as O, or reading V as U).
        
             | inopinatus wrote:
             | My favourite ambiguous readback is to say "M for Movember".
             | 
             | I have a note from a few years ago that 367CDFGHJKMNPRTWX
             | may be a sufficiently unambiguous alphabet. Drop the one
             | you like the least (probably N) to obtain a faux hex
             | encoding.
        
           | kodisha wrote:
           | Or, just use cuid2 [1] + prefix.
           | 
           | [1] https://github.com/paralleldrive/cuid2
        
             | fnordsensei wrote:
             | One thing I've been looking for in an ID generator is a way
             | to supply a blocklist. There are a number of character
             | combinations I'd like to avoid in IDs, because they might
             | be offensive or get stuck in filters when copy-pasted (e.g.
             | in a URI).
             | 
             | This can be solved in user space by regenerating if the
             | character sequences are detected, but this a) skews the
             | distribution, and b) potentially takes time, especially
             | when the ID generator is made to not be "too fast". I want
             | to generate a single ID _that passes the blocklist_ in a
             | timeframe that is not too fast, if that makes sense.
             | 
             | Is there an ID generator that takes this into
             | consideration?
        
               | ceejayoz wrote:
               | Just take out the vowels and numbers that can look like
               | vowels. Nixing 0 means no b00bs IDs, and avoids 0/O; I
               | usually take out 1/I as well.
        
               | fnordsensei wrote:
               | That goes some of the way, but I can think of a few
               | problematic sequences that are only consonants and/or
               | numbers.
        
           | ceejayoz wrote:
           | I like to nix vowels and things that look like them, i.e. 0,
           | to avoid random b00bs sort of tokens.
        
             | travisgriggs wrote:
             | Let's not get sexist here! It also avoids random d1ck and
             | c0ck sorts of tokens.
        
             | madcaptenor wrote:
             | Sure, but I can see people still getting offended if "fck"
             | showed up.
        
               | Tangurena2 wrote:
               | This is why an open source project is now "CK Editor". It
               | was the author's initials, but too many people saw an
               | extra vowel in the name of the project.
               | 
               | https://en.wikipedia.org/wiki/CKEditor
        
           | inopinatus wrote:
           | Having a zero in your alphabet can be problematic, because
           | leading zeros are often stripped (e.g. Excel notoriously
           | mangles phone numbers thinking they are integers).
           | 
           | Multiple calls to a randomness generator can be expensive,
           | and a waste of entropy; production-scale random string
           | generators should still respect this and ask for a block of
           | bytes, then encode them, but with bias correction. You're off
           | the hook in this case, I think node's implementation of
           | randomInt is doing exactly that for you and conserving
           | remaining entropy in a cache.
        
         | scrollaway wrote:
         | I don't recommend using random bytes for this. What I have done
         | in my previous projects is take a uuid6, reserve a couple of
         | bytes inside of it to replace with an object ID, and I convert
         | that to obj_XXXXXXX.
         | 
         | This means you can store them in the db not as a string but as
         | a uuid, which is a lot more performant. You also get time
         | stamping for free.
        
         | cogman10 wrote:
         | A problem with this approach is it's not monotonical.
         | 
         | Especially if you want to use this thing as an index in a
         | database, you'll run into problems where you try doing middle
         | insertions frequently, which causes fragmentation.
         | 
         | The solution to this problem is making the higher order
         | characters time sorted [1]. You don't need to go all out like
         | uuid, you can have a pretty low resolution. It's more important
         | that new insertions tend to be on the same page. If you have a
         | low frequency insertions then minute resolution is probably
         | good enough. (Minutes since 2000 is an easy calculation).
         | 
         | To implement that here, I'd suggest looking at how base64 or 85
         | encoders work and use that instead of repeated mods. You can
         | then dedicate the upper bits to a time component and the lower
         | bits can remain random. [2]
         | 
         | [1] https://vladmihalcea.com/uuid-database-primary-key/
         | 
         | [2] https://github.com/mklemm/base-n-codec-
         | java/blob/master/src/...
        
           | reissbaker wrote:
           | IMO it's nice to have two keys:
           | 
           | 1. An auto-incremented 64-bit (unless you have a good reason,
           | in which case 32-bit is fine) primary key, used internally
           | for foreign key relations. This will generally result in less
           | index bloat on associated tables, and fast initial inserts.
           | 
           | 2. A public-facing random string ID. Don't use this
           | internally (other than in an index on the table it's defined
           | for), since it's large. But this should be the only key you
           | expose to end-users, to prevent leaking data via the German
           | Tank Problem:
           | https://en.wikipedia.org/wiki/German_tank_problem
           | 
           | Only create the second key if this is data you're exposing to
           | users, of course -- for data that's only used internally,
           | just use the 64-bit auto-incremented PK and skip the added
           | index bloat entirely.
        
             | cogman10 wrote:
             | For the number 2, I think one issue is that you are going
             | to be semi-frequently whacking the db to do a mapping of
             | that random string id back to the real id. OK for smaller
             | entities but might be a pain if there's a lot of those ids
             | to wrangle. You can throw a secondary index on it, but that
             | will still have some minor fragmentation issues.
             | 
             | One benefit of a random id is if you are working with more
             | complex data models it can make creating those
             | easier/faster. Instead of having a centralized location to
             | get new ids from (the DB) you can create ids on the fly
             | from the application which can turn the write into a single
             | action from the application rather than a dance of
             | inserting the main table, getting the new id, then
             | inserting to the normalized tables.
        
               | caeril wrote:
               | That's why the Good Lord invented caching. In most
               | applications, 90% of your workload will be over ids less
               | than a week old, so your hit rate is likely to be pretty
               | high for this sort of mapping.
        
               | cogman10 wrote:
               | First hit can be a beast. It's workload/entity
               | determinant if caching is enough for this.
               | 
               | Not great if you are spending 1 minute on the first
               | lookup just to do the mapping.
        
             | rav wrote:
             | Instead of a random string ID, you can devise a fixed
             | secret key and expose the auto-incremented ID xor the fixed
             | secret key as the public-facing ID. This saves you the
             | separate index but still avoids the German tank problem.
             | But it gives you a new problem, namely a secret that's hard
             | or impossible to rotate.
        
               | cogman10 wrote:
               | This is insecure. Assuming the user can get a few key
               | examples (which, we assume they would be able to if the
               | german tank problem is a problem) then the secret can
               | easily be revealed. [1]
               | 
               | [1] https://dev.to/wrongbyte/cryptography-basics-
               | breaking-repeat...
        
               | senderista wrote:
               | XOR isn't secure enough, but you're on the right track.
               | Instead, use an actual block cipher.
        
             | WorldMaker wrote:
             | A different approach to solve both 1 and 2 is timestamp-
             | oriented IDs. You can get useful cache locality/less "index
             | bloat"/fast initial inserts if your keys can be easily
             | ordered in time. Sorted by timestamp means very similar
             | behavior to B-Tree appends of a monotonic integer, even
             | sometimes in the worst cases where "same moment" IDs aren't
             | monotonic and rely more on random entropy.
             | 
             | I got some great DB cache/index performance from ULIDs with
             | a bit of work to order the ULID timestamp bits in the way
             | the DB's 128-bit column "uuid" sort best supported.
             | 
             | Now that UUIDv7 is standardized we should hopefully see
             | good out-of-the-box collation for UUIDv7 in databases
             | sooner rather than later.
        
             | senderista wrote:
             | I don't understand why you need to maintain two separate
             | keys: instead of generating a random key, why not just
             | encrypt the auto-increment key using a secret key? This is
             | the approach used by e.g. cloud providers that use auto-
             | increment keys internally but don't want them to be
             | guessable.
        
               | cogman10 wrote:
               | I think the biggest problem with this approach is it
               | effectively pins you to a encryption key and algorithm
               | (unless you embed some information in the key that lets
               | you version the key, gotta think of that upfront).
               | 
               | Imagine, for example, that you picked DES and "kangaroo"
               | as the secret several years back. You are now pinned to
               | an algorithm and key with known security problems and a
               | weak key.
        
           | echelon wrote:
           | This is a great technical modification that can be made to
           | work with "Stripe"-alike IDs or tokens.
           | 
           | Another hack for advanced active-active situations where you
           | may need to route events before replication completes:
           | encoding the author shard / region in the lower order bytes.
           | 
           | There are lots of interesting primary key hacks for dealing
           | with physical or algorithmic complications.
        
           | jordanthoms wrote:
           | This is dependent on the database you are using - if it's a
           | key-sharded distributed database, you want to have insertions
           | evenly spread across the key space in order to avoid having
           | all the inserts go into a single shard (which could overload
           | it)
        
             | cogman10 wrote:
             | This is the great thing about using random bits for the
             | lower bits. Because you are unlikely to use more than say
             | 2^64 database nodes, any sharding algorithm will have to
             | figure out how to spread a key with 64 bits (or however
             | many bits are in your key) across n nodes.
             | 
             | Because of the random portion of the key, that means you'll
             | get good distribution so long as the distribution algorithm
             | isn't something stupid like relying solely on the highest
             | order bits.
        
           | groestl wrote:
           | > A problem with this approach is it's not monotonical
           | 
           | Whether or not that's bad fully depends on your platform and
           | the number of writes you do. If you're using a massively
           | distributed database like Datastore, Spanner etc, you want
           | random keys as to avoid hot spots for writes. They produce
           | contention.
        
             | cogman10 wrote:
             | Well, you'd still likely want psuedo-random keys. You'd
             | rather not have the underlying database doing extra work to
             | shuffle around records as the pages get jumbled.
             | 
             | One solution to that is having more complex keys. For
             | example, in one of our more contentious tables the index
             | includes an account id (32bit int) and then the id of the
             | entity being inserted. This causes inserts for a given
             | account to still be contiguous (resulting in less
             | fragmentation) while not creating a writing hotspot since
             | those writes are distributed across various clients.
        
               | groestl wrote:
               | Not disagreeing. Point is, you need to know your domain,
               | your technology, your write patterns, your downstream
               | systems, etc to decide if a specific key scheme works to
               | your advantage or not. All the more reason not to use
               | natural keys, as they lock you in in that regard.
        
               | cogman10 wrote:
               | Absolutely agree.
               | 
               | I don't know how you can successfully maintain or develop
               | software without developing an understanding of the
               | underlying domain. I've seen devs try that route and the
               | quality of their work has never been high.
        
           | hot_gril wrote:
           | Depends heavily on what kind of database/index. Going with
           | anything other than a random uuid4 adds complexity; for one,
           | do you want to expose time info? I'd rather default to uuid4
           | as the client-exposed ID* and only change if there's a solid
           | measured reason to.
           | 
           | * not the same as your internal DB row primary keys, which in
           | Postgres should usually be bigserial
        
             | cogman10 wrote:
             | Yes but also no.
             | 
             | Yes, in that there are DB technologies not built in a
             | fashion where records are stored in a sorted order of some
             | fashion. No, in that they are very much not common
             | technologies. Most databases, relational, non-relational,
             | etc have some form of a B-Tree at their core somewhere.
        
               | hot_gril wrote:
               | I can see this. Spanner is an example where you don't
               | want this, idk if that's considered common enough.
               | Postgres and MySQL both support hash indexes that are
               | unordered, but the default in both is btree, and Postgres
               | hash indexes used to have some caveats that made them
               | unsuitable (idk about now) so I've gotten in the habit of
               | just using the default.
               | 
               | MySQL docs claim that a hash index is much faster if you
               | only need kv lookups, so it seems like uuid4 with hash
               | index would be suitable. Never tried it though, and can't
               | say whether it's faster than using a btree with uuid7.
               | Seems like in theory it would be.
        
               | cogman10 wrote:
               | As always, depends on the implementation.
               | 
               | Hash maps should generally always have faster lookups
               | than Btree based structures. However, they'll have slower
               | writes especially when contested. A key issue hash tables
               | have to deal with is what happens when a remapping needs
               | to happen. For example, when 2 keys have the same hash.
               | In that case, locking becomes a lot more messy.
               | 
               | For a btree this is simpler. It's built to be able to
               | handle reshuffling and rebalancing in a way that's semi
               | easy to have fine grained locks around.
        
               | hot_gril wrote:
               | Hm yeah, now I'm wondering how MySQL implements it.
               | Consistent hashing is one way to limit the scope of a
               | remap.
        
         | giancarlostoro wrote:
         | If I'm going to do that I think I'd use Bitcoins BASE58 which
         | avoids letters that could be confused for each other. The
         | number of times I see an O and 0 and wonder which is which,
         | because the font does not make it clear really annoys me.
         | 
         | Edit:
         | 
         | Other honorable mentions: ObjectID's as used by MongoDB which
         | contain the creation timestamp. Also Discord's snowflakes
         | (inspired by Twitter's iirc), which also contain the creation
         | timestamp.
        
           | kibwen wrote:
           | If you want random IDs to be human-readable (and human-
           | communicatable), I'd just recommend base 32 or even base16.
           | You don't actually save that many bytes from base58 or base64
           | when it comes to short IDs.
           | 
           | Case in point, the parent poster's base64 ID is 14 characters
           | long. When encoded as base32 that's still only 17 characters
           | (or 19 in base16), and now you have completely gotten rid of
           | all notion of casing, which is annoying to communicate
           | verbally.
        
             | giancarlostoro wrote:
             | I think the more I think of it, the more I favor the
             | Discord snowflake ID's because they're just integers, and
             | they can be generated on the fly. I think new messages
             | generated them on the client if I'm not mistaken.
        
         | cimnine wrote:
         | Yes, and I like to combine two established concepts instead of
         | rolling my own: URI and UUIDv7. So my IDs become `uri:customer_
         | shortname:product_or_project_name:entity_type:uuid`. An example
         | ID could be `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6
         | f-a09c-74b9ef7f4196`. It's even possible to cascade such IDs,
         | for example: `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b
         | 6f-a09c-74b9ef7f4196:line_item:018fe882-43b2-77bb-8050-a1139303
         | bb65`.
         | 
         | It's immediately clear, when I see an ID in a log somewhere or
         | when a customer sends me an ID to debug something, to which
         | customer, system and entity such an ID belongs.
         | 
         | UUIDv7 is monotonic, so it's nice for the database. Those IDs
         | are not as 'human-readable' for the average Joe, but for me as
         | an engineer it's a bliss.
         | 
         | Often I also encode ID's I retrieve from external systems this
         | way: `uri:3rd_party_vendor:system_name:entity_type:external_id`
         | (e.g.
         | `uri:ycombinator:hackernews:item:40580549:comment:40582365`
         | might refer to this comment).
        
           | michaelt wrote:
           | _> It 's even possible to cascade such IDs, for example: `uri
           | :cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7
           | f4196:line_item:018fe882-43b2-77bb-8050-a1139303bb65`._
           | 
           | Let me guess - you're a Java developer, right?
        
         | phkahler wrote:
         | Something like that should have a built in check "digit" if
         | people are going to see it and possibly type it in.
         | 
         | For numeric values, making them all a multiple of 11 is a
         | simple way to catch all single digit errors or single
         | transpositions.
        
           | chuckadams wrote:
           | That's why VINs make a decent natural key, because they do
           | have a check digit. Plus they're not opaque: if you look up
           | the VIN and the make/model/year is completely different than
           | the car in front of you, you know you either have the wrong
           | VIN or the wrong car.
        
         | psychoslave wrote:
         | To my mind, it always felt so saddening that adoption of a
         | truly straightforwardly readable notation for numbers never
         | took of. I mean it's so easy to do. You can start for example
         | with a single syllable per digit, and for example only target
         | CV syllables.
         | 
         | From this there is many possibilities, but for example, let's
         | consider only a base ten. Starting with vowels order o, i, e,
         | a, u with mnemonic o, i graphically close to 0, 1 and then
         | cyclically continue the reverse order in alphabet (<-a, <-e,
         | <-i, <-o*, |-u). We now only need two consonants for the two
         | series of 5 cardinals in our base ten, let's say k and n.
         | 
         | So in a quick and dirty ruby implementation that could be
         | something like:                   $digits = %w{k
         | n}.product(%w{o i e a u}).map{it.join('')}         def
         | euphonize(number) =
         | number.to_s.split('').map{$digits[it.to_i]}.join('-')
         | euphonize(1234567890) # => "ki-ke-ka-ku-no-ni-ne-na-nu-ko"
         | 
         | That's just one simple example of course, there are plenty of
         | other options in the same vein. It's easy to create "syllabo-
         | digit" sets for larger bases just adding more consonants, go
         | with some CVC or even up to C0C1VC0C1 if sets for C0 and C1 are
         | carefully picked.
        
           | psychoslave wrote:
           | And of course a naive implementation of the reverse is also
           | trivial:                  def numerize(euphonism) = euphonism
           | .split(?-).map{$digits.find_index(it)}.map{it.to_s}.join.to_i
        
           | jmcphers wrote:
           | There is a old system for making numbers pronounceable as
           | words using a mapping from each number to a consonant value.
           | It's typically used to help memorize numbers:
           | 
           | https://en.wikipedia.org/wiki/Mnemonic_major_system
        
           | samatman wrote:
           | This is how the Urbit address scheme works, btw:
           | https://urbit.org/blog/the-urbit-address-space
        
       | mkleczek wrote:
       | The question is: if you have two restaurants having different
       | scores but distinguishable only by their surrogate keys, how do
       | you know which one to go to?
       | 
       | In other words - using surrogate key is an attempt (and the wrong
       | one!) to fix the problem of missing important information in the
       | database.
        
         | big_whack wrote:
         | If you use a surrogate key, you still need a unique constraint
         | in the table (probably the same columns you would otherwise
         | call your natural PK). If your unique constraint isn't
         | sufficient to capture the difference you mention, you need to
         | add more columns.
         | 
         | However, that's strictly better than the natural PK situation,
         | where you would need to not only add new columns to the key,
         | but also add those columns to all referencing tables.
        
           | mkleczek wrote:
           | > However, that's strictly better than the natural PK
           | situation, where you would need to not only add new columns
           | to the key, but also add those columns to all referencing
           | tables.
           | 
           | Foreign keys referencing surrogate key has different
           | semantics than fk referencing natural key - it is a can of
           | worms actually and can lead to unexpected anomalies.
           | 
           | Lets take the example from the article (with surrogate key):
           | 
           | Restaurant(id, name, city)
           | 
           | Now let's add a possibility to record visits:
           | 
           | Visit(restaurant_id references Restaurant(id), user, date)
           | 
           | We have a procedure to register visits to a restaurant:
           | 
           | register_visit(restaurant_name, user_name, date_of_visit) {
           | INSERT INTO visit SELECT id, user_name, date_of_visit FROM
           | restaurant WHERE name = restaurant_name }
           | 
           | I very much enjoy spending time in "Polish Kielbasa"
           | restaurant in Warsaw and I visit it everyday - I don't visit
           | any other restaurant at all.
           | 
           | Now changes of a restaurant name will lead to the database
           | containing misinformation:
           | 
           | register_visit('Polish Kielbasa', 'mkleczek', 2024-6-4);
           | update restaurant set name = 'Old Kielbasa' where name =
           | 'Polish Kielbasa' and city = 'Warsaw'; insert into restaurant
           | ('Polish Kielbasa', 'Warsaw'); register_visit('Polish
           | Kielbasa', 'mkleczek', 2024-6-4);
           | 
           | Question: what restaurants did I visit this year?
           | 
           | This kind of anomalies are avoided using _natural_ keys and -
           | first of all - defining proper _predicate_ for _each_
           | relation.
           | 
           | The predicate of relation visit(restaurant_name, city, user,
           | date) is quite obvious: "User [user] visited restaurant
           | [restaurant_name] in [city] on [date]"
           | 
           | Question: What is the predicate of relation
           | visit(restaurant_id, user, date)?
        
       | begueradj wrote:
       | One of the most interesting part to me is the fact he is a
       | university professor in a good country but rides a used/second
       | hand car... in an era where status and success are highlighted by
       | displaying what one buys and owns on social media.
        
         | vanderZwan wrote:
         | Highlighted by _whom_? University professors?
        
         | mobiuscog wrote:
         | > in an era where status and success are highlighted by
         | displaying what one buys and owns on social media
         | 
         | This is no different from the people who used to show off their
         | rolex in the pub, or park their BMW on display thinking it was
         | impressive.
         | 
         | The 'show' may have moved, but it's still the same people who
         | gain the same amount of 'respect' that they ever did (not
         | much).
         | 
         | Besides, I would guess that most professors are not so worried
         | about their car (and in fact many may prefer a more sustainable
         | mode of transport) as they are to their research and citations,
         | and the advancement of their students.
        
         | lordnacho wrote:
         | Cars in Denmark carry a huge tax, it's not unusual that you can
         | only afford a used car. Plus public transport in the cities is
         | pretty ok.
        
         | rsynnott wrote:
         | > in an era where status and success are highlighted by
         | displaying what one buys and owns on social media.
         | 
         | I mean, not for everyone? These things, pretty much regardless
         | of era, matter a lot to some people and not at all to others;
         | for still others it may be a negative signal of sorts.
         | 
         | Anecdotally, I know a lot of very well-off people; well under
         | half would have fancy cars.
        
         | fuzzfactor wrote:
         | This is an educated person who shows higher-class tendencies by
         | riding a bicycle, compared to those who seek mere automotive
         | eminence.
         | 
         | Even if the bicycle is older than the car, perhaps even more
         | so.
        
       | iso8859-1 wrote:
       | How it is possible for an article from 2024 to have a comment
       | from 2009?
        
         | jimbobthrowawy wrote:
         | I think it's a joke about natural keys, based on the date added
         | to the comment:
         | https://github.com/ploeh/ploeh.github.com/commit/6149bcdf0e2...
         | 
         | I think the date was chosen because that's when the blog
         | originated after the author left microsoft.
        
       | lysecret wrote:
       | Again, every time I have seen natural keys in a big long lived
       | system I have seen duplication. This is something you will have
       | to deal with.
        
       | porker wrote:
       | As with everything, natural keys are sometimes useful or even
       | good. Where business rules fit, I like using them. But you have
       | to know that the value will never change or be reused.
        
       | mootothemax wrote:
       | You can get performance benefits from using natural keys, as many
       | databases store rows in the order of a table's primary key
       | (sometimes called the _clustered_ index, though it may or may not
       | have a unique constraint requirement depending on the DBMS and a
       | few other bits).
       | 
       | In the author's example, if the first column in the natural index
       | was the city name (or city ID!), and locations are often pulled
       | from the database by city, you'll see a read time performance
       | benefit because each cityName's restaurants will be stored
       | together.
       | 
       | This is why UUID-based systems can suffer worse read + write
       | performance; their rows will be stored in the order of its UUIDs
       | (that is, randomly spread around), making read, insert, and
       | update performance lower.
       | 
       | What to do? I favour a mixed approach: have a unique integer ID
       | column used internally, expose a unique UUID to the public where
       | necessary and - with a BIG DO NOT OPTIMIZE PREMATURELY warning -
       | really think about how the data is going to be queried, updated,
       | and inserted. Ask - does it makes sense to create a clustered
       | index based on the table data? Is there enough data to make it
       | worthwhile? Where can the index fail if changes need to be made?
       | Under some circumstances, it might even make sense to use a
       | natural key with the integer column included right at the end!
       | 
       | The only hard rule I have is using UUIDs for clustered indexes.
       | Unless the tables are teeny-tiny, the system is most likely
       | suffering without anyone being aware of it.
        
         | williamdclt wrote:
         | > many databases store rows in the order of a table's primary
         | key (sometimes called the clustered index [...])
         | 
         | Note for readers: Postgres doesn't do that
        
           | mootothemax wrote:
           | >> many databases store rows in the order of a table's
           | primary key (sometimes called the clustered index [...])
           | 
           | >Note for readers: Postgres doesn't do that
           | 
           | And I am climbing out of the rabbit hole that is Postgres'
           | CLUSTER keyword :) Really funky, looks like it's a one-time
           | operation that isn't persisted?
           | 
           | Looks like Postgres stores its index entries in ascending
           | order by default (NULLs at the end), and if so, the point's
           | worth keeping in mind on that front too.
           | 
           | I _really_ need to do a deep dive on Postgres internals one
           | of these days, it 's an amazing - and different! - system its
           | developers have created.
        
           | sgarland wrote:
           | Nope, but it does have a Visibility Map due to its MVCC
           | implementation. Go ahead and do an index-only lookup on a
           | table using a UUIDv4. Then repeat with something k-sortable.
           | See how many buffer hits each one has.
           | 
           | I assure you, those add up. Nothing like explaining to a dev
           | that they have hit the theoretical memory transfer limit of
           | the hardware, and it cannot possibly go any faster unless
           | they refactor their schema (or a faster DDR version comes
           | out).
        
         | nicce wrote:
         | > This is why UUID-based systems can suffer worse read + write
         | performance; their rows will be stored in the order of its
         | UUIDs (that is, randomly spread around), making read, insert,
         | and update performance lower.
         | 
         | This isn't always the case anymore. UUID standards have been
         | developed in that mind, and they are not completely random
         | anymore. UUIDs can give a hint, for example, about the time
         | when it was created, which gives them some order.
        
       | usrbinbash wrote:
       | The problem is: As soon as a synthetic key is used by an
       | _EXTERNAL_ system, it stops being synthetic; for all intents and
       | purposes it is now an integral part of the record, because if you
       | change it, the external system may not be able to find the record
       | any more.
        
       | kstenerud wrote:
       | In databases, never rely on data you don't control. "Natural"
       | keys are an example of this.
       | 
       | Names can be natural keys, but you don't control them. You don't
       | control when or how a name changes, or even what makes a valid
       | name.
       | 
       | Addresses change. Or disappear. Or somehow can't be ingested by
       | your system suddenly.
       | 
       | Official registration numbers (SSNs, license plate numbers,
       | business numbers etc) seem attractive, but once again you don't
       | control them. So if the license plate numbering scheme changes in
       | some way that breaks your system, too bad. Or people without an
       | SSN. Or people in transition because an SSN needs to be changed
       | in a government system somewhere. Or any other number of things
       | that happen in a government office that affect you, yet you have
       | no control over.
       | 
       | Phone numbers? Well, we've already seen that mess with many
       | messenger platforms.
       | 
       | Fingerprints? Guess what? They evolve over time, and your system
       | will eventually break.
       | 
       | Retrofitting a system that relies on "natural" keys that have
       | broken SUCKS.
       | 
       | Use a generated unique key system that YOU ALONE control.
       | 
       | The first rule of software design is: Don't try to be clever.
       | You're not clever enough to see all of the edge cases that will
       | eventually bite you.
        
         | wvh wrote:
         | I agree with your general point, but it might still be cheaper
         | to redesign your system if a rare breaking change happens to
         | usually reliably stable external registration systems, rather
         | than to pay the cost of a thousand paper cuts by indirection,
         | computational cost or human confusion.
         | 
         | One edge case and good indicator is if your system is testable
         | by itself without third party cooperation and the ability to,
         | for instance, create license plate numbers.
        
           | kstenerud wrote:
           | Every system that uses SSN as the primary key is fully
           | testable by itself without third party cooperation. And yet
           | every one of these systems has required workarounds because
           | what the designers thought to be invariants, weren't.
           | 
           | And this is ALWAYS the case with identifiers that you don't
           | control. You don't make policy decisions about them, but
           | SOMEBODY does. And that somebody isn't aware of - and
           | wouldn't even care - about the invariants that you ASSUMED
           | they followed (and maybe they really did follow them, but
           | they sure don't anymore! Oops...)
           | 
           | Fixing broken invariants after the fact is always a
           | nightmare, because it only comes up once you get stuck -
           | either you can't enter something into the database that you
           | absolutely MUST by the weekend, or you can't change something
           | in the database that you absolutely MUST by the weekend. So
           | you do some last minute hacks to get things kind of working,
           | and then it works for awhile until the next problem (usually
           | involving your hack).
           | 
           | It's hardly any extra work or complexity to just use an ID
           | generator for the primary key. You'll still have the same
           | indices, the same foreign key linkages etc. You have no
           | reason not to do this.
           | 
           | And yet somehow people always seem to fall for the "Oh cool!
           | This existing ID does everything we want! Let's just use that
           | instead of adding one more field to the table! I'm so
           | clever!"
        
         | jimbobthrowawy wrote:
         | Licence plate numbers are an interesting one, since what those
         | mean varies from country to country. Here (Ireland), they are
         | assigned to the car itself via VIN, are _never_ meant to change
         | once assigned, and are backdated based on information about the
         | vehicle itself (e.g. year of first registration even if first
         | registered in another country, following an older format if
         | applicable), but in other countries they can be reassigned to
         | other vehicles.
         | 
         | e.g. 06-LK-12345 might be your car's plate if you move your
         | foreign 2006 car here and register it while living in Limerick,
         | but buying a new car might give you a plate like 242-L-12345
         | since the format of the first two fields changed since. If you
         | leave and later return with that car, re-registering it gives
         | you the exact same number.
         | 
         | https://en.wikipedia.org/wiki/Vehicle_registration_plates_of...
        
           | akira2501 wrote:
           | You don't have vanity plates in Ireland?
        
             | ThePowerOfFuet wrote:
             | Many, many countries do not.
        
               | cheema33 wrote:
               | Vanity plates are essentially free money for the govt. I
               | am surprised every govt. body doesn't go for it. If you
               | really don't like them, make them super expensive.
        
               | rsynnott wrote:
               | Tbh when you factor in the administration, I'd question
               | whether they actually make any money on these.
        
             | Macha wrote:
             | There aren't any, the closest is that some authorities
             | reserve low sequence number plates for e.g. the local mayor
        
             | jimbobthrowawy wrote:
             | You can get vanity numbers, but 1 is reserved for the
             | region's mayor, and anything 1000 or below is assigned
             | based on a raffle system. Other than that, it's first come
             | first served.
             | 
             | It can only go on that one car though, so there's not as
             | much value in doing it. I've mainly seen bus rental
             | companies do it.
        
         | andrelaszlo wrote:
         | Official registration numbers, such as Swedish personal
         | identification number, or "personnummer" (date of birth +
         | serial + checksum [Luhn], where even serials are used for
         | females and odd for males):
         | 
         | - It can take a few days before a newborn is assigned a number
         | 
         | - Non-citizens don't have one, but they can get a coordination
         | number on the same format but with the date part incremented by
         | 60 days.
         | 
         | - Citizens can have both a coordination number and a personal
         | identification number in certain cases.
         | 
         | - They can be changed if the wrong birth date or gender
         | registered at birth or during immigration, for protected
         | identities, or for gender transitions.
        
           | Aaargh20318 wrote:
           | In the Netherlands we have something similar: the BSN
           | (translated: citizen service number). This used to be the
           | social security number, but because it was used by an
           | increasing number or government agencies unrelated to social
           | security it was changed to BSN.
           | 
           | One of the major problems with this number is that it has a
           | special status under the law. There are very strict rules as
           | to who can process and/or store this number and for which
           | purpose. For example: a bank can process this when opening a
           | bank account, under anti money laundering regulations, but
           | they cannot use it to identify an existing customer.
           | 
           | If you originally set up your database to use SSNs you now
           | have a problem. This actually happened with our chamber of
           | commerce: if you registered a one-man business they used the
           | SSN as the business id and you're required as a business to
           | publish this. Now it's suddenly a number that is subject to
           | strict privacy rules and they have to renumber all one-man
           | businesses.
           | 
           | So that's another problem with data you don't control: the
           | legal status of this data can change.
        
           | wasmitnetzen wrote:
           | Nitpick: long-term residents get a personnummer as well, not
           | just citizens.
        
         | nailer wrote:
         | I was this recently. I moved to US in November and it was
         | February before I had an SSN. A bunch of companies had to put
         | in fake SSNs into their system, which they have a standard for.
        
       | zer0tonin wrote:
       | Yes, except sometime whole industries come together already has
       | developed a system of unique identifier, and I can just use this
       | as a natural key. For example, the Nasdaq stock exchange is
       | identified perfectly fine by the string 'XNAS', I'm not going to
       | generate an ID on top of it.
        
       | canterburry wrote:
       | I don't think I have seen natural keys used in a database since
       | the 90s.
       | 
       | While I too was taught natural keys, or combined keys, were the
       | "intended" way to identify data, I was corrected very quickly
       | that artificial primary keys were much more reliable and more
       | convenient due to all the auto increment features etc. I am
       | surprised this is even talked about anymore.
        
       | jbverschoor wrote:
       | The amount of inexperience is staggering
        
       | nuancebydefault wrote:
       | I'm a bit confused. I've never programmed databases, though i
       | have a lot of programming experience.
       | 
       | The student made a structure that can be seen as a unit or table
       | in a database. Is the database not clever enough to synthesize a
       | unique id? Why does the programmer need to care about keys?
       | 
       | It is clear that the end user will not know anything about a key,
       | it might be a byte offset in a file or a memory pointer. Why does
       | the programmer need a key?
        
       | alanhaha wrote:
       | ~~You'll regret~~ I regretted
        
       | contingencies wrote:
       | My traditional schema was always to have _id_ in every table and
       | to make it an integer, then for pretty much anything it 'd be two
       | tables, _nouns_ for the thing and a one-to-many relationship with
       | _noun_names_ for the names, where _noun_names_ was comprised of
       | _id /noun/language/name_, keyed on _noun_ and _language_ , with
       | the _language_ column FK 'd to _languages.id_. I 'd also have a
       | _languages.code_ column for ISO or IANA code (sometimes both
       | columns or aliases), and a _language_names_ table with
       | _language_used /language_named_. So you could name English in
       | Frisian, and Chinese in Tibetan. Haven't needed to build SQL in
       | ages, though that schema works great. Certainly never had an
       | issue with it once UTF-8 became standard. If you build schemas
       | without i18n, you're asking for trouble.
       | 
       | One good thing about using a standard integer key is the cost of
       | indexes is low (low memory use). One good thing about using _id_
       | everywhere is that it 's short and self-explanatory to
       | programmers from any culture.
       | 
       | Lots of casual queries built of subqueries like...
       | 
       |  _select * from noun_names where noun=(select id from nouns where
       | ...) and language=(select id from language where code= 'en');_
       | 
       | Always felt this was the most readable. Always felt that
       | LEFT/RIGHT JOIN stuff was bonkers. Onboarded a lot of serious
       | junior devs, never had an issue.
        
       | praptak wrote:
       | Modeling identity doesn't end with synthetic keys though.
       | "Generate own keys" only solves the problem of identity
       | ownership. You now own the identity but it doesn't mean your
       | identities are as they should be.
       | 
       | Say a customer used a different government-issued ID to re-
       | register with your bank. A year down the line you notice that you
       | have two identities for the same person. It might be a meh for an
       | online game but if you are a bank this can make you run afoul of
       | regulations. Can you handle the merge of all the relevant data?
       | And merging is usually the easier of the two glitches - can you
       | handle a split?
       | 
       | The point is that identity just like security requires thought
       | from the start of the design.
       | 
       | For a domain where identity is really hairy (although admittedly
       | with less consequences for screwing up) see
       | https://news.ycombinator.com/item?id=4493959 "The music
       | classifying nightmare". Also
       | https://en.wikipedia.org/wiki/Identity_(philosophy)#Metaphys...
       | for some philosophical perspective.
        
       | meekaaku wrote:
       | I am firmly in the control your primary key, preferably uuid.
       | 
       | Quick question for database gurus here. Is it ok to have a
       | currency table using the currency code as primary key?
        
         | brabel wrote:
         | I think long term, that is a bad idea.
         | 
         | In Brazil in the 80's and early 90's, the currency changed name
         | several times. It was called "cruzado", then "cruzeiro", then
         | "cruzado novo" (yep, "new" <old-name>, very creative) and then
         | I think it went back to just "cruzado" :D before finally
         | becoming Real (which I believe was the name of the currency
         | also during Monarchy, 100 years earlier).
        
         | kassner wrote:
         | Likely no, there are currencies without an ISO 4217 code, i.e.:
         | https://en.wikipedia.org/wiki/Faroese_kr%C3%B3na
        
           | zarzavat wrote:
           | This is interesting but it doesn't really seem like a
           | separate currency. It's more like Scottish pound vs English
           | pound, it's the same currency but different banknotes for
           | different regions in the same state.
        
       | bjourne wrote:
       | Hard disagree. You create tables for them to be queried by
       | applications. For the Restaurants example the application clearly
       | wasn't designed to handle Restaurant entries whose only field
       | differs is rank. Consequently, you shouldn't allow data the
       | application can't handle. Whenever the time comes and you think
       | "ah, wouldn't it be great if multiple people could review the
       | same restaurant in the same year?" you change the natural keys.
       | You can always relax the rules. Tightening the rules is much
       | harder because garbage data may already be present.
       | 
       | Keys are about the integrity of your application(s) and
       | preventing corner cases by making them impossible.
        
         | konschubert wrote:
         | You can use a UUID as a primary key and still enforce
         | uniqueness on a combined index of city and name.
        
           | nkozyra wrote:
           | If they conflict and are user facing identifiers aren't you
           | then forced to add uniqueness to the city and name, ala
           | /chicago/chipotle_s4, potentially bleeding some business
           | details?
           | 
           | I don't love uuids as public identifiers for a number of
           | reasons but not hinting details about your data is one nice
           | thing about them.
        
             | konschubert wrote:
             | Hmm, I am not sure if I get you.
             | 
             | The cool thing about enforcing uniqueness on a secondary
             | index is that you can just change or remove the uniqueness
             | constraint anytime without breaking foreign key relations.
        
               | nkozyra wrote:
               | I'm not sure why you want a uniqueness index then in this
               | case. If it's not publicly visible (effectively a
               | secondary index / reference), then you can't use it as an
               | identifier.
               | 
               | This means you need something else (a slug, as discussed
               | herein) or just use the ID (bad practice in general) or
               | UUID (bad for humans). If you use - in this example -
               | city and company name - you still have to enforce
               | uniqueness, so you have a pseudo-slug anyway.
        
           | bjourne wrote:
           | But what's the point? You're complicating the data for no
           | apparent gain. Only terrible RDBMSes which don't support
           | multi-column keys require surrogate keys.
        
             | konschubert wrote:
             | Because you can remove or change the uniqueness constraint
             | without having to worry about foreign key relationships.
        
               | bjourne wrote:
               | Eh, that's a feature of natural keys. If you try to fuck
               | up your model your RDBMS will complain.
        
         | zarzavat wrote:
         | You can't change primary keys, that's the point, because you
         | don't know _where_ they are.
         | 
         | For example if an old key is in a URL, and that URL is in a
         | browser bookmark, now you need redirects, so you need to keep
         | all the old keys around forever. Keys should be random or
         | sequential, never contain information.
         | 
         | If you want to enforce uniqueness then use a unique
         | index/constraint.
        
           | bjourne wrote:
           | I think you are confused about the terminology because a
           | primary key _is_ a uniqueness constraint. They can be changed
           | in any RDBMS worth its salt. Keys should not be random and
           | your URL example is a case in point. The url
           | /germany/berlin/2023/mcdonalds contains no surrogate key and
           | is immensely more useful than the url
           | /review?uuid=1kjksdhh3244ygdvvgdd2345.
        
             | zarzavat wrote:
             | A primary key _implies_ a uniqueness constraint. But you
             | can have a uniqueness constraint without a primary key.
             | 
             | You can change primary keys within a database but you can't
             | change them outside the database. A database is a map but
             | keys are part of the territory.
             | 
             | The second URL with the UUID is more secure and is
             | preferred in many situations where information leakage is a
             | concern.
             | 
             | The first URL is more descriptive but more prone to
             | breakage. It depends on the situation.
        
               | bjourne wrote:
               | A key is any set of unique columns. The primary in
               | primary key just hints to the RDBMS how to store the
               | data. Both URLs encode queries on the data. The first
               | says "give me the 2023 review(s) of McDonald's in Berlin"
               | (i.e., it has semantics) and the second "give me the row
               | with uuid 1kjksdhh3244ygdvvgdd2345". I don't think the
               | second URL is more robust because, for example, if the
               | row has to be recreated it will fail to work.
        
       | fiatjaf wrote:
       | Such a long article, no convincing examples. Why did he waste so
       | long on the stupid restaurant case that clearly makes no sense?
       | Natural keys are good as long as you think one minute before
       | deciding to use them.
        
       | bdcravens wrote:
       | Another surprising type of data that people think are unique are
       | tracking numbers. Some carriers (like UPS) do recycle them.
        
       | shireboy wrote:
       | In the example from the article I think the table should be named
       | RestaurantRank(year,rank,restaurantid), with a natural key
       | year,rank. It would have a fk to Restaurant(Id,name,etc). Choose
       | a natural key if there is some natural unique field or
       | combination. In this case unless restaurants can tie in their
       | ranking year,rank can be unique
        
         | noisy_boy wrote:
         | You'll have to modify the natural key if later it is decided to
         | add rankCategory ("best overall", "best ambience", "best value"
         | etc) via rankCategoryId foreign key to the RestaurantRank table
         | because a given restaurant can have the same rank in the same
         | year under multiple categories. A synthetic key avoids you
         | having to mess with the key to handle such cases.
        
       | tigerBL00D wrote:
       | This is a bit contrived to illustrate a problematic case. Every
       | business has a unique ID that's assigned to it before it can
       | operate. EIN or even a state id together with state would be a
       | fine natural key for most situations.
        
       | d--b wrote:
       | > what if we had a restaurant named China Wok in Springfield,
       | USA? Hardly unique. Add the state, you say? Probably still not
       | unique.
       | 
       | Ok sure, but then you have 2 restaurants which are
       | indistinguishable from one another in your database. It doesn't
       | matter that the thing has a unique id next to it. You can't know
       | which is which. That's not useful.
        
       | sashokbg wrote:
       | I disagree with this article. The author takes some very edgy
       | cases such as "undergraduate student" and "transgender CRP
       | number".
       | 
       | Even the example for the restaurant and "time based id number"
       | are bad because they all indicate that you just badly identified
       | the entities (in DDD terms) and their identity.
       | 
       | So being bad at DDD doesn't mean that you can't use natural keys
       | (although I myself have arguments against them)
        
       | nostrademons wrote:
       | There's a better solution for many of the exceptional cases that
       | the author describes: aliases & audit logs.
       | 
       | Take for example the Danish CPR number. That's perfectly fine as
       | a natural key; its definition is the _first_ CPR number assigned.
       | If a person 's CPR number changes because they've changed their
       | gender, you will want a separate table recording a.) the date of
       | the change. The new CPR number is not valid before that time b.)
       | the new gender c.) probably the _reason_ for the CPR number
       | change, since if the policy now is that they can change because
       | of a gender change, there 's a decent chance they'll be some
       | other policy in the future that results in a new CPR issuance.
       | 
       | Or the chassis number. Also fine as a natural key. If it's
       | changed because of a data-entry error, you also want to record
       | a.) the date of change b.) _who_ changed it. This opens up a
       | whole host of auditing, monitoring, and reporting functionality
       | that eg. lets you catch fraud, determines if a single person is
       | being sloppy, identify mass changes in policy, notify and update
       | external records of owners, etc.
       | 
       | URLs are another good natural key: they are defined to be unique
       | (otherwise your webserver won't work), they make for very easy
       | lookups when you're fetching from a web request, and if they
       | change, they break the web. Except that they do change. But when
       | a URL changes, _you don 't want to just update them in the
       | database everywhere_, because again, that will break the web. You
       | want to leave a redirect from the old to the new one. So you
       | create a redirects table of all the other aliases that point to a
       | given page, use it to generate server redirects, and you can
       | throw in other data like the time of change or hit counts on each
       | individual alias.
        
         | tapland wrote:
         | It's not really fine. I work on a Healthcare system in the
         | nordics.
         | 
         | Who you billed, who visited what doctor, who your primary care
         | provider is, all the people a doctors office has as patients,
         | refers to the SSN.
         | 
         | You don't want to lose that connection or to have to update
         | everything for any change.
         | 
         | You store a unique identifier for the person in the system, and
         | you can then pull the actual personal identification number
         | when needed.
         | 
         | You do not keep individual private lists of people changing
         | genders.
        
           | tapland wrote:
           | > You do not keep individual private lists of people changing
           | genders.
           | 
           | On this note: Some organization in Sweden actually does store
           | changes like suggested in the top comment. Both the tax
           | agency and police were interested in knowing who did, and who
           | had access, since it's the cause of recent mass doxxing
           | events.
        
           | nikita2206 wrote:
           | But there's no natural key for a person, not even the SSN.
           | This article also starts off by showing an example where no
           | natural key exists (Restaurants) and acknowledges it.
        
         | ironchef wrote:
         | You appear to be describing a classic type 2 or type 4 slowly
         | changing dimension.
         | (https://en.wikipedia.org/wiki/Slowly_changing_dimension )
        
         | lolinder wrote:
         | I'm a fan of audit logs, but what you're describing is more
         | event sourcing than audit logs. From what I understand what
         | you're suggesting is that you don't change the key in the
         | original table, you just record the change in a second table.
         | Presumably, then, when you need to read a row you must also
         | look up the list of diffs for that row to make sure that the
         | natural key hasn't changed.
         | 
         | Two things strike me about this proposed model:
         | 
         | First, under your proposal the version of the model that you
         | work with at the application layer _must_ have two copies of
         | the key field--one is the database key for when you need to
         | make changes or look up more data and one is the meaningful
         | business-model field that 's actually up to date. That's
         | exactly the same extra mental overhead that natural keys are
         | supposed to have solved, only worse because the fields will
         | have similar names and will contain the same content most of
         | the time, making it easy to accidentally use one where the
         | other was expected.
         | 
         | Second, if you're going to introduce effectively an event-
         | sourced data model then you've introduced a ton of new database
         | records already, so why not just give everything a proper
         | unique key while you're at it? Once you've done that you can
         | modify the original row after all (while retaining the audit
         | logs!) or, if you're serious about event sourcing, cache the
         | latest derived value and look it up by its database key instead
         | of carrying around an out of date natural key that's just
         | waiting to be used incorrectly.
        
         | phkahler wrote:
         | >> If a person's CPR number changes because they've changed
         | their gender, you will want a separate table recording a.) the
         | date of the change. The new CPR number is not valid before that
         | time b.) the new gender c.) probably the reason for the CPR
         | number change, since if the policy now is that they can change
         | because of a gender change, there's a decent chance they'll be
         | some other policy in the future that results in a new CPR
         | issuance.
         | 
         | But what if you have a bunch of records in some other table -
         | billing information for example, and it's indexed by the CPR
         | number (a foreign key). When they change the CPR number you can
         | no longer query for their entire billing history based on CPR.
         | None of your proposed extra complexity does anything about this
         | problem. The only good way to solve it is to use a synthetic
         | key with NO meaning. It would still be good to do as you say
         | and track all the CPR number changes for a given person, but
         | they will still need a unique key. So a sort of "identity
         | table" used to figure out what unique key you're dealing with.
        
           | wild_egg wrote:
           | Why could you not join with the audit table and find
           | historical billing information from the old CPRs?
        
             | lolinder wrote:
             | You could, but that introduces a lot of complexity that
             | would be saved by just giving the customer a synthetic ID
             | that uniquely and stably identifies them in your system.
             | 
             | The grandparent's proposal basically turns the first-
             | entered CPR into a meaningless ID field that should not
             | generally be used in a piece of business logic (unless for
             | some reason you need to display "first CPR they entered").
             | Once you've declared that you should look elsewhere if you
             | actually need the CPR, why did we even bother using it as
             | the primary key?
        
             | phkahler wrote:
             | >> Why could you not join with the audit table and find
             | historical billing information from the old CPRs?
             | 
             | 1) complexity and 2) what is actually tying the CPRs
             | together? We're not going to have a CPR table per-customer,
             | so all the CPRs of every customer are in the same table.
             | Presumably the CPR table has a unique key for the customer
             | that can be used to associate multiple CPRs with that
             | person, so we have come full circle - just use that unique
             | key in the audit table.
        
           | nostrademons wrote:
           | You need this complexity anyway - if you had surrogate keys,
           | the primary key would be an opaque identifier, and to do
           | anything involving the CPR, you'd need to join against your
           | CPR table (which, again, needs to be 1:many because CPRs
           | themselves are not a 1:1 relationship).
           | 
           | The first CPR in this case becomes identical to your
           | surrogate key - it's an opaque ID that you use to reference
           | other tables in the DB - but with the added benefit that _for
           | the common case, you don 't need any additional lookups_. You
           | only need to lookup CPR changes if you don't find the CPR
           | that the user gave you. And then it has other added benefits
           | in that you have a record of CPR changes, you can understand
           | how common a case this is, the CPR change table itself has
           | semantic meaning and you can query a wide variety of
           | properties without joining your primary user table, etc.
        
             | shkkmo wrote:
             | > but with the added benefit that for the common case, you
             | don't need any additional lookups. You only need to lookup
             | CPR changes if you don't find the CPR that the user gave
             | you
             | 
             | Doing the CPR lookup ONLY when you don't find a record is
             | really, really stupid. All you need is one bad cache, one
             | database update that go rolled back and suddenly you've
             | lost all recent updates the the user and will only see the
             | bad record with no indication you've failed.
             | 
             | Trying to use force changeable data into being a 'natural
             | key' means that the number of edge cases you have to
             | predict, write around and test for is going to rise
             | significantly for no real benefit.
        
             | lolinder wrote:
             | > The first CPR in this case becomes identical to your
             | surrogate key - it's an opaque ID that you use to reference
             | other tables in the DB - but with the added benefit that
             | for the common case, you don't need any additional lookups.
             | You only need to lookup CPR changes if you don't find the
             | CPR that the user gave you.
             | 
             | I'm not buying that this is a meaningful gain in either
             | performance or code complexity.
             | 
             | In the world of synthetic keys I look up the CPR in the CPR
             | table and join it to the user table using the synthetic ID.
             | If I find a record for the CPR+user join then I'm set, if I
             | don't then the customer doesn't exist.
             | 
             | In the world of natural keys, you're advocating that I
             | first query the user table directly by CPR. Then if I turn
             | nothing up I run a separate query with a join on the
             | original CPR. Then if I still don't turn something up the
             | customer doesn't exist.
             | 
             | The code in the second instance is obviously more
             | complicated than the code in the first instance. It has
             | increased risk of someone writing a bug because now there's
             | a very tempting CPR field that will be right most of the
             | time but wrong in some edge cases. Depending on the
             | database and usage patterns, indexing on the CPR may be
             | much less efficient than indexing on an autoincrementing
             | integer.
             | 
             | The only thing it has going for it is that I _might_ be
             | able to avoid a single join on a very specific path where a
             | user is looking up a customer by typing in the CPR. That
             | seems like the wrong thing to optimize for in the face of
             | all the downsides.
        
         | trevor-e wrote:
         | URLs can be tricky and have plenty of gotchas depending on what
         | you're trying to do. For example, the order of query params is
         | free to change but it's still the same URL. Nothing that can't
         | be worked around with a little normalization.
        
           | Karellen wrote:
           | > For example, the order of query params is free to change
           | but it's still the same URL.
           | 
           | Are you sure that's guaranteed by any spec? I thought an end-
           | point would be free to treat `?a=1&b=2` and `?b=2&a=1`
           | differently. I mean, it would be a nightmarish
           | implementation, but I don't think it would be non-conforming?
        
             | Piskvorrr wrote:
             | It's a mere convention: "a querystring of this type is an
             | array of parameters, the order of which is irrelevant" -
             | this is one of the tautological "valid except when it's
             | not" non-rules (a surprising number of cases).
             | 
             | Example: ?a=1&b=2&a=3 - will the server treat this to be
             | equivalent as ?a=3&b=2&a=1 , or ?b=2&a=1 , ?b=2&a=3 , or
             | something else entirely? You'd need to check the serverside
             | parsing implementation to be sure - those could even be
             | valid (and distinct) _filenames_ FWIW.
             | 
             | (And that's before you get to caching - "?b=2&a=1 is not to
             | be served as a cached version of ?a=1&b=2")
        
         | Pengtuzi wrote:
         | > URLs are another good natural key: they are defined to be
         | unique
         | 
         | No, see more info here: http://localhost:8080/
        
         | shkkmo wrote:
         | > Take for example the Danish CPR number. That's perfectly fine
         | as a natural key; its definition is the first CPR number
         | assigned.
         | 
         | This is a horrible idea. You now have two different pieces of
         | information that are identical in form and indistinguishable in
         | the majority of cases: "first CPR number" and "current CPR
         | number. Every place you enter, or use a CPR number, you now
         | must track which piece of information you have. If you make a
         | mistake doing this, it will be hard to catch. Now you've
         | decided that first piece of information is a natural key and
         | will be every single place the record is used, even if that
         | spot doesn't do anything specific to the CPR number. Every
         | single time a CPR number is ingested from an exterior source,
         | you need to do a lookup to make sure it is the original CPR
         | number and then track it. Ever place you don't do this lookup
         | is a place where an error can creep in if something changes in
         | your pipeline our source.
         | 
         | Even in an context where external sources are using something
         | like a CPR number as an key, I would still use a different key
         | internally since I see only downsides to using a "natural key".
         | 
         | > URLs are another good natural key: they are defined to be
         | unique (otherwise your webserver won't work), they make for
         | very easy lookups when you're fetching from a web request, and
         | if they change, they break the web
         | 
         | URLs are also horrible natural keys. They are not defined to be
         | unique and provide no guarantee that the content has not
         | changed or even that the same content is sent to different
         | users.
         | 
         | If the location for content changes, you may or may not get a
         | redirect. If you do get a redirect, you'd have to now go update
         | every single place that uses the key to the new value. It is
         | much better to map URLs to an artificial key and update that
         | mapping in a single place.
        
         | IggleSniggle wrote:
         | URLs are only really _uniform_ not necessarily _unique._ The
         | most obvious case is something like
         | `http://localhost/file.txt`, where the resource being located
         | is almost sure to be different on every single "localhost," but
         | this is true of _any server_. The pointer is neither unique nor
         | has any true guarantee that the resource being pointed to can
         | or should be considered unique for any given context. It is
         | merely unique within its calling context, which, when dealing
         | with URLs, is often _anywhere on the web where DNS responds._
         | 
         | Even if you presume that the resource being located is in fact
         | a unique resource in the general case, the "unique resource"
         | may not be unique in the way that you presume. Some URLs at a
         | given location are intended to be idempontent and cache-able,
         | others are not, and many are time-limited forwarders. And
         | there's no guarantee or even expectation that two identical
         | forwarding URL's will resolve to the same location; it may well
         | be network-topography dependent.
        
         | crote wrote:
         | > Take for example the Danish CPR number. That's perfectly fine
         | as a natural key; its definition is the first CPR number
         | assigned.
         | 
         | The problem is that the _new_ CPR number is now the one you
         | want to use for all display purposes and future interactions
         | with external systems. In other words, you can 't use the
         | "original CPR" field for anything _except_ as key. It 's no
         | longer a CPR field, because it no longer has any relation to
         | the person's CPR!
         | 
         | And at that point it'd be better to just use a GUID or
         | something as key and avoid any potential confusion between the
         | "real" CPR and "fake" CPR, because when the two are the same
         | 99% of the time it is _guaranteed_ to cause a shitton of bugs.
         | 
         | The only solution is to essentially rewrite all your records
         | with the new CPR as key, and leave a redirect entry at the old
         | CPR. That's pretty much what happens in Sweden when you change
         | your gender: your old identity _ceases to be_ and you 're
         | issued a completely new one.
        
       | stiiv wrote:
       | There is still something to be said for "real world uniqueness"
       | (GIS coordinates) or deferring to a third party to establish
       | identity (license plate numbers).
       | 
       | Identifiers like these aren't always available, but within many
       | domains will be sufficient.
       | 
       | The idea here is not that these keys can't be somehow "invalid,"
       | but rather that it isn't our system's problem -- it belongs to
       | some other authority.
        
         | jonstewart wrote:
         | TFA author cites many examples from human-oriented systems,
         | which perhaps are more common and traditional domains for RDBMS
         | design, where there often several layers of exceptions to the
         | rules - his argument that synthetic keys work well for these
         | domains lands.
         | 
         | I personally work far more with computer-oriented systems and
         | their data, and natural keys work well for me. When well-chosen
         | they allow me to do an initial load of the source data for
         | analysis, and then aggregate such databases together later on
         | for historical analysis without fear of conflict. The data are
         | often immutable in these domains, too.
        
       | hateful wrote:
       | After reading many threads here, I think the final ruling may be:
       | - If your record represents a physical being or object: Use a
       | surrogate key.
       | 
       | People change.
       | 
       | Unrelated - if you have a list of emails or SSNs or license
       | plates or VINs - we can think of these as foreign keys to a
       | database we don't control.
        
       | fl0ki wrote:
       | I inherited a project where a dozen different entity kinds were
       | all given UUIDv4s. Users mostly see namespaced strings, but
       | actually searching by those strings is janky and unreliable for
       | other reasons (including the fact that they _can_ change), so
       | whenever I 'm asked to debug something I insist on reducing to
       | UUIDs first so at least we're definitely looking at the same
       | entities.
       | 
       | I didn't like the UUIDs at first, but it ended up being an
       | unexpected boon for generic code to use the same key type for
       | different entity kinds. What was less of a boon was that the
       | string identifiers also have to be unique, but can change at any
       | time, and depend on three-level namespacing (yes really) so
       | there's far more that has to be tracked and enforced. The names
       | are very important for UI use cases, but can never be the way
       | that records reference other records, because that would just
       | make them much harder to change with confidence.
       | 
       | The essay seems to assume you'll have either unique natural keys
       | or unique synthetic keys, but having now worked on a project that
       | does both at the same time for many entities, I think it's a
       | third option worthy of its own analysis. My experience was
       | negative but I can't deny that the end result ticks a lot of
       | functional boxes.
        
       | thatjoeoverthr wrote:
       | One may notice this has something to do with mutability. If there
       | isn't a surrogate key, the record isn't mutable. The database may
       | let you change it, but the new record has a new identity.
       | Mutability as a concept requires a common identity across time.
       | Languages permitting mutability are using a pointer or reference
       | as an implicit surrogate identity. A typical database can't offer
       | this, hence the need to put explicit surrogate keys into the
       | schema. You cannot say "this changed" unless you can refer to
       | both samples as a common "this".
        
         | bjourne wrote:
         | While that is true, for correctness appends should supplant
         | updates. On HN, many users (like myself) have posted comments
         | for over a decade. Suppose I changed my username from "bjourne"
         | to "SpongeBob"... Should the comments I wrote in 2010 show up
         | as having been authored by "SpongeBob" or "bjourne"? I'd
         | strongly argue in favor of the latter since the former would
         | constitute falsifying history. The "change" in username should
         | be viewed as the creation of a new personae rather than a
         | "change".
        
       | theelous3 wrote:
       | Another counter example is when the thing you are recording
       | literally _is_ the natural key. Phone numbers are a good example
       | of this. There is no situation in which some data entry error
       | will break a number. That is the creation of a new distinct
       | (possibly) broken number. A phone number doesn't care about any
       | externality in the world. It is what it is.
        
       | banish-m4 wrote:
       | Use db-generated UUID "id" primary keys. Add a bigserial "order"
       | field to tables that require stable temporal ordering, and then
       | either create a view to "order by" it or always add "order by" to
       | stable-required statements.
        
         | TexanFeller wrote:
         | One downside is UUIDs take twice the space of a BIGINT which is
         | minor on the surface, but makes a huge difference when doing
         | lots of joins on that key. I used to see 3-4x difference in
         | some queries even using postgres' uuid type(not the string
         | column of fools). Doubling the space used by your keys also
         | means less of your indexes stay locked in RAM.
        
       | dwheeler wrote:
       | I've found that using a simple unique integer id as the key
       | usually works well. Integers are easy to explain and are simpler
       | to enter than uuids. There are always ttade-offs, but if there
       | are no other compelling issues, simplicity is best.
        
       | branko_d wrote:
       | The author seems to suggest that the choice is between natural
       | and surrogate key.
       | 
       | In fact, the choice is between natural and natural+surrogate key.
       | 
       | - If you have a natural key, you _have_ to enforce it, otherwise
       | you risk data corruption. The question is do you _also_ need a
       | surrogate key? Sometimes you do, sometimes you don't.
       | 
       | - If you don't have an obvious natural key, then your surrogate
       | becomes meaningful. You have to use something to distinguish
       | between two "equal but not identical" rows, so you end-up showing
       | the surrogate in the UI etc. In other words, it is no longer
       | "pure" surrogate.
        
       | JackSlateur wrote:
       | He is right: natural keys are difficult to handle
       | 
       | I'd just say this is not so right: As it turned out, though,
       | whoever made that piece of software knew what they were doing,
       | because the mechanic just changed the chassis number, and that
       | was that.
       | 
       | Because, yes, everybody can just reprogram your car and change
       | its VIN and stuff.
       | 
       | The manufacturer software have protection against that. But
       | nothing prevents you from using another software. At the end, you
       | can directly program whatever you want into the car : change the
       | VIN, change the odometer etc.
        
         | duped wrote:
         | > At the end, you can directly program whatever you want into
         | the car : change the VIN, change the odometer etc.
         | 
         | And you can wind up with an undriveable car this way.
         | Particularly the odometer if you roll it back and try and sell
         | it, your state's DMV will probably reject the title transfer.
        
       | mikl wrote:
       | If you use foreign keys in PostgreSQL, it is actually possible to
       | update natural keys with little trouble via the "ON UPDATE
       | CASCADE" mechanism on such keys.
        
       | DontchaKnowit wrote:
       | Yep. My last company had unique id columns for every table in the
       | DB. Everythibg worked flawlessly as far as associating data.
       | 
       | My current company uses natural keys all over the place and its a
       | total clusterfuck trying to associate data at times.
        
       | hosh wrote:
       | There's another gotcha I found out at my workplace -- for
       | Postgres, tables without a primary key has trouble when you want
       | to use logical replication. The times you might want to use
       | logical replication includes using AWS's database migration
       | service to shrink allocated storage.
        
       | cynicalsecurity wrote:
       | > From following the course curriculum, it'd be natural, however,
       | to define a key for the Restaurants table as the combination of
       | restaurantName, cityName, and year. The assumption is that name
       | and city uniquely identifies a restaurant.
       | 
       | That's a strange assumption.
        
       | teknopaul wrote:
       | Don't see the problem with ssn as I'd. People have two, get two
       | rows in your database. Not the end of the world often. database
       | it's a tool thus limitation is not an error. E.g. maybe trans
       | peeps want to forget their old ID, maybe not. your own key has no
       | chance of reflecting any reality in the world.
        
         | hathawsh wrote:
         | The numerous problems with SSNs were identified years ago.
         | 
         | https://www.gao.gov/products/gao-05-1016t
         | 
         | In plain text form:
         | 
         | https://www.gao.gov/assets/a112177.html
         | 
         | I remember when the school I went to changed from using SSNs
         | for all student records to using no SSNs at all. They had to
         | notify everyone about the changes, repeatedly. It was obviously
         | very expensive. Don't use SSNs as keys.
        
       | hot_gril wrote:
       | I don't get why relational DBs are usually taught around natural
       | keys. With MySQL/Postgres you almost always want bigserial PK.
        
       ___________________________________________________________________
       (page generated 2024-06-05 23:02 UTC)