[HN Gopher] You'll regret using natural keys
___________________________________________________________________
You'll regret using natural keys
Author : thunderbong
Score : 673 points
Date : 2024-06-05 01:25 UTC (21 hours ago)
(HTM) web link (blog.ploeh.dk)
(TXT) w3m dump (blog.ploeh.dk)
| duxup wrote:
| I just have some form of unique id field / synthetic key,
| everywhere.
|
| Even if just for documenting issues it makes life easier. "Table
| whatever id 12345 is the record in question."
|
| I've just seen data / relationships change too much too often in
| new and interesting ways to believe in using a natural key.
| ltbarcly3 wrote:
| Sometimes you will regret it, other times you won't.
|
| Notice that even if you believe that you should 'never use
| natural keys', that is NOT the same thing as 'always add a
| generated synthetic key to every table as the primary key'. You
| should _NOT_ always add a generated key to _every_ table, even if
| you use a trash ORM that really wants you to do this.
| aaronbrethorst wrote:
| That's quite the absolute statement to make without even one
| example.
| hansvm wrote:
| Maybe? Their statement was that you should "not always add,"
| as opposed to the much stronger "always not add."
| selcuka wrote:
| At least the latter provided some examples.
| Twisell wrote:
| Here's a real world example if you run a script once
| every 5 minutes that launch sub-task you might be tempted
| to add a non natural auto-increment number that identify
| each occurence to create a link between the script and
| the subtasks.
|
| However it will be way more painless to use a timestamp
| of the script starting point a natural key.
|
| This way when shit happen you have a relevant and stable
| way to identify each occurrence. And timestamp is easily
| indexable and ordered.
| aaronbrethorst wrote:
| Fast-forward a couple years: now I have scripts that
| launch sub-tasks more than once a second.
| yellowapple wrote:
| So use timestamps with sub-second precision, which
| virtually every SQL database supports (even ones like
| SQLite that don't have built-in date/time types).
| selcuka wrote:
| What about multiple scripts running in parallel?
| bruce511 wrote:
| If you want a timestamp, by all means add a timestamp
| field. And of course add a (non primary) key on the
| field.
|
| I would add a different surrogate field though for the
| primary key. Because I've -always- discovered edge cases
| which break my original model.
|
| In this situation I can think of 2. Firstly if the
| process triggers twice at the same time you have a
| problem (which happened to me in a case where a test
| suite fired up thousands of instances on a machine with
| many cores). Secondly twice a year with daylight savings
| (and when the owner of the system decided to change from
| local time to utc).
|
| Is there an actual natural kh out there that will never
| fail me? Probably. But I'm tired of looking for it.
| _flux wrote:
| How about:
|
| - Internet scanner that would use host IP as the primary key
|
| - A table connecting two other tables using their primary
| keys as the composite primary key (when there can only be
| zero to one such connections)
| ltbarcly3 wrote:
| I don't think you read it carefully.
| aaronbrethorst wrote:
| I did. I just hate encountering databases designed by
| people who take your advice to heart.
| Spivak wrote:
| Natural keys are great when they're someone else's synthetic
| keys. My Discord bots make liberal use of this.
| model: DiscordMessage key = discord_message_id
| key = discord_user_id key = discord_channel_id
| key = message_type model: UserData key =
| discord_user_id key = field_name model:
| GuildFeature key = discord_guild_id key =
| well_known_feature_name model: DiscordEvent
| key = discord_event_id key = discord_guild_id
|
| This is how I was always taught to use them so I'm kinda confused
| what the author is getting at. If the thing you're using as a key
| can change and that doesn't _define_ the row then you don 't have
| a key. If you're at a conference where everyone has a badge
| number then that's a great natural key when scanning people into
| your workshop. If you're at Disney and everyone has a magicband
| with ids then you can use that as a natural key all over the
| place when to you that's a visitor.
| fn-mote wrote:
| I think this is an excellent example of one of the pitfalls
| that the author is getting at.
|
| (Disclaimer: not a Discord bot programmer.)
|
| Suppose discord_user_id is an identifier like @Spivak#2024 that
| uniquely identifies the user. When Discord forces all users to
| change to unique ID's and eliminates the disambiguating numbers
| (or allows users to change them), then where does that leave
| you?
|
| (For the unfamiliar: something like this happened. Not sure if
| those are really the "discord user id" used in bot service
| though, but suppose that they are for the sake of the
| argument.)
| iaabtpbtpnn wrote:
| It probably leaves you in the same place you'd be with a
| synthetic key of your own: unable to track users across this
| key change without additional Discord data.
| foota wrote:
| Yes, but if you then reference that as a foreign key in
| some other table or system, then whatever migration you
| need to do will encompass those systems as well, whereas if
| you use a synthetic key from the start you wouldn't need to
| change them.
| jjmarr wrote:
| Discord's real user ID is a very lengthy number not visible
| to end-users nor used for other purposes.
|
| https://support.discord.com/hc/en-
| us/articles/206346498-Wher...
| tylersmith wrote:
| Which only reenforces the point that using what is exposed
| as the ID is not safe as a natural key.
| Spivak wrote:
| You're reading this wrong. The real ID (the Snowflake) is
| exposed in the API, and the UI if you turn dev tools on.
| That's the one I use. You don't use @username#1209 but
| 1247755691426542433.
| beeboobaa3 wrote:
| Discord user IDs are their numerical unique, unchangeable,
| id.
|
| https://support.discord.com/hc/en-
| us/articles/206346498-Wher...
| rsynnott wrote:
| Does Discord guarantee, _absolutely guarantee_, that they will
| never change their ids? This might be a safe practice if they
| explicitly claim that, and you trust them absolutely. Not
| otherwise.
| ray_v wrote:
| Feels like this could have used a few more solid examples up-
| front. I think another good example would be PlayStation Network
| using the natural key of "gamer tags" as the primary key to
| identify players would be a good example. Since this effectively
| locks players into having to keep a gamertag in order to uniquely
| identify them in the service -- instead of having a synthetic key
| that carries no meaning or data other than "uniqueness"
| robomc wrote:
| yeah the way he described it it's like... well who would ever
| do that. but foreign keying to emails or usernames is much
| easier to "accidentally" do and is a classic source of long-
| term headaches.
| jaggederest wrote:
| As a great example: My steam account name is the email I was
| using in 2003. I have largely not used it since then. The
| email on the account has been updated, but the account name?
| Stuck.
| jimbobthrowawy wrote:
| I was about to ask how you got past steamguard without
| email access. Makes sense if you were able to navigate off
| it.
| jaggederest wrote:
| Email was changed about 5 years before steamguard release
| :)
| mths wrote:
| Last I checked, Steam still has me logging in with my two
| decade old hotmail address as my account name. At least it's
| not something that shows publicly, I think.
| petepete wrote:
| Mine too, but with a British ISP that hasn't existed for
| fifteen years.
| t90fan wrote:
| ntl?
|
| me too!
| petepete wrote:
| That's the one! myinitials@ntlworld.com
|
| And NTL merged with Virgin in 2006, making it 18 years.
| Oh dear.
| nottorp wrote:
| I wouldn't swear it, but I think when I changed my Steam
| email the process didn't require access to the old email,
| just knowing the user and pass.
| zamadatix wrote:
| I'm able to log out and back into Steam using my plain
| username rather than email. The box actually says"enter
| account name" and wouldn't log me in if I tried to enter my
| email instead. I'm not sure if there are conditions to this
| e.g. my account was only created in 2009 and I'm not sure if
| I've ever toggled anything that "flipped it over" along the
| way either.
|
| On the API side I know Steam has something like 3 different
| forms of auto generated alphanumerical account ID and I
| imagine that's what everything is really keyed off of on the
| back end.
| bruce511 wrote:
| This is one of those cases where examples of natural keys
| failing are so ubiquitous that they are almost redundant.
|
| For the group who have forged a career with natural keys, and
| never regretted it, more power to you. Great.
|
| However to the rest of us, myself included, where ill-
| considered natural keys have caused endless opinions and
| suffering, my commiserations.
|
| If I could send back one piece of advice to junior-me it would
| be to avoid natural primary keys. (Ideally with the corollary
| to avoid sequences, but that's another thread for another day.)
| saghm wrote:
| On the other hand, Discord used to allow arbitrary usernames
| and would add a suffix for when you needed to disambiguate
| (e.g. if you used the username JohnDoe, your "id" would be
| something like JohnDoe#12345) but over the past year or so
| forced everyone to pick a fully unique username. In this case,
| the decision seemed to potentially be financially motivated,
| given that people with subscriptions were given priority for
| claiming usernames, and historically there really hasn't been
| much reason to pay for it in the first place.
| bux93 wrote:
| A phone company I was a customer of used my phone number as the
| customer Id. Which doesn't work great for people, or companies,
| with more than one phone number. They could also only provision
| one SIM per phone number, and not use a phone number tied to a
| SIM as a virtual number, and also the primary (customer Id)
| phone number had to be tied to a SIM.
|
| I ported my phone number away from them, but what if I ever go
| back? Will my old data be there? Including my old address? What
| if my phone number gets recycled, and some-one else gets that
| phone number and ports it to them? So many questions.
| davidhyde wrote:
| A natural key is what I would normally call unique index. Say
| first, surname and date of birth in an employee table as a bad
| (poor design) example. As opposed to a surrogate key like
| personId being an auto incrementing ID which is the norm since it
| can be easier to use when joining tables. I wish articles like
| these would explain terminology up front. I found it irritating
| to wade into the anecdotal trivia and not know what I was reading
| about and I'm very familiar with how databases work!
| bongodongobob wrote:
| One reason to not do that is because you could then figure out
| other users IDs by their hiring date. This ended up being a
| problem for me once with the auxiliary systems that relied on
| that ID, namely when it was used in links in other
| applications.
| tadfisher wrote:
| Another reason is sharding, in which case any non-random (to
| be more precise, non-uniformly distributed) bits are going to
| skew the partitions.
| eddd-ddde wrote:
| Why is knowing someone's ID an issue?
|
| They are meant for identification, if anything it should be a
| benefit that they are easily guessable.
| selcuka wrote:
| > Why is knowing someone's ID an issue?
|
| It increases the attack surface as an authorization
| vulnerability will allow an attacker to enumerate and
| access all records. Yes, it is security through obscurity,
| but a random (e.g. UUID) scheme makes it harder.
| kubanczyk wrote:
| ~128 bits worth of obscurity is considered real security
| for the time being. Assuming a cryptographically secure
| PRNG.
|
| Thats like guessing a password 18 ASCII chars long.
| bongodongobob wrote:
| Exactly, it's not that _my_ systems use security through
| obscurity, it 's the other ones mine ties into.
|
| This was years ago and you don't see it as much anymore,
| but think autogenerated links to shitty CRM, ticketing,
| and project management software where the link is the
| query aka - Blahsoftware.local/info/bunchofgarbage?=useri
| d+garbage+view+sensitiveinfo.html type stuff.
| daft_pink wrote:
| Anyone that's spent large amounts of time using vlookup in excel
| can totally relate. Mashing some fields together as a key works
| great until it doesn't.
| projektfu wrote:
| The devil's advocate cases:
|
| 1. When duplication occurs and goes unnoticed because the natural
| key isn't being used, and then perhaps the problem is "corrected"
| by an administrator in a way that doesn't make sense.
|
| For example, someone sets up an account with an street address A,
| then forgets they had it and sets up an account with street
| address B. They call and complain that they can't find an old
| order, or whatever, and the duplication is discovered. The
| administrator later clobbers one of the addresses but both are in
| use. A natural key (behaviorally speaking) may have presented the
| duplication, assuming address isn't part of the natural key. This
| can be satisfied by uniqueness constraints, etc.
|
| 2. When you are browsing the database, you see a bunch of
| synthetic keys and have to perform various joins to be able to
| see the relevant data. The synthetic keys make joins easier to
| write, but make may make ad-hoc queries more time-consuming and
| difficult.
| roenxi wrote:
| A synthetic key is adding an additional dataum - so any problem
| that could be solved with the original data can be solved by
| the new data too (worst case, ignore the primary key). That is
| one of the best reasons to add a proper index - there is almost
| literally no downside. We're talking maybe a few bytes/record
| and arguably a trivial amount of administrative overhead.
|
| Bearing that in mind:
|
| 1. Add a unique constraint on the natural key.
|
| 2. If the accuracy of joining by natural key is acceptable,
| join based on it. If that doesn't work, then a table design
| without a synthetic key wouldn't be fit for purpose either.
| kubanczyk wrote:
| In other words, don't mistake a primary key with a key or a
| constraint.
| azlev wrote:
| I think somewhat different than the article in some ways.
|
| The first example is why `UPDATE CASCADE` was implemented. So
| it's possible to use natural keys as identity without the fear of
| children table. At least in most databases it works.
|
| The drawback of enumeration is real, so if you expose this key
| you'll need some authentication/ authorization mecanism.
|
| Another good thing in natural keys is that you can eliminate part
| of the joins. You don't have to join the father table because the
| key in child table is known.
|
| I think the biggest challenge is how to map logins to people.
| It's very common to interpret both as the same, but they are not.
| arthens wrote:
| > The first example is why `UPDATE CASCADE` was implemented. So
| it's possible to use natural keys as identity without the fear
| of children table. At least in most databases it works.
|
| This is only true in a system using a single database, not
| replicating data in external services, and not offering APIs.
|
| And while it might work today when the service/website is still
| fairly small and self contained, the requirements might change
| at any time. So the title (You will [future] regret it) still
| applies to this solution
| RecycledEle wrote:
| For those who do not know, natural keys are when you have a
| primary key in a database table that is derived from the data
| itself.
|
| An example is using a car's chassis number as the key for the
| record describing that vehicle.
| rhelz wrote:
| Perhaps a better example would be a fingerprint or a retinal
| scan instead of social security numbers as natural keys for a
| person.
| parpfish wrote:
| There another reason not mentioned -- if your key is something
| like a UUID, it's very easy to define the logic for joining and
| filtering based on that key.
|
| If you were using some sort of string like an email address or
| username, you have to think about case sensitivity and trimming
| white space and all sorts of preprocessing and then make sure you
| do it consistently EVERYWHERE
| Cyph0n wrote:
| Ideally, you should aim to sanitize/normalize strings on the
| write side rather than resanitizing on every read.
| SoftTalker wrote:
| Yes but any kind of bug could later introduce tainted values.
| A foreign key contstraint might save you, but not always.
| parpfish wrote:
| That helps part of it, but there are still places for
| problems to pop up.
|
| - you need to do an adhoc query to look something up so you
| have to type in the key in a where clause
|
| - Or you used different sanitation methods in two different
| databases and you need to join things now
|
| - you try to join the table that had the sanitized email key
| to a different table that just so happens to have email but
| it wasn't sanitized because it was an optional field and not
| the key
| phatskat wrote:
| And then you get the other stuff like one-off migrations
| that are done by someone unaware of the business
| constraints, or a contractor who completely misses the ORM
| method, etc.
|
| These of course should be caught by checks and balances,
| and you can't count on anyone knowing what they need to
| know ten years after an implementation is done
| sgarland wrote:
| > you need to do an adhoc query to look something up so you
| have to type in the key in a where clause
|
| Having a surrogate key here doesn't help. A WHERE predicate
| can be rewritten as an INNER JOIN quite easily. Or you
| could use a subquery, or a CTE. Many options.
|
| The other problems discussed are an engineering culture
| problem. Either you value correctness and consistency or
| you don't.
| parpfish wrote:
| not sure how converting a WHERE to a JOIN helps when
| you're doing adhoc queries like this to look up records:
|
| `SELECT * FROM users WHERE email LIKE
| 'FirstName@domain.net'`
| sgarland wrote:
| You could also have CHECK constraints on those columns to
| ensure that nothing is written incorrectly, and/or a pre-
| write trigger that casts them to lowercase.
|
| Or you could make the email column case-insensitive, since
| it's generally accepted to be CI anyway.
| shkkmo wrote:
| Side note about using email addresses.
|
| It is pretty common to assume that email addresses aren't case
| sensitive since many email providers treat them that way. Email
| addresses absolutely are case sensitive so you need to preserve
| case when storing them and be careful about case when using
| them.
|
| This makes email addresses particularly unsuited to being used
| as a natural key. Most people treat them as case insensitive
| and you need to work around that, but you can't safely just
| treat them as case insensitive yourself.
| keepamovin wrote:
| The way I think of it is a key, in a database, is a _record_ key.
| It is intended to identify a particular logical record, not the
| logical _thing_ which is referred to by the record.
|
| An an footnote, I'll add this doesn't mean that you need to have
| complexities like record versioning, record history, or anything.
| But couched in those conceptual terms where those things are
| possible, is a happy and safe space to be. In this space, a
| database records entries, as if they were each a single paper
| form with boxes where you fill out, in pencil, to erase if you
| like or not, the particulars of the thing you're recording. This
| form comes pre-stamped with a number: the _record key_.
|
| In this cozy little world, you can be imperfect, and mistakenly
| (or deliberately, as your use case requires), file multiple slips
| that refer to the same _logical thing_ , and yet all have
| different file ("record") numbers - or keys.
| jklowden wrote:
| You think your surrogate key will save you? It will not. The
| world has an external reality that needs to be reflected in your
| database. If the unique identifier for your object -- VIN, CUSIP,
| whatever -- if it changes, the world will henceforth refer to it
| by both. You will need to track both. Adding a synthetic key only
| means you have to track all three. Plus you have to generate a
| meaningless number, which is actually a choke point in your data
| throughput.
|
| The natural key forces you to think about what makes the row
| unique. What _identifies_ it. Sometimes, it makes you go back to
| the SME and ask them what they mean. Sometimes it makes you
| reconsider time: it's unique now, but does it change over time,
| and does the database need to capture that? In short, what are
| the boundaries of the Closed World Assumption? You need to know
| that too, to answer any "not exists" question.
|
| To use our professor's car's example, we actually do not know the
| database design. It could well be that the original identifier
| remained the primary key, and the "new id" is entered as an
| alias. The ID is unique in the Car table, identifying the
| vehicle, and is not in the CarAlias table, where the aliases are
| unique.
|
| Oh, you say, but what if the bad old ID gets reused? Good
| question. Better question: how will the surrogate key protect
| you? It will not. The reused ID will be used to query the system.
| Without some distinguishing feature, perhaps date, it will serve
| up duplicates. The problem has to be handled, and the surrogate
| key is no defense.
|
| Model your data on the real world. Do not depend on spherical
| horses.
| kayodelycaon wrote:
| Of course they won't save you from external data. The whole
| point is for your system to have a way to identify rows
| internally so you can deal with external systems getting wonky
| without corrupting your own data.
|
| All of your concerns are easily solved by a unique index.
|
| Using external keys as a forcing function to prevent people
| from representing data wrong is not great.
| crabmusket wrote:
| The surrogate key uniquely identifies _a row in your database_
| , which is an entity just as real and significant as the car or
| the employee or what-have-you. Don't confuse the two!
|
| I agree with you that having a surrogate key isn't going to
| save you from the reasons why natural keys can be difficult.
| The complexity has to go somewhere. But _not_ having a unique
| identifier for each row is going to make things _extra_
| difficult.
| hobs wrote:
| The main thing is that the synthetic key should never leave
| the database and never be displayed in the app - if you want
| to have another key that represents the human oriented key do
| it, but it should be another field, an indexed field even,
| but one that has a lot less monotonic sequential properties
| that are inherent to synthetic database identifiers.
|
| You want to change that human key? Sure. You want to to
| complain that the keys are not sequential? Sure. You want to
| actually make them weird strings that are harmful to my
| brain? Why not? You want to update primary keys in the
| database? No. Absolutely not.
| The_Colonel wrote:
| I assume you hint at the security aspect of monotonic keys?
|
| I've found this issue a bit overblown. It's basically
| security by obscurity, which is a nice bonus, but not
| something your security model can be based on.
|
| I mean, it is a good practice to expose some kind of non-
| sequential key (e.g. UUIDv7), but it doesn't seem to me
| like a dealbreaker.
| crabmusket wrote:
| I love it when my competitors use sequential integer IDs.
| throwaway8486k wrote:
| Security by obscurity makes it more difficult for bad
| actors as an additional layer that they need to break. I
| don't see a reason why that's a bad thing and doesn't
| take a lot of effort to implement in this case.
|
| I have been in a startup where competitors used our
| sequential keys to scrape a list of customers. Lesson
| learned the hard way with actual business consequences!
|
| Sequential keys also leak information (German tank
| problem)
|
| Your competitors can estimate number of customers, your
| growth rate and other stats that you often don't want
| them to know.
|
| https://en.m.wikipedia.org/wiki/German_tank_problem
| The_Colonel wrote:
| I'm not saying it's a bad practice, the opposite
| actually.
|
| > I don't see a reason why that's a bad thing and doesn't
| take a lot of effort to implement in this case.
|
| True, if you start your application from scratch. Like if
| I started designing a new app today, I'd just choose the
| UUIDv7 for the primary key.
|
| It's not an easy thing to add into an existing
| application, though. I see applications leaking their
| internal IDs all the time, but usually it's not worth the
| effort to fix that, because it's a comparatively minor
| problem.
| throwaway8486k wrote:
| > I'm not saying it's a bad practice, the opposite
| actually
|
| I'm sorry that I didn't make that more clear. I saw that
| you mentioned it as a best practice and are aware of the
| advantages. It's just that there are so many others that
| don't have the balanced view as you seem to have.
|
| I have been involved in many discussions at my work place
| where "security by obscurity" is used as a way to shut
| down discussions. They changed their minds about
| sequential keys after the incident I mentioned, but it
| still has the power to "win" other discussions. Sure, we
| need to have rate limiting on ip-addresses, auth and
| other mechanisms, but they are not perfect and bugs
| happen all the time. An "unguessable" id is an additional
| security layer
|
| > It's not an easy thing to add into an existing
| application, though
|
| I agree, but there are ways to reduce the attack surface.
| You could add an extra "public id" field that can be used
| for lookup in addition to the existing id. In this way
| you can have a gradual migration where you go through
| each endpoint and migrate them individually without
| changing the foreign keys and other relations in the
| database (they would still use the sequential key). Maybe
| you end up not having time to migrate them all, but at
| least you can reduce the attack surface on the most
| sensitive endpoints.
|
| If you have low volume endpoints you could perhaps even
| simply add a mapping layer where you do a simple db
| lookup to replace the public key with the internal
| without changing the existing queries. You could even
| cache this easily to reduce the db load. (both ids are
| permanently fixed and can be cached forever).
| yellowapple wrote:
| > I have been in a startup where competitors used our
| sequential keys to scrape a list of customers.
|
| If your system allows customers to see each other (or
| worse: unauthenticated users to see customers) in this
| fashion in the first place then whether you're using a
| sequential integer v. a random UUID is the least of your
| problems.
| mgkimsal wrote:
| The 'customers' could be free tier users - a social media
| type system where everyone has a public profile -
| intended for the public - would still be scrapable by
| /profile/1, profile/2, etc. Doesn't necessarily require
| 'authentication' for the exposing of sequential integers
| to have a bad outcome.
| throwaway8486k wrote:
| You're right. The urls were public to be shared (think of
| marketing material / ecommerce), so there was not a
| security incident.
|
| But it did give our competitor free highly qualified
| leads that they could use to poach customers. This
| product was new to our customers, and we had spent a lot
| of time selling and convincing them that it was useful.
| inopinatus wrote:
| UUIDv7 may not be generated sequentially, but it is still
| temporally ordered and discloses a timestamp, which may
| be an undesirable leakage for some
| applications/environments. When obfuscation matters that
| much, use a UUIDv4 and eat the index performance hit.
|
| Some might suggest, "encrypt your object identifiers on
| the way in/out", but there's a ton of pitfalls involved
| since for most applications they are now rolling their
| own crypto, and it also makes identifiers much longer.
| The_Colonel wrote:
| Yes, you can go to great depths, but they each have
| trade-offs - in performance, increased complexity etc.
| and it's necessary to make a judgment for each particular
| app instead of applying the most overengineered solution
| everywhere.
| hobs wrote:
| Sort of, I have way more experience with clients saying
| that an invoice is missing (which turned out to be a
| rolled back transaction) and then I get to explain how
| transactions work, and then if the customer is smart
| enough they'll say something like "then why the hell is
| the invoice number part of that?"
| stouset wrote:
| Precisely. And if I have a surrogate key to identify "a car",
| _I_ get to define what makes a car unique for my purposes.
| Maybe that really is the VIN. Maybe today it's the VIN, but
| tomorrow it's whatever vehicle is currently being driven by
| its owner. Maybe it's something else.
|
| Some day I may need to track multiple VINs for a vehicle
| (maybe it's got parts from multiple VINs and I want to track
| that). I can still always decompose that table and have an
| n-to-1 relationship between cars and VINs _without migrating
| the rest of my data model_.
| hackit2 wrote:
| A entity can exist over more than one row in your database,
| but it is useful to uniquely identify each row as the lowest
| common denominator.
| mkleczek wrote:
| > The surrogate key uniquely identifies a row in your
| database, which is an entity just as real and significant as
| the car or the employee or what-have-you. Don't confuse the
| two!
|
| But the DBMS already maintains a row identifier (called rowid
| or ctid or whatever depending on the DBMS). Why do you need
| an explicit one?
| unchar1 wrote:
| It may be useful if you have data that originates from
| another source or if something outside of our system
| references your entity. In that case you need to keep some
| form of an externalRef, so it's usually easier to just use
| an id that you can control, for referencing both internally
| and externally.
| magicalhippo wrote:
| In the DB we use[1] the internal row id is not stable:
|
| _The value returned by the function is not necessarily
| constant between queries as various operations performed on
| the database may result in changes to the row identifiers
| of a table._
|
| _So, users should refrain from using the ROWID function in
| ordinary situations; retrieval by primary key value should
| be used instead._
|
| [1]: https://infocenter.sybase.com/help/index.jsp?topic=/co
| m.syba...
| simonw wrote:
| Be careful with those. SQLite has a rowid concept, but it's
| not guaranteed to be stable - running a VACUUM against a
| table can reassign the rowids for every row!
|
| https://sqlite.org/rowidtable.html says:
|
| > If the rowid is not aliased by INTEGER PRIMARY KEY then
| it is not persistent and might change. In particular the
| VACUUM command will change rowids for tables that do not
| declare an INTEGER PRIMARY KEY. Therefore, applications
| should not normally access the rowid directly, but instead
| use an INTEGER PRIMARY KEY.
| sgarland wrote:
| Because every DB can and will shift those as needed. They
| reference the physical location on disk for a given tuple.
| They are not meant for general consumption.
| polemic wrote:
| In the wise words of Patsy, "it's only a model".
|
| The real world is resistent to clean abstractions and
| abstractions are distressingly subject to change. What made
| your row unique today is quite likely to become non-unique in
| the days/months/years to come.
|
| Always use surrogate keys. Your future self will thank you.
| hehdhdjehehegwv wrote:
| And always kids, write code for one person, and one person
| only: your future self.
| M95D wrote:
| Yeah, fsck your co-workers and your replacement when you
| quit.
| hehdhdjehehegwv wrote:
| If you're good to your future self you're also good to
| them.
| tadfisher wrote:
| In that case, using this natural datum as an ID is a
| contradiction, because we are now removing the uniqueness
| constraint. It's fine to model the real world, but the real
| world also includes your data, and you may want to identify
| unique _records_ as it is not a universal truth that data can
| be corrected on entry.
| huygens6363 wrote:
| There is a model of a thing and there is a row that stores a
| representation of that model of a thing.
|
| They both are things. Ignoring the last one might be tempting,
| but it's not practical.
|
| Interestingly your own way of thought is applied, but now a
| level deeper again.
|
| How do you model a row? What makes it unique? A surrogate ID is
| the only sensible unique identifier for such a thing as there
| is no "natural key" that would make sense for instances of
| something so general as "Row".
|
| What you were saying amounts to "don't model the thing holding
| the model", but experience shows the thing holding the model is
| itself an (often unwilling) active part of systems.
|
| Someone here gave the example of wrongly entered PK's by
| administrative personnel handing live customers. That's IMO a
| good example of why you need an extra layer on top of your
| Actual Model(c). I can think of more.
| crabmusket wrote:
| > They both are things.
|
| Corollary: your app is part of the real world.
| TeMPOraL wrote:
| Furthermore: bugs in your app are part of the real world,
| so woe to those who used keys from your app as
| natural/external keys in their app.
| jtriangle wrote:
| >Model your data on the real world. Do not depend on spherical
| horses.
|
| This took me years to realize, and once I did things became
| much, much simpler.
| CrimsonRain wrote:
| Sorry. This is a very bad advice. I just had to fight tooth and
| nail to make my lead turn around from this disastrous decision.
| Using a lot of external IDs as our own row primary keys and
| then they get propagated to all other tables as foreign keys
| and what not. One day the foreign key chances or God forbid,
| the formatting changes in external systems, now we need to fix
| our whole database and all codes instead of a small isolated
| place.
|
| Generate your own unique keys for everything; add a few more
| unique constraints if needed. A bit more work but never a
| regret.
| aidos wrote:
| Yeesh. I once made the mistake of using an external ID as a
| primary key. What a day it was when they were changed on me.
| spoiler wrote:
| Can you share more about this? Wouldn't you run into the
| same problems if you used a surrogate pk? Without the
| nat/external pk and fk, you run the risk of having validity
| issues.
|
| Conversely, if the ID changes, isn't the friction what you
| want?
|
| I feel like optimising for unlikely edge cases instead of
| integrity because of a single incident is too reactionary.
|
| Writing a query or script to update a string, even for
| millions of rows, isn't that big of a deal?
|
| Obviously there _are_ cases where nat pks/fks are bad, but
| not all of them.
| mixmastamyk wrote:
| Instead of migrating one column, now you have that and
| every table with a foreign key to the original. Requiring
| transactions and possibly locking etc.
| naasking wrote:
| > I once made the mistake of using an external ID as a
| primary key. What a day it was when they were changed on
| me.
|
| I've kept with this advice for the most part, but I'm
| tempted in some cases to use the external id when there's
| some guarantee of stability and universality. Like 2 and 3
| digit ISO country codes.
| Piskvorrr wrote:
| Not that I'd get about 5 different ISO country code
| changes (with some flipflopping) just by sitting in this
| very same spot for a couple of decades. "Stability" in
| country codes, bah humbug.
| naasking wrote:
| Your geographical location's sorting into a country might
| not be stable, but I'm referring to the ISO codes that
| name the country, which should be relatively stable.
| devjab wrote:
| In terms of the Danish CPR that is mentioned here, the way we
| actually solved the challenge in the national architecture
| strategy was to define your social security number(s) as what
| we call an address on your person. I'm not sure why it was
| called an address, but it's basically a UUID and the
| information. Maybe it's because it was first used to store
| addresses?
|
| Anyway. Unless your system has not yet implemented the national
| strategies (which by now are approaching 24 years) then
| changing a Danish Social Security won't actually matter because
| it's just an added address to your person "object". So
| basically you'll now simply have two, and only which is marked
| as the active. Similar to how you've got an array of addresses
| in which you have lived but only one main address active.
|
| It did indeed cause a lot of issues historically because it was
| used as a natural key... though with it being based on dates,
| it was never really a good candidate for a key since you
| couldn't exactly make it into a number unless you had some way
| to deal with all the ones beginning with 0 being shorter than
| the rest. Anyway... it was used as a key, and it was stupid.
|
| Anyway I both agree and disagree with you. Because we've
| successfully modelled the real world with virtual keys, but you
| access it by giving any number of natural keys. Like... you can
| find the digital "me" by giving a system my CPR number, but you
| can also find me by giving a system my current address.
| Technically you could find me by giving a name, but good luck
| if you're dealing with a common name. There is a whole range of
| natural keys you can use to identify a "digital" person, but
| all of it leads from a natural key into a web of connected
| virtual keys.
|
| All of it is behind some serious gatekeeping security if you're
| worried. Or at least it's supposed to be.
| mrweasel wrote:
| It's been years since I worked on systems with CPR numbers,
| but I seem to recall that there is also a policy in place,
| stating that you are not allowed to use the CPR number as a
| "primary key". Many companies did anyway, because they never
| actually bothered to read the guidelines and regulations.
|
| All the systems I've seen always had the CPR as a lookup for
| a UUID, which as you say is the the "address" of the actual
| person object.
| paulddraper wrote:
| Terrible advice.
|
| Surrogate keys are keys are a layer of indirection.
|
| They don't fix all problems, but they fix some problems.
|
| Not least of which is performance. Often natural keys are
| character strings, whereas surrogate keys can be fixed size
| integers, saving index sizes on your FKs.
| mypalmike wrote:
| Conversely, certain queries can be much faster by using
| natural keys when the FK is all that you need in the result
| rather than additional fields in the primary table. In this
| case, the primary table doesn't need to be queried at all.
| This doesn't generally overcome the benefits of synthetic
| keys, but it is an optimization sometimes put into practice.
| crabmusket wrote:
| Aren't you literally describing an index?
| mypalmike wrote:
| No. Maybe an example helps. You have a users table with
| username as natural PK. You have an access table with
| timestamps of users hitting a service, with FK of
| username. If you query the access table for a range of
| timestamps and just want the usernames, they're right
| there in the results of the access table query. If you
| had instead used a synthetic user_id key, the db would
| have to do an additional lookup in the users table via a
| join for each user_id to get the usernames.
| crabmusket wrote:
| Oh of course, that makes sense. I'm not sure how I
| misunderstood that!
| mrkeen wrote:
| > surrogate keys can be fixed size integers
|
| This launches into the _other_ debate about PKs: using UUIDs
| rather than sequential keys.
| sgarland wrote:
| This is less of a debate, and more of an indicator of who
| has had to work with a DB at scale using UUIDv4 everywhere.
|
| Don't blow up your B+trees.
| mrkeen wrote:
| I guess.
|
| Either your system is happy enough to route every new
| entity through "one DB at scale" so it can let your "one
| DB at scale" be in charge of an auto-incrementing long,
| or it isn't.
| sgarland wrote:
| A common method is to have a small app (which can quite
| easily be HA) that hands out sequential chunks to each
| shard, interleaving and/or gapping as necessary.
| SPBS wrote:
| > You think your surrogate key will save you? It will not. The
| world has an external reality that needs to be reflected in
| your database.
|
| Yes, it will. It is precisely because of a messy external
| reality that you need an unchanging internal ID that is
| unaffected by changes to the external ID. If the software
| designer in the article had followed your advice, changing the
| chassis number would have likely resulted in broken car
| ownership records.
|
| Whoever is reading this in the future, please don't follow the
| parent comment's advice. Use surrogate/synthetic keys for your
| primary key.
| yashap wrote:
| This is bad advice.
|
| Say I have a user table, and the email is unique and required,
| and we don't let users update their email, and we don't have
| user deletion. If I'm going natural PK, I make email the
| primary key.
|
| But ... then we add the ability for users to update their
| email. But it should still be the same user! This is trivial if
| we have a surrogate primary key, a nightmare if we made email
| the natural primary key.
|
| Or building on that example, maybe at first we always require
| an email from our users. But later we also allow phone auth,
| and you just need an email OR a phone number. And later we add
| user name auth, SSO, etc. Again, all good with surrogate
| primary keys, a nightmare with natural primary keys.
|
| There are countless examples like this. You brought up cars,
| same thing with licence plates, for example. Or even Social
| Security Numbers/Social Insurance Numbers - in Canada SINs are
| generally permanent, but temporary residents can have their SIN
| change if they later become permanent residents, but they're
| still the same person.
|
| You want your entities to have stable identity, even if things
| you at one time thought gave them identity change. Surrogate
| primary keys do that, natural primary keys do not. Don't use
| natural primary keys, use surrogate primary keys with unique
| constraints/indexes.
|
| I challenge you to come up with a single plausible example
| where you're screwing yourself by choosing surrogate PK +
| unique constraints/indexes. Meanwhile there are endless
| examples where you're screwing yourself by choosing natural PK.
| maweki wrote:
| > I challenge you to come up with a single plausible example
|
| So I come from academia, but generally if you use a natural
| key as PK in a foreign key constraint it may be possible to
| express additional consistency criteria as CHECK-constraints
| in the referencing table.
|
| So this is a bad example, but say you have Name and Birthdate
| as your PK, and you have a second table where you have
| certain special offers sold to your customers and there is
| this special offer just for Virgos and Pisces, you could
| enforce that the birth date matches this special offer. Some
| modern systems also technically allow FK on alternate keys,
| so you could still do it that way, but database theory often
| ignores that.
|
| But second, while I agree that surrogate keys are often a
| good idea, I find your argument, that you must design for
| every conceivable change, not convincing.
| cjfd wrote:
| "for every conceivable change".
|
| That is not what he is arguing at all. He is showing that
| there are very many highly plausible changes that are
| problematic with natural keys. And he totally correct about
| that. Frankly, the fact that a post arguing for natural
| keys makes it the top of an HN comment thread is extremely
| weird. The original article is correct that natural keys
| are bad.
| zelphirkalt wrote:
| Why does anything need to be a primary key anywhere in
| order to enforce some constraint? At least from ORMs I know
| I can set for example any group of attributes unique. Other
| constraints can be implemented in some general method that
| is called when persisting in the actual database. Even if
| no ORM, you can write a wrapper around your persisting
| procedure.
| sgarland wrote:
| It doesn't, you're right. However, indexes aren't free,
| so if your data is such that a natural PK (composite or
| otherwise) makes sense, you'll save RAM. Also, for
| clustering RDBMS like MySQL, data is stored around the
| PK, so you can get some locality boosts depending on
| query patterns.
| a57721 wrote:
| > At least from ORMs I know I can set for example any
| group of attributes unique
|
| Just in case, ORMs send DDL statements with uniqueness
| constraints to the DBMS, they don't do any magic here.
| lstamour wrote:
| > Some modern systems also technically allow FK on
| alternate keys
|
| As far as I can tell, all modern systems allow it, as it is
| part of the SQL standard that foreign keys can be either
| primary keys or unique indexes. Here's a brief quotation
| from a copy of ISO/IEC 9075-2:1999 (not the latest version)
| that I randomly found online:
|
| > If the <referenced table and columns> specifies a
| <reference column list>, then the set of <column name>s
| contained in that <reference column list> shall be equal to
| the set of <column name>s contained in the <unique column
| list> of a unique constraint of the referenced table.
|
| So it mentions unique constraints first. Then afterward it
| says:
|
| > If the <referenced table and columns> does not specify a
| <reference column list>, then the table descriptor of the
| referenced table shall include a unique constraint that
| specifies PRIMARY KEY.
|
| If I'm reading this right, it means that in the base case,
| where you specify the column to reference, it can be any
| unique constraint, where a primary key is just another
| possible unique constraint (as all primary keys are by
| definition unique). And only if you don't specify the
| fields to reference does it then fall back to the primary
| key instead of a named unique constraint.
|
| I'm not disagreeing with you entirely - it's true that
| often there's an assumption in database theory that primary
| keys are natural and foreign keys are primary keys. But
| this isn't a hard requirement in practice or in theory, and
| it partly depends on the foreign key's purpose, why you
| need it in the first place. This StackOverflow answer also
| explains it well:
| https://softwareengineering.stackexchange.com/a/254566
|
| I should add that there is also a set of database design
| wisdom that suggests you should never use database
| constraints such as foreign keys, only app/api constraints,
| but that's a whole different tangent.
| sgarland wrote:
| > I should add that there is also a set of database
| design wisdom that suggests you should never use database
| constraints such as foreign keys, only app/api
| constraints, but that's a whole different tangent.
|
| That's less a DB design thought and more of a "devs with
| little formal training in RDBMS who only want to use it
| as a dumb store" thought.
|
| Use the DB to its strengths. CHECK constraints add an
| infinitesimal amount of overhead for writes, and
| guarantee that there will never be invalid data written.
| A bad code change could allow, however briefly, for that
| to occur.
| spoiler wrote:
| Another example is where you use a service that provides
| you with a stable id. It makes little sense to add a
| surrogate id and a fk on that surrogate id. It violates
| data quality and integrity just for a hypothetical
| situation.
|
| Data integrity/quality matters. Adding friction to prevent
| accidents also matters. I don't want something accidentally
| and trivially updating a field that's used to reference
| thing externally.
|
| Something about the nat key is about to change? Fine, we
| can write migrations. Even if it affects millions of rows,
| it's not a big deal.
|
| I understand people have been burnt by bad design decisions
| involving nat keys, but they're not some devil's key
| everyone here dogmatically makes them out to be. You can
| mess up using anything.
| maweki wrote:
| > a service that provides you with a stable id
|
| I think there's the important point. Is your key
| _actually natural_ or is it someone else 's surrogate key
| anyway? Going back to the vehicle identification number:
| that's already a surrogate key. You just did not assign
| it yourself.
| sgarland wrote:
| A VIN is not a surrogate key. A surrogate key must, by
| definition, have no semantic meaning, and not be composed
| of parts from multiple domains (among other
| requirements).
|
| A VIN encodes the following:
|
| * Country of origin
|
| * Manufacturer
|
| * Vehicle type
|
| * Vehicle model
|
| * Engine type
|
| * Transmission type
|
| * Model year
|
| * Manufacturing plant
|
| * Serial number
| shkkmo wrote:
| > It makes little sense to add a surrogate id and a fk on
| that surrogate id. It violates data quality and integrity
| just for a hypothetical situation.
|
| I would still almost always use an internal artificial
| key on top of the external id. If you want to enforce
| data integrity, you can still enforce uniqueness on the
| external id. "Stability" of an external identifier is
| almost always an assumption and one that I've seen fail
| enough times to want that internal id by default.
| knallfrosch wrote:
| > then we add the ability for users to update their email.
|
| At this point, you should verify the new email. At least
| until it is verified, you must track the old email. At this
| point, you realize you can now introduce a synthetic key and
| you're fine.
|
| Let's say you have a duplicate customer entry and the
| customer demands their accounts be merged. Now you can't
| identify the user by their key alone, since by definition,
| they can't be the same (yet.)
| epcoa wrote:
| > At this point, you realize you can now introduce a
| synthetic key and you're fine.
|
| Except for having to update all foreign references. Some of
| those may be external further complicating issues.
|
| Emails are often among the worst keys because they are not
| terribly stable and they are reusable often enough to burn
| you.
| jrs235 wrote:
| Also are emails case sensitive or not? In some systems
| (that you don't control mind you) they are and others
| they are not...
| sgarland wrote:
| Per RFC5321, the local part (before @) _may_ be case-
| sensitive, but in practice, it almost never is, and
| relying on case sensitivity is a recipe for disaster.
|
| The domain must always be case-insensitive.
| jrs235 wrote:
| My point being is if you're using email as a key you
| "have to" treat it as case sensitive even though for most
| it's not. And yes, I agree it will be a recipe for
| disaster.
| brigandish wrote:
| You require three fields (or four): email at registration, a
| date for that entry (together these create a natural key),
| and current email (this one not part of the key and
| editable).
|
| We're almost all the way to a Tag URI[0], so you could
| combine it with the user's name or username or any other
| identifier that fits the spec[1] (you could even use the
| website's own name) and you have a (definitely two thirds,
| probably 100%) natural key.
|
| It's stable over time and unique, easy to mint, and has a
| standard behind it. The user also gets to change their
| contact details without any problem related to the key.
|
| [0] https://taguri.org/
|
| [1] http://www.faqs.org/rfcs/rfc4151.html
| TeMPOraL wrote:
| Except you're encoding PII in the ID, which makes them
| plainly visible to people who should not have access to
| user data, and hard or impossible to change. Sure, I could
| e.g. change my e-mail and the contact data would be
| updated, but _you still have the old e-mail associated with
| my account_ via ID. I 'm not sure this would fly under
| GDPR.
| brigandish wrote:
| Erm, don't show the ID to people who don't need it.
|
| Aside from that, it's not a violation of GDPR to keep
| personal information (that they consented to you having)
| in order to process business for that person. Using an
| email address as a unique identifier is not a violation,
| using it to spam them would be. If they're willing to
| give you their current email why not an old one?
| lolinder wrote:
| > Erm, don't show the ID to people who don't need it.
|
| How do you communicate with other people in your company
| about a customer without sending around PII if the
| customer's ID is PII?
|
| Maybe we could create a field that uniquely identifies
| the customer that isn't PII. Then that could be used to
| uniquely identify a customer in places where we don't
| want to expose their PII. But then... why not just use
| this unique ID as the key?
| brigandish wrote:
| Do you often send the auto-incremented int (that would be
| the default substitute to this) when communicating with
| others? Then why would you send this?
|
| It's so strange an argument. Right now you have my
| username but not my email address, yet you can still
| query the website database and get certain data that
| you're allowed to see. There are so many ways to query a
| particular user's data, and they would all depend on what
| you're trying to do, needing the specific key would mean
| you should have access to it anyway _and_ it could be
| given on per case basis anyway.
| lolinder wrote:
| > Do you often send the auto-incremented int (that would
| be the default substitute to this) when communicating
| with others?
|
| It's not an int, but yes, we have a unique synthetic
| identifier that serves as the database PK and as a means
| of communicating about a customer in insecure channels
| without exposing PII. "Customer ID ### is having an issue
| with such-and-such."
|
| To turn your second part back around: why a natural key?
| What is the function of minting a natural key if humans
| are meant to use something else?
| brigandish wrote:
| > To turn your second part back around: why a natural
| key? What is the function of minting a natural key if
| humans are meant to use something else?
|
| Because non-natural keys are unnecessary in the presence
| of a natural key, and unnecessary things bring in
| complexity.
|
| > "Customer ID ### is having an issue with such-and-
| such."
|
| Then you need access to the customer's ID, but the devil
| here is in the detail you didn't add, the such-and-such.
|
| > communicating about a customer in insecure channels
|
| Use secure channels...
| lolinder wrote:
| > Use secure channels...
|
| When it comes to PII at my company, secure channels means
| "encrypted email only". No Slack, no Jira, no chat in
| video calls.
|
| That's just not feasible for 100% of communications.
| shkkmo wrote:
| > Because non-natural keys are unnecessary in the
| presence of a natural key, and unnecessary things bring
| in complexity.
|
| None of the things you've presented are actually
| "natural" keys, they are pieces of information that
| you've made assumptions about to shoehorn them into being
| usable as a "natural key".
|
| > Use secure channels...
|
| No channel is perfectly secure. As channels become more
| secure, they become harder to use and add complexity. The
| more places you store customer data, the more risk you
| create. The attempt to force semantic data to serve as
| "natural key" has now added risk and complexity to your
| entire communication infrastructure.
| shkkmo wrote:
| > Do you often send the auto-incremented int (that would
| be the default substitute to this) when communicating
| with others?
|
| Frequently yes. It is extremely common to communicate
| about specific records using the ID for that record. The
| fact that this sort of behavior is extremely common is
| pretty clearly indicated by the question itself.
|
| > There are so many ways to query a particular user's
| data, and they would all depend on what you're trying to
| do, needing the specific key would mean you should have
| access to it anyway
|
| A responsible organization at scale with limit and log
| access to customer data. I should be able to determine if
| two people are talking about the same customer record
| without needing access to that record's PII.
|
| It is much better to have an artificial key that is
| linked to this data. There is no upside to the natural
| key and many, many downsides.
| mewpmewp2 wrote:
| But now then if you want to expose a detail page for that
| user the id for identifying that page has to include all
| this potentially personal information about them?
|
| e.g. instead of mysocialmedia.com/users/2374927
|
| you would be showing
|
| mysocialmedia.com/users/email@example.com-2024-06-05-mysoci
| almedia.com
|
| Then exposing a lot of information that you may have not
| wanted to expose.
| sgarland wrote:
| You don't have to use the PK as the URL slug. Even if you
| want to route that way, you can have an internal ID and
| external ID. This is one way to use something random like
| a UUIDv4 for display without incurring the costs (at
| least, some of them) of having it as a PK.
| mewpmewp2 wrote:
| And then if you want to list other entities to that user
| you will have to start mapping the external id and
| foreign relationships every time to external users?
|
| And also if you are doing exception logging, for
| ids/primary keys there's higher odds of them being logged
| out, including your own logs and also external platforms.
|
| It feels like having primary key set up like this just
| will complicate everything unnecessarily for the future
| including many edge cases that you don't foresee.
|
| Just have the main ID not have any meaning.
|
| It shouldn't contain information about the date, it
| shouldn't be auto increment, it should really be just
| random.
| sgarland wrote:
| The solution I outlined is the one GitLab and PlanetScale
| both use internally, so it has been tested at scale and
| works well, for both Postgres (the former) and MySQL (the
| latter).
|
| > It shouldn't contain information about the date, it
| shouldn't be auto increment, it should really be just
| random.
|
| That's a great way to tank performance. You want your PK
| to be k-sortable.
| brigandish wrote:
| > And then if you want to list other entities to that
| user you will have to start mapping the external id and
| foreign relationships every time to external users?
|
| If we're talking about relational database engines,
| that's what they do, relate things. One join statement is
| much the same as another.
| arnorhs wrote:
| Another benefit of having stable identies / surrogate primary
| keys is that any relations (FKs) will be much simpler.
|
| Sure, like the post poster you replied to is pointing out,
| you _can_ use natural keys, and then also relying on dates or
| other parts of the data - but creating a relation for that
| can end up being extremely cumbersome.
|
| - Indexes generally become larger - relationships become
| harder to define and maintain - Harder for other developers
| to get up to speed on a project
| ourmandave wrote:
| _If I'm going natural PK, I make email the primary key._
|
| Welcome to the Mr. Cooper mortgage provider website.
|
| Your logon is your email and you can't change it.
|
| If you used your cable provider email you're stuck with them
| for the life of your 30 year mortgage.
| bradleyankrom wrote:
| Oh, also, we've been breached and your information is
| available for purchase on the dark web. Fun!
| nottorp wrote:
| Funny you should say that...
|
| I've been very slooowly degoogling myself, and that includes
| changing all logins that have a gmail address to a different
| non google email.
|
| I'd say only like 1/3 of the sites I made logins for have the
| option of changing the email.
| twobitshifter wrote:
| ON UPDATE CASCADE is not the nightmare that you are making it
| out to be. (My impression from the article is that this is a
| single SQL database being discussed.)
| mewpmewp2 wrote:
| > My impression from the article is that this is a single
| SQL database being discussed.
|
| Even if it's initially single, it's bad to assume that it
| will be so forever and that you are not going to use third
| party providers in the future.
|
| How well does ON UPDATE CASCADE work if there's millions of
| existing relations to that entity?
| twobitshifter wrote:
| YANGNI for 99% of projects and databases. When you get to
| global sharded nosql etc. you need to use UUIDs for
| anything and incrementing IDs falls over too.
| mewpmewp2 wrote:
| I'm using UUIds by default for everything. Main point
| being that I don't have to worry about future
| restrictions.
|
| And incrementing IDs are also problematic yes, since they
| hide business information data within them.
|
| And I do think that I need it for much more than 1% of
| projects and DBs.
| sgarland wrote:
| You'll have to worry about performance tanking instead.
| If you're using UUIDv7 then less so, but it's still (at
| best) 16 bytes, which is double that of even a BIGINT.
|
| Anyone who says UUIDs aren't a problem hasn't dealt with
| them at scale (or doesn't know what they're looking at,
| and just upsizes the hardware).
| chuckadams wrote:
| Most databases with a UUID type store them as 128-bit
| integers, typically the same as a BIGINT. It's not like
| 378562875682765 is the bit representation of a bigint
| either. And if you're not using uuidv7 or some other kind
| of cluster-friendly id, you'd best be using a hash index,
| and if you're doing neither, you probably don't care
| about their size or performance anyway. You don't pick
| UUIDs blindly, but on balance, they solve a lot more
| problems than they cause.
| sgarland wrote:
| Postgres' UUID type is 16 bytes. MySQL can store them as
| BINARY(16) once encoded. Conversely, a BIGINT for either
| is 8 bytes. Not sure about SQL Server or Oracle.
|
| > You don't pick UUIDs blindly, but on balance, they
| solve a lot more problems than they cause.
|
| IME, this is precisely the problem - devs choose them
| blindly, because then you don't have to think about
| proper modeling, you can arbitrarily create a key in your
| app and be nearly guaranteed of its uniqueness, etc.
| sgarland wrote:
| > incrementing IDs falls over too
|
| This is a myth. Planetscale [0] uses integers. They are
| assuredly at scale.
|
| As for auto-incrementing, there are plenty of solutions
| for distributed systems.
|
| [0]:
| https://github.com/planetscale/discussion/discussions/366
| sgarland wrote:
| I was going to say; this is a perfect use case for a
| cascading FK.
| jval43 wrote:
| It's never a single database in the real world. As soon as
| you integrate something or have an API to something the
| keys are out there. Unless you add a translation layer, but
| then you could just as well use surrogate keys directly.
| mst wrote:
| Using a surrogate UUID for communicating with the outside
| world is often very useful.
|
| This is true for an internal PK that's an auto-inc id as
| well as for natural keys, though.
|
| Using a natural PK -inside- your own database can still
| be a lot more pleasant to work with, even if you don't
| let it escape.
| shkkmo wrote:
| > Using a natural PK -inside- your own database can still
| be a lot more pleasant to work with
|
| Until you need to do anything like described above. The
| advantage of artificial keys is that they have no
| semantic content. Anything with semantic content carries
| the risk that the role that semantic content plays in
| your system can change and cause problems. Having a non-
| semantic identifier protects you from that.
|
| This is not to say that you should never use a semantic
| identifier as a key. However, you should always have a
| non-semantic artificial key as the identifier and use the
| semantic identifiers only when necessary.
| yashap wrote:
| Sure, if all data is in a single DB. But in the real world
| you've generally got some/all of:
|
| - 1 or more data warehouses
|
| - Other services storing said data (e.g. the user id will
| live in many databases in a service oriented architecture)
|
| - External API integrators who have their own data stores
| totally out of your control that also have copies of parts
| of your data
|
| - Job queues. It's common to have jobs scheduled for the
| future on some other system (Redis, etc.) that say "do X
| for user with id Y in Z days". If the "id" changes these
| fail
|
| - Caches. Say I cache data by user email instead of a
| surrogate key, user changes their email, and another user
| signs up with the old email. Cache hits for the wrong user!
|
| - etc.
|
| Changing the primary key becomes an absolute nightmare
| project once data lives in multiple places like this, and
| in my experience it mostly does.
|
| Having truly stable identity for your entities just solves
| so many future problems, and it's SO easy to do. In almost
| all cases, natural PKs are really all downside, virtually
| zero upside, except slightly less storage.
| bioneuralnet wrote:
| My first job in the late 2000's was at a small university
| with a home-grown ERP system originally written in the 80s
| (Informix-4GL). Student records, employee records,
| financials, asset tracking - everything. It used natural
| compound keys.
|
| Even worse than the verbose, repetitive, and error-prone
| conditions/joins was the few times when something big in the
| schema changed, requiring a new column be added to the
| compound key. We'd have to trawl through the codebase and add
| the new column to every query condition/join that used the
| compound key. It sucked.
| aswerty wrote:
| Surrogate keys do mirror reality though. As I once read in a
| Terry Pratchett book; if you replace the handle of an axe and
| then replace the head, is it still the same axe?
|
| For me, the answer is yes - since we imbue the axe with an
| identity outside of it's integral parts.
|
| That is what a surrogate key is. An identity. Which is an
| abstract concept that exists in the real world.
|
| And to pile on. The top comment is bad advice! Surrogate keys
| provide sanity - god save you if you have to work in a database
| solely using natural keys.
| vintermann wrote:
| Yes. Ultimately, "are these the same" and "are these
| different" are philosophical questions. Or to be more
| precise, _teleological_ questions.
|
| Because those questions have no meaning except with an "for
| our purposes here" added. And it's up to us to decide what we
| care about, if anything.
|
| If we care about what other people want with the data though,
| or suspect that our own wants might not be set in stone, then
| we should also care to model identity independently in the
| system (that is, use surrogate keys).
| fuzztester wrote:
| >As I once read in a Terry Pratchett book; if you replace the
| handle of an axe and then replace the head, is it still the
| same axe?
|
| Yes and no. It is the Axe of Theseus ;)
|
| https://en.m.wikipedia.org/wiki/Ship_of_Theseus
| aswerty wrote:
| Thanks - I think somewhere in the back of my mind I was
| also aware of the Ship. But Pratchett is too good. The
| quote:
|
| > This, milord, is my family's axe. We have owned it for
| almost nine hundred years, see. Of course, sometimes it
| needed a new blade. And sometimes it has required a new
| handle, new designs on the metalwork, a little refreshing
| of the ornamentation . . . but is this not the... axe of my
| family?
| mgkimsal wrote:
| Or Trigger's Broom:
|
| https://www.youtube.com/watch?v=BUl6PooveJE
| akira2501 wrote:
| > since we imbue the axe with an identity outside of it's
| integral parts.
|
| Typically for the purposes of ownership. So it's really part
| of a a hierarchical identity scheme.
| groestl wrote:
| Adding to the list of comments damning this post to ensure none
| of my future colleagues follow this advice.
|
| > You think your surrogate key will save you? It will not.
|
| It definitely will, say my 20+ years of experience.
| globular-toast wrote:
| How do you deal with the Ship of Theseus/Trigger's broom?
| There's literally nothing that defines said object apart from
| its _history_.
| zer00eyz wrote:
| Your database starts to look like git...
|
| Think about a jet engine. Let's say you figure out that a
| part is defective and will cause a failure. You want to
| identify every plane that has one of those parts in it. What
| if you find that some of them had a bad oil pump that
| shortened the life of some bearings. You want to know every
| engine that had one of those pumps, so you can replace other
| parts.
|
| I dont know if they do this with jets but there are quite a
| few places that take thee concept much further than this.
| smaudet wrote:
| > Your database starts to look like git...
|
| No, your table archives start to look like git. Which is
| not a bad thing, version control on data in a database is
| very difficult, if not impractical to realize, I would take
| any history that I can get...
|
| "Synthetic" keys are just that, keys for tables that only
| have meaning within the context of that data. You can of
| course relate your records to as many other tables as you
| wish, but you identify records via those keys, nothing
| more.
| RedShift1 wrote:
| Every time I used a natural key I have to come to regret it.
| th3byrdm4n wrote:
| When I integrate systems, I use that system's natural key (love
| it when it's a unique ID, but in the systems I work in - it
| almost never is).
|
| That said, I use that natural key as the "link" to my
| internally managed, normalized database.
|
| There's nothing that says I cannot add unique identifiers that
| would replicate the natural key. In fact, that's good design.
| hyperman1 wrote:
| In my experience, this won't end well. Some examples:
|
| Belgium has the RNR/INSZ identifying each person. But it can
| change for a lot of reasons: It gets reused if someone dies. It
| encodes birth date, sex, asylum state, so if something changes
| (which happens about every day), you need to adapt your unique
| key.
|
| Belgium also has a number identifying medical organizations.
| Until they ran out of numbers. Then someone decided to change
| the checksum algorithm, so the same number with a different
| checksum meant a different organizations. And of course they
| encode things in the number and reuse them, so the number isn't
| stable.
|
| An internal COBOL system had a number as unique key, and this
| being COBOL, they also ran out of numbers. This being COBOL, it
| was more easy to put characters in the fixed width record than
| expand it. And this being French COBOL, that character can have
| accents. So everyone using the 'number' as unique key now had
| to change their datatype to text and add unicode normalization.
| Not fun.
|
| In my experience: don't use anything as an ID that you didn't
| generate yourself. Make it an integer or UUID. Do not put any
| meaning in the ID. Then add a table (internal ID, registering
| entity, start date, end date or null, their key as text). You
| 'll still sometimes have duplicates and external ID updates as
| the real world is messy, but at least you have a chance to fix
| them. The overhead of that 1 lookup is negligable on any scale.
| zer00eyz wrote:
| > Model your data on the real world. Do not depend on spherical
| horses.
|
| Yes, and normalize it.
| https://en.wikipedia.org/wiki/Database_normalization
|
| > Adding a synthetic key only means you have to track all
| three. Plus you have to generate a meaningless number, which is
| actually a choke point in your data throughput.
|
| This is true up to a point. You can add more data to the system
| to continue to generate natural, composite keys. However at
| some point you move from a database to an event stream, or you
| have to track events that aren't really needed for what your
| doing...
|
| Denormalization then takes precedence and a generated key makes
| sense again. https://en.wikipedia.org/wiki/Denormalization
|
| > how will the surrogate key protect you?
|
| It isnt about protection, it's about not collecting the natural
| data to identify the event that caused the issue. Its
| denormalization by omission in effect.
| somerandomqaguy wrote:
| >The natural key forces you to think about what makes the row
| unique. What identifies it. Sometimes, it makes you go back to
| the SME and ask them what they mean. Sometimes it makes you
| reconsider time: it's unique now, but does it change over time,
| and does the database need to capture that? In short, what are
| the boundaries of the Closed World Assumption? You need to know
| that too, to answer any "not exists" question.
|
| Not really because your natural ID has to also account for the
| problem of garbage data and account for the SME's not actually
| being experts. And I can give a real world example of this
| happening; the Canadian Long Gun registry.
|
| For anyone that doesn't know, prior to the early mid 90's or so
| Canadian gun laws only required registration of pistols. I
| might be incorrectly remembering but IIRC the records were not
| handled at a national level either but I could be wrong. Around
| that time new laws were introduced that among other things
| required registration.
|
| Most guns by then had serial numbers so all you had to do was
| tie a gun to a serial number with some characteristics and
| voila, you've got a natural identifier, right? That's what the
| experts say.
|
| Well as it turns out, reality isn't quite so kind. A lot of
| firearms makers such Cooey from around the 1940's to the 1970's
| produced guns without a serial number. In other cases the
| serial number was present from the factory but were damaged, or
| the part that had thee number had been replaced without
| replacing the number or had been replaced with a wrong number.
| In rare cases the serial number from the factory was wrong
| because of a mistake when the worker manually stamped in the
| number.
|
| So already the idea to uniquely identify using some sort of
| simple classifier was already flawed. They attempted to solve
| the issue of guns without serial numbers but those stickers
| were cheaply made and readily fell off, and owners that were
| already peeved about the program to not bother with trying to
| paperwork correct.
|
| Which segways to the next problem. There was an extremely high
| rate of errors in registration forms being submitted. The most
| famous example I'm aware of was someone registering a Black and
| Decker soldering gun as a firearm, something he had done in
| protest. As humorous as it was, it the revelation that a
| soldering gun had been classified as a firearm unveiled another
| fundamental problem.
|
| The error rate was so high, and the pressure to show progress
| so great, that the someone in leadership (I can't remember if
| it was the government or RCMP) directed the data entry clerks
| to just plug the data in with no validation as is. Didn't
| matter if the data was wrong, or made no sense, or contradicted
| other entries already in the database. The intent being to just
| get all the data in as is so that they could fix it later. So
| all that wrong information? Got pushed straight into the
| database.
|
| Like I said, this was real world mess that occurred from 1995
| until 2012 when a new government dropped the requirement for
| non restricted firearms to be registered with little fanfare
| and only squeaking protests.
|
| It's not to say that you shouldn't think about a 'natural key'
| persay. But the problem assuming that there is a 'natural key'
| requires that you or your subject matter expert is actually an
| expert that can identify a good enough model for that to exist.
|
| But what happens when your SME is just plain wrong and it in
| turn introduces fundamental flaws in your model? Or outside
| influences forces garbage data in? How is a database designed
| to model only the real world supposed to cope with that?
| ratherbefuddled wrote:
| > how will the surrogate key protect you? It will not.
|
| Yes it will. Your changes will be confined to only the table(s)
| where the natural key is present, not spread across every table
| where there's a foreign key.
|
| Of course you will still have to deal with the reality that the
| natural key is now not unique, and model reality, but your
| implementation work in doing so is far simpler.
|
| In more years than I care to count I've regretted someone using
| natural keys as a primary key for a table many times, and
| surrogates never.
| wruza wrote:
| In short, you advise us to foresee the future, explore unknown
| unknowns and expect high-precision true answers from the
| outside. Good advice, not for this universe. You can only get
| false negatives in this one.
|
| A synthetic key means "we think exists". There exists a
| contract, a medical record, a person, in a real world, in our
| opinion. We record these existence identities into our model by
| using an always-unique key. Then there's a date, an appointment
| #, a name, etc. You can refer to an entity by its identity, or
| search by its attributes. If you use searches in place of
| identity references, you get non-singletons eventually and your
| singleton-based logic breaks.
| datadrivenangel wrote:
| This is why it's hard to be a DBA... Everyone thinks you're a
| Cassandra user the way other developers ignore your
| prophecies.
| DrScientist wrote:
| > The natural key forces you to think about what makes the row
| unique. What identifies it.
|
| When designing a table - you should always be clear about what
| the natural key is and make sure it has uniqueness constraint.
| Mindlessly having a surrogate key without thinking about what
| the natural key is, is an anti-pattern. So totally agree here.
|
| That doesn't stop you also having a surrogate key though.
|
| Another aspect of natural versus surrogate keys is joins as the
| key often ends up in both tables.
|
| Using natural keys can mean in some circumstances you can avoid
| a join to get the information you want - as it's replicated
| between tables.
|
| There is also the question of whether you surface the surrogate
| key in the application layer or not - some of the problems of
| surrogate keys can be avoided by keeping them contained within
| the app server logic and not part of the UI.
|
| So via the UI - you'll search by car registration number, not
| surrogate key, but in terms of your database schema you don't
| join the tables on registration number - you use a surrogate
| key to make registration numbers easier to update.
| thefz wrote:
| Been vehemently against surrogates my whole life, glad to find
| a kindred soul.
| imtringued wrote:
| >Plus you have to generate a meaningless number, which is
| actually a choke point in your data throughput.
|
| No it isn't. Working with natural keys in general involves
| using compound primary keys since it is unlikely that any lone
| field is suitable as primary key. Comparing an integer is
| quick. Comparing three string fields in a join is not.
| abujazar wrote:
| Natural keys can change, and synthetic keys never have to. That
| alone is reason enough to use synthetic keys.
|
| Performance is another major argument for synthetic keys, as
| they are can be made sequential, which is rarely the case for
| natural keys.
| IlliOnato wrote:
| Sorry, by your logic an ISSN would be a good key for a database
| of scientific journals. It's exactly what ISSN is invented for!
| Right? Right?
|
| Been there, done that. Journals that changed their names (and
| identities) but not ISSN. That changed the ISSN but not the
| name/identity. Journal mergers which instead of obtaining a new
| ISSN kept one of the old ones. "Predatory journals" that
| "borrow" an ISSN (you may not consider them real journals, but
| you've got to track them anyway, even if only to keep them from
| being added to the "main" database). The list may go on and on.
|
| And don't even start me on using even more natural ID, the
| journal name, perhaps in combination with some other pieces of
| data, like year the publication started, country of origin,
| language, etc... Any scheme based on this will need to have
| caveats after caveats.
|
| (A fun fact: there were journals that ceased publication but
| later on "returned from the dead". Such resurrected journals
| are supposed to be new journals and to get a new ISSN.
| Sometimes this rule is followed...)
|
| At the end, a "meaningless number" you assign yourself is the
| only ID that reliably works (in combination with fields
| representing relationships between journals).
|
| The problem with keys that "have meaning" is that they appear
| to carry information about your entity. And in vast majority of
| cases this is correct information! So it's almost impossible to
| resist "extracting" this information from a key without doing
| actual database lookup at least mentally, and often in your
| software too. Hidden assumptions like this lead to bugs that
| are really hard to eliminate. A meaningless number on the other
| hand does not tempt one :-)
| sgarland wrote:
| Your post brings up a critical difference I've noticed when
| working with devs (I'm a DBRE): those who actually do rigorous
| data modeling, and those who view it as an annoyance impeding
| their project.
|
| Spend time modeling your schema. Ask yourself, "does every
| attribute in this table directly relate to the primary key? And
| is every attribute reliant upon the primary key?" Those two
| alone will get you most of the way through normalization.
| indoordin0saur wrote:
| I agree with both the original author's thoughts and yours.
| Surrogate keys are generally necessary but also overrated.
| Natural keys will always remain important. I think it's
| important to keep natural keys as reliable as possible whenever
| you control the system. Creating a new surrogate key every time
| the data moves creates it's own headache as you have to keep
| track of more and more keys referring to the same concept.
| mythas wrote:
| Another example that happens surprisingly often in healthcare. A
| registration clerk will incorrectly enter a personal health
| number (PHN) into the system. Then the actuall holder of that PHN
| shows up. If this were the PK then the system just wouldn't
| handle this case and the reg clerk would have a huge mess to sort
| out on the spot. A surrogate key on the Person table allows this
| registration to be made where 2 people have the same PHN in the
| system. Then cleanup can be handled after the fact to track down
| the first person, determine their correct PHN and update the
| record.
| jen20 wrote:
| This sounds like it should probably be a workflow instead.
| Modelling the intent here is actually important.
| mythas wrote:
| Care to expand on this thought? Curious what you have in
| mind!
| ttyprintk wrote:
| (not parent)
|
| Think about these different intentions, each deserving an
| audit trail with different attached metadata:
|
| - brand new MRN for a newborn. The system should be able to
| provide info on the mother, at all points of care for the
| newborn.
|
| - unknown person arrives in the emergency department
| unconscious. You'll see imaging named John Doe in this
| case. The system should be able to retrieve info if and
| when identity is established.
|
| - the machine-readable bracelet given at admission needs to
| be replaced. This is really two different cases: you intend
| to create a duplicate bracelet, or to correct an error in
| which all current info in the system needs to abruptly
| update.
| ethbr1 wrote:
| One of the most surprising things I learned about the US
| healthcare system is how often everything is mutating or being
| retroactively updated (assume it applies to other countries
| too).
|
| Things you'd think would be constant after Step 1 often aren't,
| and processes are tolerant of their being corrected / re-
| entered after Step 52 has already been completed.
| romdev wrote:
| Finding and eliminating duplicates is a very common software
| problem that is rarely solved in a reusable, user-friendly way
| that preserves history while eliminating redundant data. In
| fact, in 40 years of working with computers I can't think of a
| single UI that I'd want to emulate.
| Terr_ wrote:
| Ultimately the key is the business-modeling that captures
| "duplicates" as Things That Happen.
|
| That's the precondition for any sane UI, and sometimes it's
| not even obvious because the "duplication" has been
| transformed, reified into its own concept.
| Kon-Peki wrote:
| > how about a personal identification number? In Denmark we have
| the CPR number, and I understand that the US Social Security
| Number is vaguely analogous.
|
| The US SSN is not guaranteed to be unique, the SSN assigned to a
| person could change, there is no guarantee that a person with an
| SSN assigned to them is a US citizen, and there is no guarantee
| that a US citizen has an SSN - they must be requested, and you
| don't need one unless you do something that requires having one.
| There are also things called ITINs and ATINs that look like an
| SSN but are not, yet can be used in place of an SSN in a huge
| range of SSN-required situations!
|
| (Please don't use the SSN as a database key!)
| selcuka wrote:
| > The US SSN is not guaranteed to be unique
|
| The cases you listed do _not_ mean SSNs are not unique, unless
| there are people who share the same SSN. You can still define a
| unique index for the SSN column. A column can be both nullable
| and unique as each null is different in SQL.
| ars wrote:
| > unless there are people who share the same SSN
|
| That happens all the time. First of all people steal SSN's
| and use them (and you are not the police, so it's not your
| responsibility to do anything about that). Second people make
| up fake SSN's because they don't want to give you their SSN.
|
| People also make typo's, and you can end up with the same
| SSN.
|
| An SSN is not unique in the real world.
| Kon-Peki wrote:
| EDIT - SSA claims that numbers are not recycled. But there
| are known cases where the same number has been assigned to
| multiple people.
|
| Note that in less than 100 years, more than half of all
| possible SSNs have already been used...
| bloopernova wrote:
| So we need SSNv6? ;)
| dqv wrote:
| > It's not as uncommon as you might think. In fact, some 40
| million SSNs are associated with multiple people, according
| to a 2010 study by ID Analytics.
|
| https://www.pcworld.com/article/424392/a-tale-of-two-
| women-s...
| pseudosavant wrote:
| Funny enough, I used to work at IDA.
|
| I'll add that there is a huge difference between the SSN
| database that the Social Security Administration maintains,
| and the list of SSNs that have been associated with a
| person. Especially because it is very common to change a
| single digit of your SSN when performing credit fraud -
| because they've already burned their real one. Some people
| will have dozens of SSNs attached to them.
|
| IDA was very good at determining who a person is through
| the graphs that represent identities in our world (names,
| DOBs, phone numbers, addresses, SSNs, etc.)
| cmiles74 wrote:
| People also type them in wrong, so there's another thing:
| where is the data coming from.
| Tangurena2 wrote:
| > _The most misused SSN of all time was (078-05-1120). In
| 1938, wallet manufacturer the E. H. Ferree company in
| Lockport, New York decided to promote its product by
| showing how a Social Security card would fit into its
| wallets. A sample card, used for display purposes, was
| inserted in each wallet. Company Vice President and
| Treasurer Douglas Patterson thought it would be a clever
| idea to use the actual SSN of his secretary, Mrs. Hilda
| Schrader Whitcher._
|
| > _The wallet was sold by Woolworth stores and other
| department stores all over the country. Even though the
| card was only half the size of a real card, was printed all
| in red, and had the word "specimen" written across the
| face, many purchasers of the wallet adopted the SSN as
| their own. In the peak year of 1943, 5,755 people were
| using Hilda's number. SSA acted to eliminate the problem by
| voiding the number and publicizing that it was incorrect to
| use it. (Mrs. Whitcher was given a new number.) However,
| the number continued to be used for many years. In all,
| over 40,000 people reported this as their SSN. As late as
| 1977, 12 people were found to still be using the SSN
| "issued by Woolworth."_
|
| https://www.ssa.gov/history/ssn/misused.html
|
| As late as the 1970s, the first 3 digits of your SSN told
| what office issued your number to you. And the next 2
| digits told workers at that office what filing cabinet held
| your application.
|
| https://www.ssa.gov/policy/docs/ssb/v45n11/v45n11p29.pdf
| otherme123 wrote:
| In Spain we have the DNI number, that a lot of people asume is
| unique, even database designers that use is as a natural key.
|
| Turns out the DNI can have, and actually have, a lot of
| duplicates. The police has a page explaining it
| (https://citapreviadnipasaporte.es/dni/dni-duplicados-espana/),
| and how it's not a primary key in their databases, but a number
| entered manually from a pool of possible numbers. And number
| re-using is a possibility. They estimate the number of
| duplicates in 200,000 for a population of 50,000,000.
|
| The point is that if you asume DNIs are unique and use them as
| PK your database is exposed to the bad design of the DNI
| database. There are some stores that use the DNI as the
| "unique" identifier for fidelity cards.
| prmoustache wrote:
| I've seen banks or insurers use DNI as user login.
| benhurmarcel wrote:
| All of them do I think. But from experience, they can
| change it for your account.
| SoftTalker wrote:
| The user login (hopefully) is not the internal primary key
| for the user. It should be unique at a given point in time,
| obviously, but certainly there are reasons it might need to
| change.
| Piskvorrr wrote:
| And I've been bitten by this design numerous times. The
| most common bug: literally the birthday paradox, i.e.
| "these two people have been assigned the same number, and
| now we need to distinguish them in our database".
| benhurmarcel wrote:
| Also your DNI can change. Typically foreigners get a NIE
| (used for the same thing but a different format), and get a
| new DNI if they ever get Spanish nationality.
| steinsgatezero wrote:
| I remember having to deal with this and other identifiers
| like NIF and NIE while working on the academic titles
| homologation platform for the MCIN. I didn't understand why
| it wasn't used as, not necessarily PK, but an identifier
| logging in. Thanks for letting me know that the hellish time
| spent integrating Cl@ve wasn't in vain.
| delusional wrote:
| We have the same problem in Denmark, most people just don't
| realize it. At my dayjob we get at least one person every year
| who changes gender and consequently gets a new SSN (the final
| digit is supposed to signify gender). Most people don't store
| SSNs so they never realize, but it does happen fairly
| frequently.
| winternewt wrote:
| If Denmark is anything like Sweden there's also:
|
| - SSN from different ID space gets assigned to immigrants;
| when they become citizens they are assigned a new permanent
| SSN.
|
| - SSN:s have a long and a short form; the short form which
| cuts off century information can be the same for someone who
| is 5 years old and someone who is 105 years old.
|
| - When an unconscious patient comes in to the E.R. you don't
| know their SSN, so a temporary one is assigned for use in
| patient records. Such temporary SSN:s are not coordinated
| nation-wide so multiple patients may have the same SSN. In
| some hospitals they don't even have a local standard for
| ID:s. The staff just makes something up on the spot. It
| happens that the SSN they made up collides with a valid SSN
| for another person.
| kassner wrote:
| Also: the personnummer carries a date that often means
| birth date, but there are cases where it's not, but I've
| seen a few system that just assumes it's the same.
|
| > SSN from different ID space gets assigned to immigrants;
| when they become citizens they are assigned a new permanent
| SSN
|
| Having gone through that, my personummer didn't change.
| Maybe that doesn't happen anymore?
| tapland wrote:
| It does, but if you already have a Personnummer or get a
| residence permit right away so that you are eligible for
| one you don't get the temporary Samordningsnummer.
| kalleboo wrote:
| And the samordningsnummer isn't only for immigrants -
| it's also for Swedes born abroad who have never been
| folkbokford.
| tapland wrote:
| They're still still an immigrant. The numbers are for
| residents (at some point in time) and citizenship isn't
| reflected by it.
| kalleboo wrote:
| Interesting, I've never heard of anyone referring to a
| citizen as an immigrant, but looking up the legal
| definition, they certainly are. TIL.
| mrweasel wrote:
| Immigrants and refugees will get a "replacement CPR
| number", which is outside the normal space/range of CPR
| numbers. Once registered as living in Denmark, they'll get
| a real CPR number.
|
| Danish SSNs doesn't have a long or short form, you used the
| seventh digit to do a table lookup to see if the person is
| born in 18XY, 19XY or 20XY. The date of birth is always
| ddmmyy, there is no long form. So if the seventh digit is
| 9, and the 5-6 digit is between 00 and 36, then you're born
| in between 2000 and 2036, if the 5-6 digits are 37 to 99,
| then your born in the 1900. But you need the published
| table to figure that out.
|
| Last point, there is backup system for unconscious
| patients, but it should be the same across all medical
| records as these are somewhat standardized.
| delusional wrote:
| > - SSN:s have a long and a short form; the short form
| which cuts off century information can be the same for
| someone who is 5 years old and someone who is 105 years
| old.
|
| We don't do that. instead, we shove that extra bit of
| information into the digit following the last two digits
| through a table: https://da.wikipedia.org/wiki/CPR-
| nummer#Under_eller_over_10...
| rsynnott wrote:
| In Ireland, until the 90s, if a woman got married, she gave
| up her PPS number (essentially a social security number) and
| took her husband's, with a 'W' appended (this fact tells you
| a lot about pre-90s Ireland...) If she subsequently divorced
| or the husband died, she got a _new_ PPS number.
|
| While this was abolished in the 90s, _some people still have
| these 'W' numbers_, ticking timebombs for anyone relying on
| them as a key.
| mikedelfino wrote:
| How to uniquely identify an American citizen?
| hprotagonist wrote:
| ha ha only serious; ask them for their papers and if they say
| "who the fuck are you?", they're americans!
|
| a slightly more serious answer: you largely don't. The US
| doesn't have a national ID, proof of birth is not even close
| to standardized, etc.
| Longhanks wrote:
| How about not doing that?
| mikedelfino wrote:
| Then I guess it could lead to duplicate records. Now,
| whether this is a problem or not depends on the business
| and how much one cares about data integrity. I work with
| higher education, so uniquely identifying students and
| keeping all their records organized is somewhat important.
| Granted, I don't work in the US, and here we have a unique,
| national number. So this is covered, except for foreign
| students. I was curious how this problem is solved in other
| countries.
| Kon-Peki wrote:
| The US government doesn't assume that it can uniquely
| identify a citizen with 100% certainty. When things get
| tricky, we rely on the judicial system to weigh the
| evidence and make a decision. Which could later get
| changed.
|
| If you want to design your system to assume that you can
| do it, that's your problem. Literally, it is now _your_
| problem, and nobody is going to step in to help.
| j16sdiz wrote:
| You can't.
|
| That's why everybody use SSN.
| vharuck wrote:
| The population health database I work with uses generated IDs
| for people and relies on health systems to link records
| before submission (they put a lot of effort into data used
| for billing). But we do check for problems by looking at
| duplicates of SSN with date of birth. We don't consider
| names, because they could be transcribed differently. Even
| though the SSN+DOB pair should be unique, there have been
| cases where a widow provides her deceased husband's SSN.
| Likely because she used it for Social Security benefits and
| forgot her own long ago.
| polygotdomain wrote:
| To give a slightly more technical answer, at a large
| insurance company I used to work for, the legal department
| had provided definitions on what conditions we would consider
| various components of PII a match. So between SSN, DoB, First
| Name, Last Name, and a couple others, there were potential
| combinations that our system would say, "yes this is the same
| person". Note that we didn't necessarily need exact matches
| on things like names, but "close enough" matches were
| sometimes sufficient.
| gwbas1c wrote:
| You can get kind-of close with driver's license numbers. They
| will change when US citizens move among states, (and we do
| that quite frequently.)
|
| (People who don't drive have to get state-issued IDs that
| have the same number. At least, they have to if they want to
| buy do anything that requires proof of identification.)
| Tangurena2 wrote:
| You cannot.
|
| I used to work for one state's Department of Motor Vehicles.
|
| Part of why REAL ID is such a charley foxtrot is because
| there exists no such identifier and since every state counts
| approximately as a sort-of-almost sovereign nation (this is
| basically what the 10th Amendment guarantees), there is and
| can be no possible nationwide citizen identifier.
|
| There are about 3500 different counties in the US. All of
| which may or may not issue their own birth certificates.
| States are supposed to do that now.
| pradn wrote:
| Pernicious assumptions!
|
| Google Cloud projects have three attributes: user-friendly
| names, system numbers, and system names. System names are
| alphanumeric. They can be chosen by the user, derived from the
| friendly name if there's no collision.
|
| But! There's some system names from the olden days that are
| actually all numbers - so not actually alpha-and-numeric.
| Thankfully we don't run into those often.
| ajuc wrote:
| Yeah, these things are almost never as simple as they are
| supposed to be.
|
| Poland has PESEL numbers since 70s. It was supposed to be
| unique, only apply to Polish citizens, never change, and have a
| checksum digit. Every Polish citizen gets one at 18 when they
| get their national ID document, and you can request it earlier
| if you want to.
|
| Turns out there are duplicated PESEL numbers. A LOT of non-
| Polish citizens have them assigned (mostly Ukrainian refuges
| but not only). The checksums are sometimes wrong. And some
| people have several PESEL numbers.
|
| If you used PESEL as database key you're fucked.
|
| The system works perfectly, but it interfaces with external
| world through computer-human-paper-human-computer interface.
| And at some point the mistake propagates so far that it becomes
| the truth assumptions be damned.
| rhelz wrote:
| Natural keys are (quite literally) essential to defining
| entities. Quine had a great slogan for this: "No entity without
| identity!" Natural keys are how you determine the identity of an
| object, which is to say, if you have two different referring
| expressions, how can you tell whether they are referring to _the
| same_ object, or different objects?
|
| Suppose you take the advice of this article, and use, say, social
| security numbers to identify people. (Let's ignore the fact that
| this only works for the U.S.) Suppose one fine day, somebody
| tries to enter in a record about a new person--but some typo has
| happened somewhere, and the new person's SSN clashes with
| somebody's SSN who is already in the system.
|
| Sure, the database will notify you that something went wrong. But
| how do you know which social security number is correct, and
| which is incorrect? You will have to find a set of fields which
| uniquely identifies each person _ANYWAYS_. I.e. you 'll have to
| differentiate them by name, birthday, place of birth, etc etc.
|
| Now even worse!!! What if somebody attempts to add the same
| person TWICE to your database, but mistypes their social security
| number. Now your database can't even tell you that something has
| gone wrong. It will happily record duplicate or contradictory
| information about the same person--and in order to resolve the
| mess, again, you have to find out what uniquely identifies the
| people _ANYWAYS_.
|
| Now, even worser than worse--what if you have taken the advice of
| this article, and you _haven 't_ bothered to identify a set of
| fields which are genuinely unique to each person. You just have a
| social security number and a name. How are you going to even
| going to correct the fact that John Smith is in your system
| twice, when you have ten John Smiths? How can you possibly tell
| which two John Smiths are the same John Smith?
|
| Yeah, it take some time and careful thinking to properly come up
| with natural keys for your entities. But unless you do, you
| haven't actually specified your entities at all. This is one area
| where long years of experience really pay off. Expert data
| modelers spend years and decades honing their craft, observing
| the work of others, etc etc. Eventually they acquire the wisdom
| needed to be able to know what kind of information is really
| needed to nail down what kind of entity the database needs to
| know about.
| yuriks wrote:
| > Suppose you take the advice of this article, and use, say,
| social security numbers to identify people.
|
| You seen to have misunderstood the point of the article: the
| author is recommending NOT using the SSN (a natural key) for
| primary keys, and instead to use an artificial, automatically
| generated key, so that the SSN is decoupled from the record and
| can potentially be updated.
| rhelz wrote:
| But SSN's _are_ artificial, automatically generated keys.
| They are not natural keys, they cannot be natural keys for
| persons, because not every person has one.
|
| A natural key is a set of attributes which an entity has even
| if it is _not_ in your database.
| iblaine wrote:
| It's funny that indexing used to be unavailable when processing
| big data on Hadoop. Now that we have options to do so we now
| care. As tech evolves we adapt.
| wavemode wrote:
| Another massive annoyance with natural keys - privacy. If your
| table's primary key contains personal information, that PII now
| infects every other table that holds a foreign key to that table.
| hehdhdjehehegwv wrote:
| Extremely good point.
| whyever wrote:
| AFAIK, even using a UUID is still considered PII if it uniquely
| identifies the user.
| gregw2 wrote:
| Thank you for your comment. It spurred me to think about this
| issue more thoroughly in a way I hadn't, even though this
| comment may disagree -- to some extent -- with your
| perspective.
|
| I have not seen clear guidelines about whether an
| organization's surrogate keys for persons are considered PII.
| (And this ambiguity has frustrated me for some time as I am
| unclear whether to take an aggressive or conservative view on
| labelling PII where I work.) When I have read the guidelines,
| it seems ambiguous but on balance I think it disagrees with
| your implied claim (that PII is not present in a table that
| uses a surrogate foreign key to a person/user.)
|
| The definition of PII per NIST
| https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpubli...
| is:
|
| "Any information about an individual maintained by an agency,
| including (1) any information that can be used to distinguish
| or trace an individual's identity, such as name, social
| security number, date and place of birth, mother's maiden name,
| or biometric records; and (2) any other information that is
| linked or linkable to an individual, such as medical,
| educational, financial, and employment information."
|
| A surrogate key associated with a person is arguably "any other
| information that is linked or linkable to an individual"
| meaning that all tables containing the surrogate key remain PII
| and remain "infected". It's true the surrogate key only allows
| linkability within the context of the data ecosystem in which
| it resides, but such distinctions (of "internal" to the system
| vs "external from the system") are not made in the language of
| the definition. Additionally, from a pure risk and PII
| disclosure impact, usually the whole database gets dumped, not
| just the "non-person" tables. If you have financial/medical
| transactions in table "A" and a personal numeric ID linking
| data to a person table "B", both tables contain PII, right?
|
| From a privacy standpoint, if you can SQL JOIN the data to
| trace the person involved either within your system or even
| data reasonably obtainable outside your system, (or if an
| attacker can), it's PII.
|
| Intranet IP addresses are called "linked PII" in section 3.2.2
| of the above NIST guidelines for example, and NIST does define
| some related terms that would seem to apply to surrogate keys
| like: * Distinguishable Information: Information that can be
| used to identify an individual. * Linkable Information:
| Information about or related to an individual for which there
| is a possibility of logical association with other information
| about the individual. * Linked Information: Information about
| or related to an individual that is logically associated with
| other information about the individual.
|
| As a data engineer, the above interpretation means PII is in a
| zillion tables and labeling a table with a boolean indicator
| yes/no isn't that helpful. But as a policy person, NIST seems
| to be recommending gauging PII more at the system (not table)
| level and with a PII Confidentiality Impact Level of
| low/medium/high that takes into account the context and overall
| risk and that seems sensible. From a data cataloging standpoint
| (gauging what's "infected" to use your term), I think it's
| probably helpful to identify particular transactional tables or
| personal table as having "high" PII disclosure impact vs "low";
| the presence of "infection" from a virus ("PII") is mostly
| irrelevant in a sufficiently large system where viruses/some
| PII is inevitable but what matters is the severity/impact. A
| zillion tables will be "low" (e.g. if most tables have audit
| column saying which user last changed a particular record) but
| certain transactional or user tables may be "high" and should
| be recognized as such and the focus of any risk discussions
| with the business or legal or breach notifications to customers
| or what have you.
|
| Going back to your original point, I don't think the choice of
| natural vs surrogate key impacts the PII risk of the system or
| even its individual tables. I would slightly concede that a
| surrogate key (which in general I am in favor of) would make it
| easier to reduce a particular individual's PII from a system by
| concentrating it in one or a small number of tables with names,
| etc. which might be helpful for enabling GDPR right to be
| forgotten or something. But the degree of that PII elimination
| from a system by blanking out or archiving a particular
| user/person record is not necessarily reducing the PII for them
| to 0 at least definitionally unless the transactional records
| themselves are also removed as the AOL 2006 search data scandal
| demonstrated (where a woman identified solely by a surrogate
| key was able to be identified from her search term transactions
| alone.) (Legally there would appear to be carveouts around
| transactional deletion for some financial transaction records
| and backups, but IANAL...)
| wavemode wrote:
| Your perspective is very thorough and interesting, but I can
| simplify the matter a lot - yes, from my personal experience
| it absolutely does matter for privacy regulation whether you
| use natural or surrogate keys.
|
| I've worked on systems in the biotech space where certain
| tables and certain columns in our dataset were considered PHI
| (personal health information), and others weren't. An auto-
| incrementing user ID does not qualify as PHI under HIPAA.
| Whereas a person's name or email address most certainly does
| (since, the very fact that a particular person had received
| lab testing at a particular lab, is itself considered PHI.)
|
| Those PHI bits of data were scrubbed from the dataset that
| most regular employees had access to. Casually allowing all
| employees to have access to PHI would not have been compliant
| with the law.
|
| The restrictions on what can even be logged (when there's a
| system error, for example) was very much controlled. "User
| 12345 experienced <some exception>" was perfectly fine (and
| greatly aided operational investigation, and customer
| support.)
| gregw2 wrote:
| Thanks for the reply. I will concede or defer to you in
| regards to PHI and HIPAA... it seems the philosophy behind
| HIPAA/PHI is very different than PII or GDPR. HIPPA is
| prescriptive. PII/GDPR are principle-based. HIPAA, it
| seems, has some text that it's not PHI if the risk is "very
| small" based on the opinion of someone with statistical
| expertise documents that it could be de-identified OR if
| the person avoids an explicit list of 18 things that it
| cares about (see items (A) through (R) on page 96 of https:
| //www.hhs.gov/sites/default/files/ocr/privacy/hipaa/ad...
| ).
|
| One of the first 17 things might be a surrogate key
| ("account number") in one's system but if you look through
| the others, the rest are things like name, SSN, biometrics,
| IP#s, etc, which are definitely not surrogate keys.
|
| The "OR" language makes the statistical expertise (and
| "principles" of privacy) irrelevant if you avoid the 18
| things; that avoidance forms a "safe harbor" of sorts so
| you don't have to do any heavy thinking/lifting.
|
| The 18th ("(R)") element of what is considered PHI does
| seem to refer to surrogate keys but in a manner which
| creates a clear carve-out/safe-harbor for them not being
| PHI. That 18th form of PHI is "Any other unique identifying
| number, characteristic, or code, except as permitted by
| paragraph (c) of this section;"
|
| But that paragraph (c) section seems indicate identifiers
| such as surrogate integer/guid keys kept within a system,
| as long as they are 1) not derived from an individual's
| information (ie like integer or UUID surrogate keys) and 2)
| which are maintained solely in the system are not
| considered as element-18-"R"-PHI:
|
| "(c) Implementation specifications: Re-identification. A
| covered entity may assign a code or other means of record
| identification to allow information de-identified under
| this section to be re-identified by the covered entity,
| provided that: (1) Derivation. The code or other means of
| record identification is not derived from or related to
| information about the individual and is not otherwise
| capable of being translated so as to identify the
| individual; and (2) Security. The covered entity does not
| use or disclose the code or other means of record
| identification for any other purpose, and does not disclose
| the mechanism for re- identification."
|
| By my reading, a surrogate key maintained within a system
| is thus clearly not PHI under HIPAA. I never looked at the
| details of HIPAA until today since it hasn't applied much
| to my data and have been focused more on PII/GDPR. I
| appreciate you describing the context of your remarks.
| wavemode wrote:
| The other difference between HIPAA and GDPR is compliance
| rules. Under GDPR it is not a problem if all your
| employees who have access to your database have access to
| PII (since, from the perspective of GDPR, the customer
| has given consent to share their information with your
| organization as a whole). Sharing with third parties
| outside your organization is where you get into trouble.
|
| But under HIPAA, even your own employees need to have a
| specific, documented justification for accessing customer
| PHI. If they can do their job just as well by only
| accessing a non-PHI dataset, you're required by law to
| only allow them access to a scrubbed, non-PHI dataset.
| whyever wrote:
| I don't think GDPR is that different in that regard, you
| also have to minimize access to PII within the
| organization. See Article 25:
|
| > The controller shall implement appropriate technical
| and organisational measures for ensuring that, by
| default, only personal data which are necessary for each
| specific purpose of the processing are processed. That
| obligation applies to the amount of personal data
| collected, the extent of their processing, the period of
| their storage and their accessibility. In particular,
| such measures shall ensure that by default personal data
| are not made accessible without the individual's
| intervention to an indefinite number of natural persons.
| randomdata wrote:
| > SSN ... which are definitely not surrogate keys.
|
| Surely SSN is a surrogate key? They are not naturally
| derived. The early ones were serial (i.e. an auto-
| incrementing field) and more recent ones are randomly
| generated (i.e. a UUID).
| shkkmo wrote:
| SSN is absolutely not a surrogate key. If you received a
| piece of information from an external source, it is data,
| not a surrogate key.
|
| If you use data as a key, then that is a natural key, if
| you invent a value to use as an identifier, that is an
| artificial or surrogate key.
|
| If an API provides you an ID for a record, that is data.
| If you use it as a key, that is also a natural key in
| your system.
| randomdata wrote:
| _> If you received a piece of information from an
| external source, it is data, not a surrogate key._
|
| It may not be _your_ surrogate key, but it is someone 's!
| shkkmo wrote:
| Possibly, you don't actually know since it is external
| data.
| randomdata wrote:
| We do know because the database owner has openly talked
| about how the keys are derived. That still doesn't make
| it a good key for your database, but I can assure you
| that the world doesn't revolve around you. It is
| someone's surrogate key - therefore it is a surrogate
| key.
| hathawsh wrote:
| Conceptually, any information created or consumed outside
| your organization is not valid as part of a surrogate
| key, so SSNs are not a surrogate key. Furthermore, any
| time you reveal the primary key, that key can become the
| thing that people come to depend upon to find that
| database row, which leads to the possibility that
| someday, someone will have an important need for some
| primary keys to change, even if the primary key was
| supposed to be a surrogate key. The longer a database
| lives, the more likely that surrogate keys morph into
| natural keys. That's what happened to SSNs.
| twodave wrote:
| Agreed, and also they tend to be more guessable. Make a page
| available with an email address as the id and just watch the
| hackers use it to discover users of your service and attempt to
| log in as them.
| sgarland wrote:
| That is an AuthZ problem, orthogonal to DB key debates.
| wodenokoto wrote:
| Just to add on to the CPR number. As mentioned it contains a
| birthdate and a gender identification as well as a checksum.
|
| Since many refugees don't know (or can't prove) their birthdate,
| they are given first of January. And enough first of Januaries
| are handed out that for some years there aren't enough valid
| numbers, so numbers that fail the checksum are handed out too.
| knallfrosch wrote:
| Shows how the assumption that everyone has, or reveals, a known
| or at least approximate birth date cascades through the system.
| ccppurcell wrote:
| Here in Czech republic, everyone has an id number (rodne cislo)
| but as a foreigner, I have multiple id numbers. Further
| complicating matters, I was given a rc with the wrong gender (the
| gender is encoded in the number) and my nationality was at some
| point incorrectly listed as "Ireland". I do wonder if the Czech
| government thinks I am just two completely different people.
| Culonavirus wrote:
| seems like a problem, you should get that czeched out
| jpnc wrote:
| The title could probably be extended to 'You'll regret using
| natural keys as primary keys' and it would be right in some ways.
| Personally, I've come to the conclusion that it's probably best
| to use both surrogate and natural keys. Surrogates (IDENTITYs,
| UUIDs what-have-you) as PK from a technical perspective and
| natural key as 'PK' from a business/data modeling perspective.
| __MatrixMan__ wrote:
| Natural keys are often the best you can do if you're trying to be
| partition tolerant.
|
| With surrogate keys you end up with duplicate entities whenever a
| partition heals (supposing the entity appeared to both partitions
| while they were still separate).
| karmakaze wrote:
| A simpler and more compelling reason in my experience is that
| relating tables is so much easier with synthetic keys. You can
| decide how large your synthetic keys are based on uniqueness
| requirements, natural keys have their own size and format that
| would then be copied to make relations. And if uniqueness depends
| on several values to be unique then dealing with compound foreign
| keys isn't fun or efficient. At this point, there would have to
| be a very compelling reason to want to use natural keys to offset
| this and I've rarely seen them. A case that may come up is for a
| miscellaneous table to use the synthetic key from another table
| with another natural value that uniquely identifies a row. I
| wouldn't choose this route if you ever think you want to relate
| anything to rows in this table, they should be leaves related
| only to the synthetic key being used.
| kijin wrote:
| "Natural keys" just means that someone else can change them under
| your feet.
|
| People change names, their phone numbers, passport numbers.
| Governments change their numbering schemes all the time.
| Corporations merge and split. Heck, even governments merge and
| split, surprisingly often, at the municipal level. All without
| your knowledge or consent. And you're left wondering why you
| suddenly have duplicate key errors in your database.
|
| The only key that you can trust is one that you, and only you,
| control. That's the point of the surrogate key. Whether it's
| BIGINT or UUID v7 is beside the point.
| tedunangst wrote:
| Every blog post about database schema should use as its example
| the most obvious example, a blog. People are pretty familiar with
| the domain, and it even has many of the same challenges. You
| probably have a postid field in there somewhere, but lots of blog
| software also uses url as a natural key. Which is why sometimes
| the old title appears in the url. Can't change that. Except you
| can. Some software will do redirects, by creating stub records
| for old URLs that map to new ones. So that's not an intractable
| problem. Etc.
| pphysch wrote:
| The article should be renamed "natural primary keys". Natural
| primary keys have fatal flaws, but natural "secondary" keys are
| all sugar. Best of both worlds.
| MikePlacid wrote:
| > I already predicted that changing the number would prove to be
| either impossible, or have all sorts of cascading effects,
| ultimately terminating in official records no longer recognizing
| that the car is mine.
|
| Oh my, I can feel that pain. Here's what happened last week,
| caused by a change in... SSN? no - in our home address.
|
| My wife earned some unemployment benefits three years ago, which
| were put on a plastic card issued by The Bank. She finally found
| time to access the funds (she's a middle school teacher), but
| when she went to The Bank, they said they didn't have the money
| anymore--they'd sent it back to California. So, she called
| California. They were like, "No problem, we'll send you a check.
| Oh, you have a new address? Let's change it. Wait, what is
| happening... oh, now your account is locked, it says: potential
| fraud"... They needed a supervisor to unlock it, which took 20
| minutes. The supervisor unlocked the account, but because it was
| marked as potential fraud, they couldn't mail a check anymore.
| Instead, they linked the account back to The Bank (20 more
| minutes), and she had to go there in person with her ID to get it
| checked.
|
| So, she went to The Bank. But you can't just walk in The Bank and
| show your ID to get it checked; you need an appointment. And to
| get an appointment, you need an account with The Bank. But her
| California benefits account? Oh, it is marked as potential fraud
| - it didn't count. So, they spent 20 minutes to open a new
| account for her. She got an appointment for later that day, in 4
| hours, went back to The Bank, and had her ID checked (yes, the
| second time in one day - they need to check your ID to open an
| account too).
|
| Did she get the money then? Of course not--the account is still
| flagged as potential fraud. No cash possible. Call California
| again. California agreed to mail the check to the new address,
| probably by some oversight.
|
| So my question is: do these "potential fraud" flags in databases
| ever die a natural death?
|
| With some hope, sincerely, a Husband of a Potential Fraudster.
| knallfrosch wrote:
| > Mailing a check to transfer money
|
| Real 1800 vibes here.
| the_gipsy wrote:
| In Spain the ID numbers are assigned at birth, carry no
| information, and cannot be changed. Each police comissary that
| registers birth gets a unique set of IDs to assign (per year or
| whatever).
|
| However. Mistakes still happen. A colleague had the same ID as
| someone else. He said he tried to change it, but it was
| impossible because it was such an impossible concept to any
| public servant involved. In the end he gave up and just lived
| with the fact.
| fcatalan wrote:
| Not really assigned at birth, but at the time you first request
| the card. It's a very very very crappy database ID number:
|
| -There's a significant amount of duplicates, enough that if
| your database is big enough you will find one sooner or later.
|
| -Format is variable: There are older 7 digit numbers, modern 8
| digit ones, some people pad the 7-digit with a 0, some don't,
| some consider the CRC letter part of the key, some don't, some
| just append the letter, some hyphen it. If you deal with manual
| data entry anywhere, you will suffer.
|
| -Foreign people exist: You start using their passport number,
| the format of your key is now completely arbitrary, can't
| validate it. If you were using 8 digit NIFs as keys now some
| countries use 8 digit passports, increasing your chances of
| duplicates.
|
| -Foreign people stay: They get a resident card and they use
| that. Format is almost like NIF but not really so you need to
| account for that. Someone you registered initially with a
| passport now has the card and registers for something else with
| it. Have fun cleaning their duplicated identity from the whole
| system.
|
| -Foreign people become Spanish: Now they get a NIF. If you have
| dealt with them for a long enough time, have fun again fixing
| their records for the second time.
| benhurmarcel wrote:
| Foreigners change from a NIE to a new DNI if they get Spanish
| nationality. So their ID number changes then.
| shkkmo wrote:
| Even if this is true now (which it isn't entirely), that is no
| guarantee that it will be true in the future. Any time you use
| an external value as a key, you run the risk of a policy
| change. An internally generated artificial key is the only safe
| way to ensure immutability and lack of semantic content.
| EVa5I7bHFq9mnYK wrote:
| How is name and city a key for a restaurant? There are a dozen of
| mcdonalds in my city.
| layman51 wrote:
| That combination of columns being a candidate key depends on
| the world or domain that your dataset is modeling. In this
| example, I think it was a list of the top 50 restaurants in the
| world and so it would make a fine key. But if that database was
| ever expanded then definitely you would run into collision
| issues.
| perlgeek wrote:
| Natural keys have a tendency to change over time in what is
| considered unique.
|
| For example, you have a company, and every employee has a unique
| employee number generated by HR... until the company merges with
| another, that also has unique employee numbers, and suddenly the
| identifier becomes the tuple (organization, employee number) that
| becomes unique. If you've used the employee number as a foreign
| key in other tables, you have to change those too.
|
| This is a somewhat contrived example, but I've had enough real,
| annoying examples happen to me in my career that I avoid natural
| keys.
| knallfrosch wrote:
| Ah, how I remember changing my name. People think it only
| happens now and only to 0.1% of people who decide to change
| their gender in a leftist Western society, but somehow they
| forget that people have been changing family names for
| centuries. And of course, my work email address also changed.
| Now THAT is fun.
| raverbashing wrote:
| > Many were the times, earlier in my career, when I decided to
| use a 'natural key' as a key in my own database. As far as I
| recall, I've regretted it every single time.
|
| Correct. Infamously correct
|
| Every "natural key" will have some way of fumbling things down
| the line.
|
| Nothing is unique, not even joining your natural key with other
| deduplication info. Just save yourself the trouble.
| DeathArrow wrote:
| Synthetic key is the easiest choice. Because the primary key in
| one table is going to be foreign key in another table. If I have
| an employee table I can use their email address as the key but
| then I will have to use the email in any other table that
| references employee table, such as salary table, days off table
| and so on.
|
| Most of the times using an auto incrementing integer or UUID is
| just fine.
|
| In NOSQL there's usually a synthetic ID used to uniquely identify
| a document, I've never seen people using natural keys or compound
| keys.
| webprofusion wrote:
| As a junior dev decades ago I was once asked to change a
| perfectly good database (with integer IDs) to use natural keys,
| so to look up (or SQL join to) a particular office for instance
| your query would filter on Country, City and Street with the
| expectation that you get one result.
|
| This was the dumbest technical decision I have ever been asked to
| be a part of.
| gpvos wrote:
| In the example given, the most logical natural key to me would be
| (year, rank). Why otherwise would they have the year in there?
| mathgorges wrote:
| That was my first thought too. But like TFA says, that still
| wouldn't prtect you from clerical errors.
| moi2388 wrote:
| I've just always considered natural keys public keys, and added a
| surrogate key as private key.
|
| From the outside you won't see it, but internally it saves a lot
| of headaches.
|
| Space, speed, migrations, I have just never seen an actual
| downside to using a surrogate.
|
| And if you are _absolutely_ sure, add some unique constraints.
| Easier to change when you _inevitably_ have to.
|
| Mainly; I've learned that my initial assumptions are never
| correct.
| samtheprogram wrote:
| I technically use a natural key, if you could call it that, when
| using Supabase. Having a `public.users` table with a primary key
| that's the same as the key for the internal user auth table (from
| Supabase).
|
| Arguably not a natural key, or at least a contrived example, but:
| any downsides?
| mobiuscog wrote:
| It's also important to put as much thought into any synthetic
| key.
|
| It's _easy_ to just use an auto-generated sequence... but then
| you start having to export /import or otherwise merge data, and
| other manipulations that often use the primary key and find there
| are collisions everywhere. There can also be problems when
| needing to support multiple databases, or update versions.
|
| UUIDs (or equivalent synthetic keys that are independent of the
| database itself) are often the best answer for this reason.
|
| Having been bitten by sequences so many times in the past, I find
| them to often be more trouble than natural keys in the first
| place - just a lazy approach.
| jonathan-re wrote:
| Well, UUIDs bring their own challenges. Dropping that one here:
|
| Be Careful with UUID or GUID as Primary Keys
| https://news.ycombinator.com/item?id=14523523
| globular-toast wrote:
| It's pretty easy to migrate to UUIDs later if required, as long
| as you don't leak your keys.
|
| My work made their own uuid-like scheme, similar to UUIDv1
| which incorporates several elements like machine ID and
| timestamp. The mistake was twofold: first they exposed them (so
| people started using them) but, worse, they made them easily
| reversible, so people started decoding the information in them.
| People would naturally see one and think "oh this is a record
| from place X". Of course that might not be true following
| subsequent data corrections, but the key can't change.
| akvadrako wrote:
| UUIDs are only good when you don't care about ergonomics or
| performance.
|
| Snowflake IDs are much more reasonable. 41 bits timestamp (ms)
| + 10 bit machine id + 12 bit serial.
|
| But if you care about ergonomics and privacy the most then
| short and random IDs are really the best.
| otikik wrote:
| One of my first jobs involved an account management software
| which created an ID for each user based on first name and
| surname. So John Smith would get SMIJOH. It was the user natural
| key and it could not be changed once created.
|
| One woman got married and changed her name; se became really
| upset when she found out that we couldn't change her login.
| canary-hill wrote:
| What has worked best for us is to have a sequential integer as
| the primary key as well as UUIDv4 as a surrogate key for every
| row. The applications would expose the objects only using uuid
| and never the primary key. The primary key always remain internal
| at the database level and never gets referenced or used at the
| application layer. In this way the security and privacy of
| objects are maintained and having sequential integral primary key
| for the object ensures the database design remains robust.
| Essentially, separating out the key for the database and the
| application. Is there any potential issue with this approach?
| sigtstp wrote:
| If anyone works in bioinformatics, please, please, for the love
| of god, generate your own unique IDs. Database identifiers are
| not generally unique (the same id might get reused for e.g.
| protein variants). Even sequences are problematic: I've found the
| same seq with different ids (can't remember the db now), and they
| can change (sequencing or human error might've occured and they
| get updated).
| teaearlgraycold wrote:
| What I'm doing right now is putting synthetic (auto incremented)
| ids on every table, but also adding lots of compound unique
| indices. We have a lot of long background processes that might
| need to be restarted in the middle. The natural key definition
| allows for easy idempotency in subsequent runs (the jobs only
| upsert with. collisions handled by a simple timestamp update).
|
| Most entities are tied to a version record (more or less a God
| object). But _some_ span across versions. I'm able to have the
| entities that aren't tied to a version float around because they
| reference the natural key of the versioned entities, minus the
| version's synthetic key.
| strnisa wrote:
| I've become a fan of unique, relatively short and "human-
| readable" IDs, such at the ones used by Stripe, e.g.
| `cus_MJA953cFzEuO1z` for an ID of a customer. Here's a Stripe dev
| article on the topic: https://dev.to/stripe/designing-apis-for-
| humans-object-ids-3...
|
| If you use JavaScript/TypeScript, you can make them like this:
| function makeSlug(length: number): string { const
| validChars =
| "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
| const randomBytes = crypto.randomBytes(length); let
| result = ""; for (let i = 0; i < length; i++) {
| result += validChars[randomBytes[i] % validChars.length];
| } return result; } function
| makeId(tableName: string): string { const idSpec =
| TABLE_NAME_TO_ID_SPEC[tableName]; const prefix =
| idSpec.prefix; const slugLength = idSpec.length -
| prefix.length - 1; return
| `${prefix}_${makeSlug(slugLength)}`; }
| vanderZwan wrote:
| Nice!
|
| For the record: the valid chars string is 62 characters, so
| naively using a modulo on a random byte will _technically_
| introduce a bias (since dividing 256 values by 62 leaves a
| remainder). I don 't expect it to really matter here, but since
| you're putting in the effort of using crypto.randomBytes I
| figured you might appreciate the nitpick ;).
|
| Melissa E. O'Neill has a nice article explaining what the
| problem is, and includes a large number of ways to remove the
| bias as well:
|
| https://www.pcg-random.org/posts/bounded-rands.html
|
| (in this case adding two more characters to the validChars
| string would be the easiest and most efficient fix, but I'm not
| sure if that is a possibility here)
| naasking wrote:
| > For the record: the valid chars string is 62 characters, so
| naively using a modulo on a random byte will technically
| introduce a bias
|
| Indeed, there's no reason you couldn't just add "_" and "-"
| or "." as well to complete the set. Your identifier will
| still be URL-safe. I've been using this type of encoding for
| years [1] for these kinds of ids to use in URLs, and
| encoding/decoding is super-fast with some bit shifts. And
| unlike Base64, you don't need padding characters.
|
| [1] https://sourceforge.net/p/sasa/code/ci/default/tree/Sasa.
| Web...
| notfed wrote:
| - and _ tend to break text selection.
| naasking wrote:
| I'm not sure what you mean by "break". If you mean that
| touching or double-clicking on a block of text only
| extends up to the nearest symbols, that's true. But if
| your text selection UX is not terrible then it should be
| simple to extend that further.
|
| That said, iOS and Android text selection have gotten
| worse recently, IMO.
| whyever wrote:
| Please don't use % to generate integers from a range, it's not
| uniform, which can be disastrous if you rely on your numbers
| not being predictable. You can use crypto.randomInt instead.
| CodesInChaos wrote:
| That is definitely an improvement, but I find that concern a
| bit exaggerated.
|
| If all you need is unpredictability, a minor bias is sub-
| optimal but not disastrous. The 256 % 62 bias used here
| should reduce the min-entropy per character by 5%. And you
| can easily minimize the bias by using larger integer than a
| single 8-bit byte. There few algorithms where minor biases
| cause a disaster, like DSA.
| vincentdm wrote:
| I don't like the prefix idea: besides the duplication of
| information, it also becomes a liability if you ever rename
| things.
|
| Imagine you prefix all customer IDs with `cus_`, but at some
| point decide to rename Customer to Organization in your
| codebase (e.g. because it turns out some of the entities you
| are storing are not actually customers). Now you have some
| legacy prefix that cannot be changed, is permanently out of
| sync with the code and will confuse every new developer.
| codeulike wrote:
| I wouldn't worry about that - I still think its worth it.
| I've had systems which during development we thought the
| Contact page was going to be called 'Contact' in the UI but
| at the end it got re-labelled to 'Individual' but in all the
| code it was still called Contact and the IDs all started with
| a C - but you know what? It was still useful to look at an
| ID, see the C and know that it was an Individual.
| shawabawa3 wrote:
| > Now you have some legacy prefix that cannot be changed
|
| Yes you can.
|
| You can support the old cus_<ID> prefix as well as the new
| org_<ID> prefix, but always return org_<ID> from now on
| wongarsu wrote:
| Reddit prefixes their IDs with t1_ for comments, t2_ for
| accounts, etc. That sidesteps the renaming issue.
|
| Though I believe they mostly do it because their IDs are
| sequential, so without prefix you wouldn't easily notice if
| you use the wrong kind of id. They also only apply prefixes
| at the api boundary when they base36 encode the IDs, the
| database stores integers
| chuckadams wrote:
| I have that exact issue with a couple different identifiers,
| and it's not a big deal. Usually it goes along with some data
| model change you already have to write compatibility code
| for, the new and old names tend to be related, and the old
| name tends to stick around other parts of the code anyway.
| Opaque IDs don't reduce the confusion there, documentation in
| appropriate places does.
| codeulike wrote:
| Except the Stripe ones are case sensitive which can be annoying
| with some databases
| datavirtue wrote:
| You know what's annoying? People storing GUIDs in a case-
| insensitive database.
| codeulike wrote:
| Classic RFC9562 GUIDs were often represented as hex like
| {550e8400-e29b-41d4-a716-446655440000} and so were case
| insensitive
| garblegarble wrote:
| This approach will randomly generate profanity. If the ID is
| visible to users it can cause some to get upset (and best-case
| simply looks unprofessional). On a purely technical level if
| visible in URLs it can cause links to be blocked/altered by
| e-mail filters / filtering proxies.
|
| It's generally a good idea to drop vowels for this reason.
| account42 wrote:
| fvck cvnt b1tch
| Knufferlbert wrote:
| We once had a rather angry Irish customer calling our support
| complaining that we called him a pikey (slur for gypsy).
| After some back and forth it turns out we just gave him an
| apikey.
|
| We never had a similar issue with our random
| numbers/letters/reset passwords or anything like that which
| don't have any kind of "dont return profanity" protections.
| Though I agree, someone getting a randomly generated customer
| portal url or something containing fuck or similar would look
| bad. Our cloudfront or something (or was it main public
| facing s3 bucket? can't remember) starts with "gay" and was
| never picked up on.
| biztos wrote:
| Totally off topic, but "gypsy" is itself a slur for Romani
| people in much of the world.
|
| https://en.wikipedia.org/wiki/Romani_people
| isaacremuant wrote:
| I'd say those articles show Wikipedia's political bias
| and a tendency to overly politically correct instead of
| portraying reality.
|
| Many in real life do refer to themselves as gitano or
| Gypsy and would ask for others to refer to them as such.
|
| Of course it's very easy to find an article saying
| otherwise and then using that as the end of discussion
| for Wikipedia editors.
| coooolbear wrote:
| By 'find an article' you mean find ~10 real citations
| including the resolution of an authority a long time ago
| and to tell the reader it is not clear or definitive?
|
| Better to be careful and let any individuals or
| communities tell you what they want. I have Roma
| connections in my family and at one point the word we'd
| use is 'gypsy'. But, because I'm not Roma myself, if I
| came across some other group I wouldn't assume I'm just
| allowed to say it to them.
| vsuperpower2020 wrote:
| I don't care what they want. A lot of people are tired of
| playing these language games.
| bigstrat2003 wrote:
| Also, "do what people want" is fine for your interactions
| with an individual. But it's not a viable general rule
| for language, where we need one single approach. I think
| saying gypsy unless someone personally tells you they
| would rather you don't call them a gypsy is perfectly
| reasonable.
| mst wrote:
| It's in the annoying category where it _can_ be used as a
| slur but also gets used as not-a-slur, including but not
| limited to by the people it describes.
|
| Locally (north west england) people generally use
| "traveller" as a description ... but there are definitely
| people who use _that_ as a slur.
|
| Language be like that sometimes.
| ThePowerOfFuet wrote:
| I hear Scunthorpe is lovely this time of year.
| brazzy wrote:
| I hear someone got buttbuttinated there recently.
| mst wrote:
| A problem we've been having since the medireview age.
| bigstrat2003 wrote:
| At first I thought that this was the result of my cloud-
| to-butt extension, and I was trying to figure out what
| "cloudcloudinated" meant.
| fregonics wrote:
| True! For instance the example itself. "cus" is profanity in
| Portuguese. If you were to localize the application, this
| would be a factor.
| theophaniel wrote:
| I feel like https://github.com/jetify-com/typeid is the
| solution to this
| mettamage wrote:
| Why is such a thing called a slug?
| inopinatus wrote:
| It's an old typesetting term that found its way into content
| management systems.
| https://archive.nytimes.com/www.nytimes.com/times-
| insider/20...
| corobo wrote:
| Ooh I had this wonder a while back and jotted it down just in
| case anyone else ever wondered about it:
|
| Comes from the ye olde paper-based blogs they call
| newspapers. When an article is being put together it's given
| a short name, sort of like a project name. This name would
| remain the same throughout the article's life - from reporter
| through to editor - it left its trail through the process.
| Like a slug.
| meindnoch wrote:
| That's not what Wikipedia says about its etymology though.
|
| >The origin of the term slug derives from the days of hot-
| metal printing, when printers set type by hand in a small
| form called a stick. Later huge Linotype machines turned
| molten lead into casts of letters, lines, sentences and
| paragraphs. A line of lead in both eras was known as a
| slug.
| corobo wrote:
| Ooh interesting. It looks like I have either
| misinterpreted or found a source that misinterpreted (it
| was a few years back, unsure if I came to the conclusion
| or found it). I'll have to update my notes, cheers!
|
| Apologies for the wetbrain hallucination, HN!
| 8372049 wrote:
| I'm not sure you're hallucinating. The dictionary I
| checked lists the printing and journalism terms
| separately. It's quite possible they have diverging
| etymologies, meaning both can be correct:
|
| _5. Print. a. a thick strip of type metal less than
| type-high. b. such a strip containing a type-high number
| or other character for temporary use. c. a line of type
| in one piece, as produced by a Linotype. 8. Journalism.
| a. a short phrase or title used to indicate the story
| content of a piece of copy. b. the line of type carrying
| this information._
| hprotagonist wrote:
| or the journalism term itself diverged from the
| typographic one.
| corobo wrote:
| Aye this is what it seems to be having double checked the
| reply's claim
|
| Got to the Wikipedia page
| https://en.wikipedia.org/wiki/Slug_(publishing) which
| could possibly support the slimy conclusion of "it's a
| trail through the process" but that article has an
| etymology section that refers to the metal slug
|
| I guess it could mean both depending on whether you're
| looking for the meaning of the word or the meaning of the
| concept but I didn't find any other slimy grub references
| (via an admittedly limited double check)
| hprotagonist wrote:
| another fun etymological rabbit hole for you: stereotype
| and cliche both probably originated as typographer
| jargon.
| bambax wrote:
| Nice, but better to remove ambiguous / hard to read letters:
| ijlo, IJLO, and 01 (and maybe 7 as well?)
| inopinatus wrote:
| I'm not convinced that all Stripe IDs are wholly random
| strings. I just decoded the base62 part of four genuine "acct_"
| objects, which at 16 characters are just shy of representing a
| 12 byte value (log2 62^16 =~ 95.3), and they all have a leading
| byte of "00000011" and two of them even have a leading 32 bits
| that is very suspiciously close to an epoch timestamp of a
| couple of years ago.
|
| There is a similarly suspicious pattern in some of their longer
| identifiers, invoices for example at 24 characters (ca.143
| bits) all seem to have a first byte of "00000010".
|
| Even in the article you've linked to, look closely at the IDs:
| pi_3LKQhvGUcADgqoEM3bh6pslE
| pm_1LaXpKGUcADgqoEMl0Cx0Ygg cus_1KrJdMGUcADgqoEM
| card_1LaRQ7GUcADgqoEMV11wEUxU
|
| Notice the consistently leading low-integer values (a 3, then
| three 1s)? and how it's almost always followed by a K or an L?
| That isn't random. In typical base62 encoding of an octet
| string, that means the first five or six bits are zero and the
| next few bits have integer adjacency as well. It also looks
| like part of the customer ID (substring here, "GUcADgqoEM",
| which is close to a 64-bit value) is embedded inside all of the
| other IDs and then followed by 8 base62 characters, which might
| correspond to 48 bits of actual randomness (this is still
| plenty, of course).
|
| Based on these values it seems there's a metadata preamble in
| the upper bits of the supposedly "random" value, and it's quite
| possible that some have an embedded timestamp, possibly
| timeshifted, and a customer reference, as well as a random
| part, and who knows maybe there's a check digit as well.
|
| It's possible - albeit this is not analytical but more of a
| guess - that the customer ID includes an epoch-ish timestamp
| followed by randomness or (worst case, left field) is actually
| a sequence ID that's been encrypted with a 64-bit block cipher
| and 32 bits of timestamp as the salt, or something similar (pro
| tip: don't try that at home).
|
| My view is that either Stripe's engineering blog is being
| disingenuous with the truth of their ID format, or they're
| using a _really_ broken random value generator. If the latter,
| I hope it 's only in scope of their test/example data.
| dividuum wrote:
| I took a look at a bunch of stripe customer ids I have stored
| and at least mine look very random on first glance. I assume
| their blog post uses demo keys or something similar.
| wslwsl wrote:
| In the comments it's mentioned, that the IDs contain a shard
| key for faster lookups.
|
| https://dev.to/stripe/designing-apis-for-humans-object-
| ids-3...
| strnisa wrote:
| Thanks for all the improvement suggestions! Taking them into
| account, the `makeSlug` function becomes:
| function makeSlug(length: number): string { const
| alphabet = "0123456789abcdefghjkmnpqrstvwxyz"; let
| result = ""; for (let i = 0; i < length; i++) {
| result += alphabet[crypto.randomInt(alphabet.length)];
| } return result; }
| biztos wrote:
| I think that 0 and 1 are likely to cause problems when
| customers end up reading their "user ID" back to your
| employees in Customer Support Country over the phone.
|
| "It's one-three-oh-dee-ee-el. Yes, I'm sure, EL as in
| elephant."
| jagged-chisel wrote:
| One can convert those to the correct characters. Since "oh"
| and "el" are excluded from the alphabetic range, they
| become "zero" and "one" - deciding whether that is done in
| software for the help desk, or in the brain of the help
| desk staff is left as an exercise to management.
| smaudet wrote:
| I wonder if Unicode could be used to alter the characters
| such that these mistakes would be less possible, e.g. using
| 0.
| fhars wrote:
| That is "AT", isn't it?
|
| (No, it isn't, if you look closely enough.)
| smaudet wrote:
| Yeah...what I'd _really_ like to do would be to give a
| character a "natural" background color, e.g.
|
| Then its simple for support to say "red is one, green is
| ell". But you can't just add a color to a character,
| because copy paste/rich formatting don't work everywhere,
| or even transfer well...
|
| Alternatively, if you use 0 and 1. it matters less if the
| user says "at" or "zero", and more that they didn't say
| "oh" or "one".
| Piskvorrr wrote:
| It's obviously a . Or perhaps a # . Maybe an A3/4, on odd
| Wednesdays?
|
| (Unicode has its strengths. Making up replacement
| characters isn't one.)
| codr7 wrote:
| I happen to know that the biggest ski resort reservation
| system in Scandinavia contains a function called
| MaybeOnATuesday(), but to my knowledge it's never called.
| WorldMaker wrote:
| If you are trying for URL safe, Unicode is problematic
| because of Punycode conversions and differing browser
| behavior with Unicode URLs. (Some browsers always show
| Unicode as Unicode in URLs. Some browsers always show
| Unicode as Punycode in URLs. Some browsers switch between
| the two based on a huge number of variables such as gTLD,
| user preference, phase of the moon, etc.)
| gvx wrote:
| Anything that needs to be read over the phone should
| probably be written out using something like the NATO
| phonetic alphabet, split into smaller chunks if needed:
| "The code? It's kilo eight niner; one three mike; delta
| echo lima."
| 8372049 wrote:
| Having come from a military background where using that
| is second nature, I'm constantly surprised how rarely I
| meet civilians who understand it effortlessly. When
| picking up a package I say "the code is Oscar Foxtrot
| three-fife" and you see the person processing for a long
| time to extract the first letter of the word. I've
| started saying "OF, that's Oscar Foxtrot, 3-5" to help
| them out.
|
| In other words, asking a customer/consumer to be able to
| recite something in phonetics is not realistic in most
| cases.
|
| Fortunately the code already takes this into
| consideration and removes ambiguous characters.
| hprotagonist wrote:
| digikey phone personnel all speak NATO. it's wonderful.
| hobs wrote:
| First thing I drilled into Apple phone support folks.
| mannykannot wrote:
| My experience in the USA is that if I don't include the
| phrase "as in" (as in "X as in Xray") most people still
| will not realize what I am doing (the alternative "for"
| can be confused with the digit.)
|
| I also ask them to check my readback of key information
| they have given me and vice-versa; usually that works
| well.
| 8372049 wrote:
| i, L and o are left out of the alphabet in the snippet, so
| there's not really any ambiguity.
| jgalt212 wrote:
| Or just drop the L and O, like CUSIP does.
| kstenerud wrote:
| I developed safe32 for this reason.
|
| https://github.com/kstenerud/safe-
| encoding/blob/master/safe3...
|
| Notably, confusable characters are interchangeable when
| being ingested (although a machine encoder MUST always
| produce canonical output).
| https://github.com/kstenerud/safe-
| encoding/blob/master/safe3...
|
| So a user can confuse 1 for l, 0 for o, I for l, u for v,
| uppercase, lowercase etc, or the agent can say any of those
| over the phone, and it won't matter.
| vanderZwan wrote:
| Oh that looks well-thought out and is probably the sanest
| way to solve this particular problem!
|
| It's obvious why the safe64/safe80/safe85 cannot do this,
| but is there a reason why the safe16 version doesn't have
| the same features?
| kstenerud wrote:
| Oh whoops that's an oversight! I'll fix that up tonight.
| what wrote:
| Isn't this just crockford encoding?
| kstenerud wrote:
| Every base-32 style encoding is Crockford at the core.
| The difference is in the alphabet, and also whether it
| requires padding or not (safe32 does not).
|
| Crockford also incorporates error correction, which is
| unnecessary in modern systems since the underlying
| protocols do that already.
| ss64 wrote:
| Another approach is using base31 with all vowels removed
| https://ss64.com/ps/syntax-base31.html
| kstenerud wrote:
| You still have the problem of 1 vs l. And also it doesn't
| support user-error (reading 0 as O, or reading V as U).
| inopinatus wrote:
| My favourite ambiguous readback is to say "M for Movember".
|
| I have a note from a few years ago that 367CDFGHJKMNPRTWX
| may be a sufficiently unambiguous alphabet. Drop the one
| you like the least (probably N) to obtain a faux hex
| encoding.
| kodisha wrote:
| Or, just use cuid2 [1] + prefix.
|
| [1] https://github.com/paralleldrive/cuid2
| fnordsensei wrote:
| One thing I've been looking for in an ID generator is a way
| to supply a blocklist. There are a number of character
| combinations I'd like to avoid in IDs, because they might
| be offensive or get stuck in filters when copy-pasted (e.g.
| in a URI).
|
| This can be solved in user space by regenerating if the
| character sequences are detected, but this a) skews the
| distribution, and b) potentially takes time, especially
| when the ID generator is made to not be "too fast". I want
| to generate a single ID _that passes the blocklist_ in a
| timeframe that is not too fast, if that makes sense.
|
| Is there an ID generator that takes this into
| consideration?
| ceejayoz wrote:
| Just take out the vowels and numbers that can look like
| vowels. Nixing 0 means no b00bs IDs, and avoids 0/O; I
| usually take out 1/I as well.
| fnordsensei wrote:
| That goes some of the way, but I can think of a few
| problematic sequences that are only consonants and/or
| numbers.
| ceejayoz wrote:
| I like to nix vowels and things that look like them, i.e. 0,
| to avoid random b00bs sort of tokens.
| travisgriggs wrote:
| Let's not get sexist here! It also avoids random d1ck and
| c0ck sorts of tokens.
| madcaptenor wrote:
| Sure, but I can see people still getting offended if "fck"
| showed up.
| Tangurena2 wrote:
| This is why an open source project is now "CK Editor". It
| was the author's initials, but too many people saw an
| extra vowel in the name of the project.
|
| https://en.wikipedia.org/wiki/CKEditor
| inopinatus wrote:
| Having a zero in your alphabet can be problematic, because
| leading zeros are often stripped (e.g. Excel notoriously
| mangles phone numbers thinking they are integers).
|
| Multiple calls to a randomness generator can be expensive,
| and a waste of entropy; production-scale random string
| generators should still respect this and ask for a block of
| bytes, then encode them, but with bias correction. You're off
| the hook in this case, I think node's implementation of
| randomInt is doing exactly that for you and conserving
| remaining entropy in a cache.
| scrollaway wrote:
| I don't recommend using random bytes for this. What I have done
| in my previous projects is take a uuid6, reserve a couple of
| bytes inside of it to replace with an object ID, and I convert
| that to obj_XXXXXXX.
|
| This means you can store them in the db not as a string but as
| a uuid, which is a lot more performant. You also get time
| stamping for free.
| cogman10 wrote:
| A problem with this approach is it's not monotonical.
|
| Especially if you want to use this thing as an index in a
| database, you'll run into problems where you try doing middle
| insertions frequently, which causes fragmentation.
|
| The solution to this problem is making the higher order
| characters time sorted [1]. You don't need to go all out like
| uuid, you can have a pretty low resolution. It's more important
| that new insertions tend to be on the same page. If you have a
| low frequency insertions then minute resolution is probably
| good enough. (Minutes since 2000 is an easy calculation).
|
| To implement that here, I'd suggest looking at how base64 or 85
| encoders work and use that instead of repeated mods. You can
| then dedicate the upper bits to a time component and the lower
| bits can remain random. [2]
|
| [1] https://vladmihalcea.com/uuid-database-primary-key/
|
| [2] https://github.com/mklemm/base-n-codec-
| java/blob/master/src/...
| reissbaker wrote:
| IMO it's nice to have two keys:
|
| 1. An auto-incremented 64-bit (unless you have a good reason,
| in which case 32-bit is fine) primary key, used internally
| for foreign key relations. This will generally result in less
| index bloat on associated tables, and fast initial inserts.
|
| 2. A public-facing random string ID. Don't use this
| internally (other than in an index on the table it's defined
| for), since it's large. But this should be the only key you
| expose to end-users, to prevent leaking data via the German
| Tank Problem:
| https://en.wikipedia.org/wiki/German_tank_problem
|
| Only create the second key if this is data you're exposing to
| users, of course -- for data that's only used internally,
| just use the 64-bit auto-incremented PK and skip the added
| index bloat entirely.
| cogman10 wrote:
| For the number 2, I think one issue is that you are going
| to be semi-frequently whacking the db to do a mapping of
| that random string id back to the real id. OK for smaller
| entities but might be a pain if there's a lot of those ids
| to wrangle. You can throw a secondary index on it, but that
| will still have some minor fragmentation issues.
|
| One benefit of a random id is if you are working with more
| complex data models it can make creating those
| easier/faster. Instead of having a centralized location to
| get new ids from (the DB) you can create ids on the fly
| from the application which can turn the write into a single
| action from the application rather than a dance of
| inserting the main table, getting the new id, then
| inserting to the normalized tables.
| caeril wrote:
| That's why the Good Lord invented caching. In most
| applications, 90% of your workload will be over ids less
| than a week old, so your hit rate is likely to be pretty
| high for this sort of mapping.
| cogman10 wrote:
| First hit can be a beast. It's workload/entity
| determinant if caching is enough for this.
|
| Not great if you are spending 1 minute on the first
| lookup just to do the mapping.
| rav wrote:
| Instead of a random string ID, you can devise a fixed
| secret key and expose the auto-incremented ID xor the fixed
| secret key as the public-facing ID. This saves you the
| separate index but still avoids the German tank problem.
| But it gives you a new problem, namely a secret that's hard
| or impossible to rotate.
| cogman10 wrote:
| This is insecure. Assuming the user can get a few key
| examples (which, we assume they would be able to if the
| german tank problem is a problem) then the secret can
| easily be revealed. [1]
|
| [1] https://dev.to/wrongbyte/cryptography-basics-
| breaking-repeat...
| senderista wrote:
| XOR isn't secure enough, but you're on the right track.
| Instead, use an actual block cipher.
| WorldMaker wrote:
| A different approach to solve both 1 and 2 is timestamp-
| oriented IDs. You can get useful cache locality/less "index
| bloat"/fast initial inserts if your keys can be easily
| ordered in time. Sorted by timestamp means very similar
| behavior to B-Tree appends of a monotonic integer, even
| sometimes in the worst cases where "same moment" IDs aren't
| monotonic and rely more on random entropy.
|
| I got some great DB cache/index performance from ULIDs with
| a bit of work to order the ULID timestamp bits in the way
| the DB's 128-bit column "uuid" sort best supported.
|
| Now that UUIDv7 is standardized we should hopefully see
| good out-of-the-box collation for UUIDv7 in databases
| sooner rather than later.
| senderista wrote:
| I don't understand why you need to maintain two separate
| keys: instead of generating a random key, why not just
| encrypt the auto-increment key using a secret key? This is
| the approach used by e.g. cloud providers that use auto-
| increment keys internally but don't want them to be
| guessable.
| cogman10 wrote:
| I think the biggest problem with this approach is it
| effectively pins you to a encryption key and algorithm
| (unless you embed some information in the key that lets
| you version the key, gotta think of that upfront).
|
| Imagine, for example, that you picked DES and "kangaroo"
| as the secret several years back. You are now pinned to
| an algorithm and key with known security problems and a
| weak key.
| echelon wrote:
| This is a great technical modification that can be made to
| work with "Stripe"-alike IDs or tokens.
|
| Another hack for advanced active-active situations where you
| may need to route events before replication completes:
| encoding the author shard / region in the lower order bytes.
|
| There are lots of interesting primary key hacks for dealing
| with physical or algorithmic complications.
| jordanthoms wrote:
| This is dependent on the database you are using - if it's a
| key-sharded distributed database, you want to have insertions
| evenly spread across the key space in order to avoid having
| all the inserts go into a single shard (which could overload
| it)
| cogman10 wrote:
| This is the great thing about using random bits for the
| lower bits. Because you are unlikely to use more than say
| 2^64 database nodes, any sharding algorithm will have to
| figure out how to spread a key with 64 bits (or however
| many bits are in your key) across n nodes.
|
| Because of the random portion of the key, that means you'll
| get good distribution so long as the distribution algorithm
| isn't something stupid like relying solely on the highest
| order bits.
| groestl wrote:
| > A problem with this approach is it's not monotonical
|
| Whether or not that's bad fully depends on your platform and
| the number of writes you do. If you're using a massively
| distributed database like Datastore, Spanner etc, you want
| random keys as to avoid hot spots for writes. They produce
| contention.
| cogman10 wrote:
| Well, you'd still likely want psuedo-random keys. You'd
| rather not have the underlying database doing extra work to
| shuffle around records as the pages get jumbled.
|
| One solution to that is having more complex keys. For
| example, in one of our more contentious tables the index
| includes an account id (32bit int) and then the id of the
| entity being inserted. This causes inserts for a given
| account to still be contiguous (resulting in less
| fragmentation) while not creating a writing hotspot since
| those writes are distributed across various clients.
| groestl wrote:
| Not disagreeing. Point is, you need to know your domain,
| your technology, your write patterns, your downstream
| systems, etc to decide if a specific key scheme works to
| your advantage or not. All the more reason not to use
| natural keys, as they lock you in in that regard.
| cogman10 wrote:
| Absolutely agree.
|
| I don't know how you can successfully maintain or develop
| software without developing an understanding of the
| underlying domain. I've seen devs try that route and the
| quality of their work has never been high.
| hot_gril wrote:
| Depends heavily on what kind of database/index. Going with
| anything other than a random uuid4 adds complexity; for one,
| do you want to expose time info? I'd rather default to uuid4
| as the client-exposed ID* and only change if there's a solid
| measured reason to.
|
| * not the same as your internal DB row primary keys, which in
| Postgres should usually be bigserial
| cogman10 wrote:
| Yes but also no.
|
| Yes, in that there are DB technologies not built in a
| fashion where records are stored in a sorted order of some
| fashion. No, in that they are very much not common
| technologies. Most databases, relational, non-relational,
| etc have some form of a B-Tree at their core somewhere.
| hot_gril wrote:
| I can see this. Spanner is an example where you don't
| want this, idk if that's considered common enough.
| Postgres and MySQL both support hash indexes that are
| unordered, but the default in both is btree, and Postgres
| hash indexes used to have some caveats that made them
| unsuitable (idk about now) so I've gotten in the habit of
| just using the default.
|
| MySQL docs claim that a hash index is much faster if you
| only need kv lookups, so it seems like uuid4 with hash
| index would be suitable. Never tried it though, and can't
| say whether it's faster than using a btree with uuid7.
| Seems like in theory it would be.
| cogman10 wrote:
| As always, depends on the implementation.
|
| Hash maps should generally always have faster lookups
| than Btree based structures. However, they'll have slower
| writes especially when contested. A key issue hash tables
| have to deal with is what happens when a remapping needs
| to happen. For example, when 2 keys have the same hash.
| In that case, locking becomes a lot more messy.
|
| For a btree this is simpler. It's built to be able to
| handle reshuffling and rebalancing in a way that's semi
| easy to have fine grained locks around.
| hot_gril wrote:
| Hm yeah, now I'm wondering how MySQL implements it.
| Consistent hashing is one way to limit the scope of a
| remap.
| giancarlostoro wrote:
| If I'm going to do that I think I'd use Bitcoins BASE58 which
| avoids letters that could be confused for each other. The
| number of times I see an O and 0 and wonder which is which,
| because the font does not make it clear really annoys me.
|
| Edit:
|
| Other honorable mentions: ObjectID's as used by MongoDB which
| contain the creation timestamp. Also Discord's snowflakes
| (inspired by Twitter's iirc), which also contain the creation
| timestamp.
| kibwen wrote:
| If you want random IDs to be human-readable (and human-
| communicatable), I'd just recommend base 32 or even base16.
| You don't actually save that many bytes from base58 or base64
| when it comes to short IDs.
|
| Case in point, the parent poster's base64 ID is 14 characters
| long. When encoded as base32 that's still only 17 characters
| (or 19 in base16), and now you have completely gotten rid of
| all notion of casing, which is annoying to communicate
| verbally.
| giancarlostoro wrote:
| I think the more I think of it, the more I favor the
| Discord snowflake ID's because they're just integers, and
| they can be generated on the fly. I think new messages
| generated them on the client if I'm not mistaken.
| cimnine wrote:
| Yes, and I like to combine two established concepts instead of
| rolling my own: URI and UUIDv7. So my IDs become `uri:customer_
| shortname:product_or_project_name:entity_type:uuid`. An example
| ID could be `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6
| f-a09c-74b9ef7f4196`. It's even possible to cascade such IDs,
| for example: `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b
| 6f-a09c-74b9ef7f4196:line_item:018fe882-43b2-77bb-8050-a1139303
| bb65`.
|
| It's immediately clear, when I see an ID in a log somewhere or
| when a customer sends me an ID to debug something, to which
| customer, system and entity such an ID belongs.
|
| UUIDv7 is monotonic, so it's nice for the database. Those IDs
| are not as 'human-readable' for the average Joe, but for me as
| an engineer it's a bliss.
|
| Often I also encode ID's I retrieve from external systems this
| way: `uri:3rd_party_vendor:system_name:entity_type:external_id`
| (e.g.
| `uri:ycombinator:hackernews:item:40580549:comment:40582365`
| might refer to this comment).
| michaelt wrote:
| _> It 's even possible to cascade such IDs, for example: `uri
| :cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7
| f4196:line_item:018fe882-43b2-77bb-8050-a1139303bb65`._
|
| Let me guess - you're a Java developer, right?
| phkahler wrote:
| Something like that should have a built in check "digit" if
| people are going to see it and possibly type it in.
|
| For numeric values, making them all a multiple of 11 is a
| simple way to catch all single digit errors or single
| transpositions.
| chuckadams wrote:
| That's why VINs make a decent natural key, because they do
| have a check digit. Plus they're not opaque: if you look up
| the VIN and the make/model/year is completely different than
| the car in front of you, you know you either have the wrong
| VIN or the wrong car.
| psychoslave wrote:
| To my mind, it always felt so saddening that adoption of a
| truly straightforwardly readable notation for numbers never
| took of. I mean it's so easy to do. You can start for example
| with a single syllable per digit, and for example only target
| CV syllables.
|
| From this there is many possibilities, but for example, let's
| consider only a base ten. Starting with vowels order o, i, e,
| a, u with mnemonic o, i graphically close to 0, 1 and then
| cyclically continue the reverse order in alphabet (<-a, <-e,
| <-i, <-o*, |-u). We now only need two consonants for the two
| series of 5 cardinals in our base ten, let's say k and n.
|
| So in a quick and dirty ruby implementation that could be
| something like: $digits = %w{k
| n}.product(%w{o i e a u}).map{it.join('')} def
| euphonize(number) =
| number.to_s.split('').map{$digits[it.to_i]}.join('-')
| euphonize(1234567890) # => "ki-ke-ka-ku-no-ni-ne-na-nu-ko"
|
| That's just one simple example of course, there are plenty of
| other options in the same vein. It's easy to create "syllabo-
| digit" sets for larger bases just adding more consonants, go
| with some CVC or even up to C0C1VC0C1 if sets for C0 and C1 are
| carefully picked.
| psychoslave wrote:
| And of course a naive implementation of the reverse is also
| trivial: def numerize(euphonism) = euphonism
| .split(?-).map{$digits.find_index(it)}.map{it.to_s}.join.to_i
| jmcphers wrote:
| There is a old system for making numbers pronounceable as
| words using a mapping from each number to a consonant value.
| It's typically used to help memorize numbers:
|
| https://en.wikipedia.org/wiki/Mnemonic_major_system
| samatman wrote:
| This is how the Urbit address scheme works, btw:
| https://urbit.org/blog/the-urbit-address-space
| mkleczek wrote:
| The question is: if you have two restaurants having different
| scores but distinguishable only by their surrogate keys, how do
| you know which one to go to?
|
| In other words - using surrogate key is an attempt (and the wrong
| one!) to fix the problem of missing important information in the
| database.
| big_whack wrote:
| If you use a surrogate key, you still need a unique constraint
| in the table (probably the same columns you would otherwise
| call your natural PK). If your unique constraint isn't
| sufficient to capture the difference you mention, you need to
| add more columns.
|
| However, that's strictly better than the natural PK situation,
| where you would need to not only add new columns to the key,
| but also add those columns to all referencing tables.
| mkleczek wrote:
| > However, that's strictly better than the natural PK
| situation, where you would need to not only add new columns
| to the key, but also add those columns to all referencing
| tables.
|
| Foreign keys referencing surrogate key has different
| semantics than fk referencing natural key - it is a can of
| worms actually and can lead to unexpected anomalies.
|
| Lets take the example from the article (with surrogate key):
|
| Restaurant(id, name, city)
|
| Now let's add a possibility to record visits:
|
| Visit(restaurant_id references Restaurant(id), user, date)
|
| We have a procedure to register visits to a restaurant:
|
| register_visit(restaurant_name, user_name, date_of_visit) {
| INSERT INTO visit SELECT id, user_name, date_of_visit FROM
| restaurant WHERE name = restaurant_name }
|
| I very much enjoy spending time in "Polish Kielbasa"
| restaurant in Warsaw and I visit it everyday - I don't visit
| any other restaurant at all.
|
| Now changes of a restaurant name will lead to the database
| containing misinformation:
|
| register_visit('Polish Kielbasa', 'mkleczek', 2024-6-4);
| update restaurant set name = 'Old Kielbasa' where name =
| 'Polish Kielbasa' and city = 'Warsaw'; insert into restaurant
| ('Polish Kielbasa', 'Warsaw'); register_visit('Polish
| Kielbasa', 'mkleczek', 2024-6-4);
|
| Question: what restaurants did I visit this year?
|
| This kind of anomalies are avoided using _natural_ keys and -
| first of all - defining proper _predicate_ for _each_
| relation.
|
| The predicate of relation visit(restaurant_name, city, user,
| date) is quite obvious: "User [user] visited restaurant
| [restaurant_name] in [city] on [date]"
|
| Question: What is the predicate of relation
| visit(restaurant_id, user, date)?
| begueradj wrote:
| One of the most interesting part to me is the fact he is a
| university professor in a good country but rides a used/second
| hand car... in an era where status and success are highlighted by
| displaying what one buys and owns on social media.
| vanderZwan wrote:
| Highlighted by _whom_? University professors?
| mobiuscog wrote:
| > in an era where status and success are highlighted by
| displaying what one buys and owns on social media
|
| This is no different from the people who used to show off their
| rolex in the pub, or park their BMW on display thinking it was
| impressive.
|
| The 'show' may have moved, but it's still the same people who
| gain the same amount of 'respect' that they ever did (not
| much).
|
| Besides, I would guess that most professors are not so worried
| about their car (and in fact many may prefer a more sustainable
| mode of transport) as they are to their research and citations,
| and the advancement of their students.
| lordnacho wrote:
| Cars in Denmark carry a huge tax, it's not unusual that you can
| only afford a used car. Plus public transport in the cities is
| pretty ok.
| rsynnott wrote:
| > in an era where status and success are highlighted by
| displaying what one buys and owns on social media.
|
| I mean, not for everyone? These things, pretty much regardless
| of era, matter a lot to some people and not at all to others;
| for still others it may be a negative signal of sorts.
|
| Anecdotally, I know a lot of very well-off people; well under
| half would have fancy cars.
| fuzzfactor wrote:
| This is an educated person who shows higher-class tendencies by
| riding a bicycle, compared to those who seek mere automotive
| eminence.
|
| Even if the bicycle is older than the car, perhaps even more
| so.
| iso8859-1 wrote:
| How it is possible for an article from 2024 to have a comment
| from 2009?
| jimbobthrowawy wrote:
| I think it's a joke about natural keys, based on the date added
| to the comment:
| https://github.com/ploeh/ploeh.github.com/commit/6149bcdf0e2...
|
| I think the date was chosen because that's when the blog
| originated after the author left microsoft.
| lysecret wrote:
| Again, every time I have seen natural keys in a big long lived
| system I have seen duplication. This is something you will have
| to deal with.
| porker wrote:
| As with everything, natural keys are sometimes useful or even
| good. Where business rules fit, I like using them. But you have
| to know that the value will never change or be reused.
| mootothemax wrote:
| You can get performance benefits from using natural keys, as many
| databases store rows in the order of a table's primary key
| (sometimes called the _clustered_ index, though it may or may not
| have a unique constraint requirement depending on the DBMS and a
| few other bits).
|
| In the author's example, if the first column in the natural index
| was the city name (or city ID!), and locations are often pulled
| from the database by city, you'll see a read time performance
| benefit because each cityName's restaurants will be stored
| together.
|
| This is why UUID-based systems can suffer worse read + write
| performance; their rows will be stored in the order of its UUIDs
| (that is, randomly spread around), making read, insert, and
| update performance lower.
|
| What to do? I favour a mixed approach: have a unique integer ID
| column used internally, expose a unique UUID to the public where
| necessary and - with a BIG DO NOT OPTIMIZE PREMATURELY warning -
| really think about how the data is going to be queried, updated,
| and inserted. Ask - does it makes sense to create a clustered
| index based on the table data? Is there enough data to make it
| worthwhile? Where can the index fail if changes need to be made?
| Under some circumstances, it might even make sense to use a
| natural key with the integer column included right at the end!
|
| The only hard rule I have is using UUIDs for clustered indexes.
| Unless the tables are teeny-tiny, the system is most likely
| suffering without anyone being aware of it.
| williamdclt wrote:
| > many databases store rows in the order of a table's primary
| key (sometimes called the clustered index [...])
|
| Note for readers: Postgres doesn't do that
| mootothemax wrote:
| >> many databases store rows in the order of a table's
| primary key (sometimes called the clustered index [...])
|
| >Note for readers: Postgres doesn't do that
|
| And I am climbing out of the rabbit hole that is Postgres'
| CLUSTER keyword :) Really funky, looks like it's a one-time
| operation that isn't persisted?
|
| Looks like Postgres stores its index entries in ascending
| order by default (NULLs at the end), and if so, the point's
| worth keeping in mind on that front too.
|
| I _really_ need to do a deep dive on Postgres internals one
| of these days, it 's an amazing - and different! - system its
| developers have created.
| sgarland wrote:
| Nope, but it does have a Visibility Map due to its MVCC
| implementation. Go ahead and do an index-only lookup on a
| table using a UUIDv4. Then repeat with something k-sortable.
| See how many buffer hits each one has.
|
| I assure you, those add up. Nothing like explaining to a dev
| that they have hit the theoretical memory transfer limit of
| the hardware, and it cannot possibly go any faster unless
| they refactor their schema (or a faster DDR version comes
| out).
| nicce wrote:
| > This is why UUID-based systems can suffer worse read + write
| performance; their rows will be stored in the order of its
| UUIDs (that is, randomly spread around), making read, insert,
| and update performance lower.
|
| This isn't always the case anymore. UUID standards have been
| developed in that mind, and they are not completely random
| anymore. UUIDs can give a hint, for example, about the time
| when it was created, which gives them some order.
| usrbinbash wrote:
| The problem is: As soon as a synthetic key is used by an
| _EXTERNAL_ system, it stops being synthetic; for all intents and
| purposes it is now an integral part of the record, because if you
| change it, the external system may not be able to find the record
| any more.
| kstenerud wrote:
| In databases, never rely on data you don't control. "Natural"
| keys are an example of this.
|
| Names can be natural keys, but you don't control them. You don't
| control when or how a name changes, or even what makes a valid
| name.
|
| Addresses change. Or disappear. Or somehow can't be ingested by
| your system suddenly.
|
| Official registration numbers (SSNs, license plate numbers,
| business numbers etc) seem attractive, but once again you don't
| control them. So if the license plate numbering scheme changes in
| some way that breaks your system, too bad. Or people without an
| SSN. Or people in transition because an SSN needs to be changed
| in a government system somewhere. Or any other number of things
| that happen in a government office that affect you, yet you have
| no control over.
|
| Phone numbers? Well, we've already seen that mess with many
| messenger platforms.
|
| Fingerprints? Guess what? They evolve over time, and your system
| will eventually break.
|
| Retrofitting a system that relies on "natural" keys that have
| broken SUCKS.
|
| Use a generated unique key system that YOU ALONE control.
|
| The first rule of software design is: Don't try to be clever.
| You're not clever enough to see all of the edge cases that will
| eventually bite you.
| wvh wrote:
| I agree with your general point, but it might still be cheaper
| to redesign your system if a rare breaking change happens to
| usually reliably stable external registration systems, rather
| than to pay the cost of a thousand paper cuts by indirection,
| computational cost or human confusion.
|
| One edge case and good indicator is if your system is testable
| by itself without third party cooperation and the ability to,
| for instance, create license plate numbers.
| kstenerud wrote:
| Every system that uses SSN as the primary key is fully
| testable by itself without third party cooperation. And yet
| every one of these systems has required workarounds because
| what the designers thought to be invariants, weren't.
|
| And this is ALWAYS the case with identifiers that you don't
| control. You don't make policy decisions about them, but
| SOMEBODY does. And that somebody isn't aware of - and
| wouldn't even care - about the invariants that you ASSUMED
| they followed (and maybe they really did follow them, but
| they sure don't anymore! Oops...)
|
| Fixing broken invariants after the fact is always a
| nightmare, because it only comes up once you get stuck -
| either you can't enter something into the database that you
| absolutely MUST by the weekend, or you can't change something
| in the database that you absolutely MUST by the weekend. So
| you do some last minute hacks to get things kind of working,
| and then it works for awhile until the next problem (usually
| involving your hack).
|
| It's hardly any extra work or complexity to just use an ID
| generator for the primary key. You'll still have the same
| indices, the same foreign key linkages etc. You have no
| reason not to do this.
|
| And yet somehow people always seem to fall for the "Oh cool!
| This existing ID does everything we want! Let's just use that
| instead of adding one more field to the table! I'm so
| clever!"
| jimbobthrowawy wrote:
| Licence plate numbers are an interesting one, since what those
| mean varies from country to country. Here (Ireland), they are
| assigned to the car itself via VIN, are _never_ meant to change
| once assigned, and are backdated based on information about the
| vehicle itself (e.g. year of first registration even if first
| registered in another country, following an older format if
| applicable), but in other countries they can be reassigned to
| other vehicles.
|
| e.g. 06-LK-12345 might be your car's plate if you move your
| foreign 2006 car here and register it while living in Limerick,
| but buying a new car might give you a plate like 242-L-12345
| since the format of the first two fields changed since. If you
| leave and later return with that car, re-registering it gives
| you the exact same number.
|
| https://en.wikipedia.org/wiki/Vehicle_registration_plates_of...
| akira2501 wrote:
| You don't have vanity plates in Ireland?
| ThePowerOfFuet wrote:
| Many, many countries do not.
| cheema33 wrote:
| Vanity plates are essentially free money for the govt. I
| am surprised every govt. body doesn't go for it. If you
| really don't like them, make them super expensive.
| rsynnott wrote:
| Tbh when you factor in the administration, I'd question
| whether they actually make any money on these.
| Macha wrote:
| There aren't any, the closest is that some authorities
| reserve low sequence number plates for e.g. the local mayor
| jimbobthrowawy wrote:
| You can get vanity numbers, but 1 is reserved for the
| region's mayor, and anything 1000 or below is assigned
| based on a raffle system. Other than that, it's first come
| first served.
|
| It can only go on that one car though, so there's not as
| much value in doing it. I've mainly seen bus rental
| companies do it.
| andrelaszlo wrote:
| Official registration numbers, such as Swedish personal
| identification number, or "personnummer" (date of birth +
| serial + checksum [Luhn], where even serials are used for
| females and odd for males):
|
| - It can take a few days before a newborn is assigned a number
|
| - Non-citizens don't have one, but they can get a coordination
| number on the same format but with the date part incremented by
| 60 days.
|
| - Citizens can have both a coordination number and a personal
| identification number in certain cases.
|
| - They can be changed if the wrong birth date or gender
| registered at birth or during immigration, for protected
| identities, or for gender transitions.
| Aaargh20318 wrote:
| In the Netherlands we have something similar: the BSN
| (translated: citizen service number). This used to be the
| social security number, but because it was used by an
| increasing number or government agencies unrelated to social
| security it was changed to BSN.
|
| One of the major problems with this number is that it has a
| special status under the law. There are very strict rules as
| to who can process and/or store this number and for which
| purpose. For example: a bank can process this when opening a
| bank account, under anti money laundering regulations, but
| they cannot use it to identify an existing customer.
|
| If you originally set up your database to use SSNs you now
| have a problem. This actually happened with our chamber of
| commerce: if you registered a one-man business they used the
| SSN as the business id and you're required as a business to
| publish this. Now it's suddenly a number that is subject to
| strict privacy rules and they have to renumber all one-man
| businesses.
|
| So that's another problem with data you don't control: the
| legal status of this data can change.
| wasmitnetzen wrote:
| Nitpick: long-term residents get a personnummer as well, not
| just citizens.
| nailer wrote:
| I was this recently. I moved to US in November and it was
| February before I had an SSN. A bunch of companies had to put
| in fake SSNs into their system, which they have a standard for.
| zer0tonin wrote:
| Yes, except sometime whole industries come together already has
| developed a system of unique identifier, and I can just use this
| as a natural key. For example, the Nasdaq stock exchange is
| identified perfectly fine by the string 'XNAS', I'm not going to
| generate an ID on top of it.
| canterburry wrote:
| I don't think I have seen natural keys used in a database since
| the 90s.
|
| While I too was taught natural keys, or combined keys, were the
| "intended" way to identify data, I was corrected very quickly
| that artificial primary keys were much more reliable and more
| convenient due to all the auto increment features etc. I am
| surprised this is even talked about anymore.
| jbverschoor wrote:
| The amount of inexperience is staggering
| nuancebydefault wrote:
| I'm a bit confused. I've never programmed databases, though i
| have a lot of programming experience.
|
| The student made a structure that can be seen as a unit or table
| in a database. Is the database not clever enough to synthesize a
| unique id? Why does the programmer need to care about keys?
|
| It is clear that the end user will not know anything about a key,
| it might be a byte offset in a file or a memory pointer. Why does
| the programmer need a key?
| alanhaha wrote:
| ~~You'll regret~~ I regretted
| contingencies wrote:
| My traditional schema was always to have _id_ in every table and
| to make it an integer, then for pretty much anything it 'd be two
| tables, _nouns_ for the thing and a one-to-many relationship with
| _noun_names_ for the names, where _noun_names_ was comprised of
| _id /noun/language/name_, keyed on _noun_ and _language_ , with
| the _language_ column FK 'd to _languages.id_. I 'd also have a
| _languages.code_ column for ISO or IANA code (sometimes both
| columns or aliases), and a _language_names_ table with
| _language_used /language_named_. So you could name English in
| Frisian, and Chinese in Tibetan. Haven't needed to build SQL in
| ages, though that schema works great. Certainly never had an
| issue with it once UTF-8 became standard. If you build schemas
| without i18n, you're asking for trouble.
|
| One good thing about using a standard integer key is the cost of
| indexes is low (low memory use). One good thing about using _id_
| everywhere is that it 's short and self-explanatory to
| programmers from any culture.
|
| Lots of casual queries built of subqueries like...
|
| _select * from noun_names where noun=(select id from nouns where
| ...) and language=(select id from language where code= 'en');_
|
| Always felt this was the most readable. Always felt that
| LEFT/RIGHT JOIN stuff was bonkers. Onboarded a lot of serious
| junior devs, never had an issue.
| praptak wrote:
| Modeling identity doesn't end with synthetic keys though.
| "Generate own keys" only solves the problem of identity
| ownership. You now own the identity but it doesn't mean your
| identities are as they should be.
|
| Say a customer used a different government-issued ID to re-
| register with your bank. A year down the line you notice that you
| have two identities for the same person. It might be a meh for an
| online game but if you are a bank this can make you run afoul of
| regulations. Can you handle the merge of all the relevant data?
| And merging is usually the easier of the two glitches - can you
| handle a split?
|
| The point is that identity just like security requires thought
| from the start of the design.
|
| For a domain where identity is really hairy (although admittedly
| with less consequences for screwing up) see
| https://news.ycombinator.com/item?id=4493959 "The music
| classifying nightmare". Also
| https://en.wikipedia.org/wiki/Identity_(philosophy)#Metaphys...
| for some philosophical perspective.
| meekaaku wrote:
| I am firmly in the control your primary key, preferably uuid.
|
| Quick question for database gurus here. Is it ok to have a
| currency table using the currency code as primary key?
| brabel wrote:
| I think long term, that is a bad idea.
|
| In Brazil in the 80's and early 90's, the currency changed name
| several times. It was called "cruzado", then "cruzeiro", then
| "cruzado novo" (yep, "new" <old-name>, very creative) and then
| I think it went back to just "cruzado" :D before finally
| becoming Real (which I believe was the name of the currency
| also during Monarchy, 100 years earlier).
| kassner wrote:
| Likely no, there are currencies without an ISO 4217 code, i.e.:
| https://en.wikipedia.org/wiki/Faroese_kr%C3%B3na
| zarzavat wrote:
| This is interesting but it doesn't really seem like a
| separate currency. It's more like Scottish pound vs English
| pound, it's the same currency but different banknotes for
| different regions in the same state.
| bjourne wrote:
| Hard disagree. You create tables for them to be queried by
| applications. For the Restaurants example the application clearly
| wasn't designed to handle Restaurant entries whose only field
| differs is rank. Consequently, you shouldn't allow data the
| application can't handle. Whenever the time comes and you think
| "ah, wouldn't it be great if multiple people could review the
| same restaurant in the same year?" you change the natural keys.
| You can always relax the rules. Tightening the rules is much
| harder because garbage data may already be present.
|
| Keys are about the integrity of your application(s) and
| preventing corner cases by making them impossible.
| konschubert wrote:
| You can use a UUID as a primary key and still enforce
| uniqueness on a combined index of city and name.
| nkozyra wrote:
| If they conflict and are user facing identifiers aren't you
| then forced to add uniqueness to the city and name, ala
| /chicago/chipotle_s4, potentially bleeding some business
| details?
|
| I don't love uuids as public identifiers for a number of
| reasons but not hinting details about your data is one nice
| thing about them.
| konschubert wrote:
| Hmm, I am not sure if I get you.
|
| The cool thing about enforcing uniqueness on a secondary
| index is that you can just change or remove the uniqueness
| constraint anytime without breaking foreign key relations.
| nkozyra wrote:
| I'm not sure why you want a uniqueness index then in this
| case. If it's not publicly visible (effectively a
| secondary index / reference), then you can't use it as an
| identifier.
|
| This means you need something else (a slug, as discussed
| herein) or just use the ID (bad practice in general) or
| UUID (bad for humans). If you use - in this example -
| city and company name - you still have to enforce
| uniqueness, so you have a pseudo-slug anyway.
| bjourne wrote:
| But what's the point? You're complicating the data for no
| apparent gain. Only terrible RDBMSes which don't support
| multi-column keys require surrogate keys.
| konschubert wrote:
| Because you can remove or change the uniqueness constraint
| without having to worry about foreign key relationships.
| bjourne wrote:
| Eh, that's a feature of natural keys. If you try to fuck
| up your model your RDBMS will complain.
| zarzavat wrote:
| You can't change primary keys, that's the point, because you
| don't know _where_ they are.
|
| For example if an old key is in a URL, and that URL is in a
| browser bookmark, now you need redirects, so you need to keep
| all the old keys around forever. Keys should be random or
| sequential, never contain information.
|
| If you want to enforce uniqueness then use a unique
| index/constraint.
| bjourne wrote:
| I think you are confused about the terminology because a
| primary key _is_ a uniqueness constraint. They can be changed
| in any RDBMS worth its salt. Keys should not be random and
| your URL example is a case in point. The url
| /germany/berlin/2023/mcdonalds contains no surrogate key and
| is immensely more useful than the url
| /review?uuid=1kjksdhh3244ygdvvgdd2345.
| zarzavat wrote:
| A primary key _implies_ a uniqueness constraint. But you
| can have a uniqueness constraint without a primary key.
|
| You can change primary keys within a database but you can't
| change them outside the database. A database is a map but
| keys are part of the territory.
|
| The second URL with the UUID is more secure and is
| preferred in many situations where information leakage is a
| concern.
|
| The first URL is more descriptive but more prone to
| breakage. It depends on the situation.
| bjourne wrote:
| A key is any set of unique columns. The primary in
| primary key just hints to the RDBMS how to store the
| data. Both URLs encode queries on the data. The first
| says "give me the 2023 review(s) of McDonald's in Berlin"
| (i.e., it has semantics) and the second "give me the row
| with uuid 1kjksdhh3244ygdvvgdd2345". I don't think the
| second URL is more robust because, for example, if the
| row has to be recreated it will fail to work.
| fiatjaf wrote:
| Such a long article, no convincing examples. Why did he waste so
| long on the stupid restaurant case that clearly makes no sense?
| Natural keys are good as long as you think one minute before
| deciding to use them.
| bdcravens wrote:
| Another surprising type of data that people think are unique are
| tracking numbers. Some carriers (like UPS) do recycle them.
| shireboy wrote:
| In the example from the article I think the table should be named
| RestaurantRank(year,rank,restaurantid), with a natural key
| year,rank. It would have a fk to Restaurant(Id,name,etc). Choose
| a natural key if there is some natural unique field or
| combination. In this case unless restaurants can tie in their
| ranking year,rank can be unique
| noisy_boy wrote:
| You'll have to modify the natural key if later it is decided to
| add rankCategory ("best overall", "best ambience", "best value"
| etc) via rankCategoryId foreign key to the RestaurantRank table
| because a given restaurant can have the same rank in the same
| year under multiple categories. A synthetic key avoids you
| having to mess with the key to handle such cases.
| tigerBL00D wrote:
| This is a bit contrived to illustrate a problematic case. Every
| business has a unique ID that's assigned to it before it can
| operate. EIN or even a state id together with state would be a
| fine natural key for most situations.
| d--b wrote:
| > what if we had a restaurant named China Wok in Springfield,
| USA? Hardly unique. Add the state, you say? Probably still not
| unique.
|
| Ok sure, but then you have 2 restaurants which are
| indistinguishable from one another in your database. It doesn't
| matter that the thing has a unique id next to it. You can't know
| which is which. That's not useful.
| sashokbg wrote:
| I disagree with this article. The author takes some very edgy
| cases such as "undergraduate student" and "transgender CRP
| number".
|
| Even the example for the restaurant and "time based id number"
| are bad because they all indicate that you just badly identified
| the entities (in DDD terms) and their identity.
|
| So being bad at DDD doesn't mean that you can't use natural keys
| (although I myself have arguments against them)
| nostrademons wrote:
| There's a better solution for many of the exceptional cases that
| the author describes: aliases & audit logs.
|
| Take for example the Danish CPR number. That's perfectly fine as
| a natural key; its definition is the _first_ CPR number assigned.
| If a person 's CPR number changes because they've changed their
| gender, you will want a separate table recording a.) the date of
| the change. The new CPR number is not valid before that time b.)
| the new gender c.) probably the _reason_ for the CPR number
| change, since if the policy now is that they can change because
| of a gender change, there 's a decent chance they'll be some
| other policy in the future that results in a new CPR issuance.
|
| Or the chassis number. Also fine as a natural key. If it's
| changed because of a data-entry error, you also want to record
| a.) the date of change b.) _who_ changed it. This opens up a
| whole host of auditing, monitoring, and reporting functionality
| that eg. lets you catch fraud, determines if a single person is
| being sloppy, identify mass changes in policy, notify and update
| external records of owners, etc.
|
| URLs are another good natural key: they are defined to be unique
| (otherwise your webserver won't work), they make for very easy
| lookups when you're fetching from a web request, and if they
| change, they break the web. Except that they do change. But when
| a URL changes, _you don 't want to just update them in the
| database everywhere_, because again, that will break the web. You
| want to leave a redirect from the old to the new one. So you
| create a redirects table of all the other aliases that point to a
| given page, use it to generate server redirects, and you can
| throw in other data like the time of change or hit counts on each
| individual alias.
| tapland wrote:
| It's not really fine. I work on a Healthcare system in the
| nordics.
|
| Who you billed, who visited what doctor, who your primary care
| provider is, all the people a doctors office has as patients,
| refers to the SSN.
|
| You don't want to lose that connection or to have to update
| everything for any change.
|
| You store a unique identifier for the person in the system, and
| you can then pull the actual personal identification number
| when needed.
|
| You do not keep individual private lists of people changing
| genders.
| tapland wrote:
| > You do not keep individual private lists of people changing
| genders.
|
| On this note: Some organization in Sweden actually does store
| changes like suggested in the top comment. Both the tax
| agency and police were interested in knowing who did, and who
| had access, since it's the cause of recent mass doxxing
| events.
| nikita2206 wrote:
| But there's no natural key for a person, not even the SSN.
| This article also starts off by showing an example where no
| natural key exists (Restaurants) and acknowledges it.
| ironchef wrote:
| You appear to be describing a classic type 2 or type 4 slowly
| changing dimension.
| (https://en.wikipedia.org/wiki/Slowly_changing_dimension )
| lolinder wrote:
| I'm a fan of audit logs, but what you're describing is more
| event sourcing than audit logs. From what I understand what
| you're suggesting is that you don't change the key in the
| original table, you just record the change in a second table.
| Presumably, then, when you need to read a row you must also
| look up the list of diffs for that row to make sure that the
| natural key hasn't changed.
|
| Two things strike me about this proposed model:
|
| First, under your proposal the version of the model that you
| work with at the application layer _must_ have two copies of
| the key field--one is the database key for when you need to
| make changes or look up more data and one is the meaningful
| business-model field that 's actually up to date. That's
| exactly the same extra mental overhead that natural keys are
| supposed to have solved, only worse because the fields will
| have similar names and will contain the same content most of
| the time, making it easy to accidentally use one where the
| other was expected.
|
| Second, if you're going to introduce effectively an event-
| sourced data model then you've introduced a ton of new database
| records already, so why not just give everything a proper
| unique key while you're at it? Once you've done that you can
| modify the original row after all (while retaining the audit
| logs!) or, if you're serious about event sourcing, cache the
| latest derived value and look it up by its database key instead
| of carrying around an out of date natural key that's just
| waiting to be used incorrectly.
| phkahler wrote:
| >> If a person's CPR number changes because they've changed
| their gender, you will want a separate table recording a.) the
| date of the change. The new CPR number is not valid before that
| time b.) the new gender c.) probably the reason for the CPR
| number change, since if the policy now is that they can change
| because of a gender change, there's a decent chance they'll be
| some other policy in the future that results in a new CPR
| issuance.
|
| But what if you have a bunch of records in some other table -
| billing information for example, and it's indexed by the CPR
| number (a foreign key). When they change the CPR number you can
| no longer query for their entire billing history based on CPR.
| None of your proposed extra complexity does anything about this
| problem. The only good way to solve it is to use a synthetic
| key with NO meaning. It would still be good to do as you say
| and track all the CPR number changes for a given person, but
| they will still need a unique key. So a sort of "identity
| table" used to figure out what unique key you're dealing with.
| wild_egg wrote:
| Why could you not join with the audit table and find
| historical billing information from the old CPRs?
| lolinder wrote:
| You could, but that introduces a lot of complexity that
| would be saved by just giving the customer a synthetic ID
| that uniquely and stably identifies them in your system.
|
| The grandparent's proposal basically turns the first-
| entered CPR into a meaningless ID field that should not
| generally be used in a piece of business logic (unless for
| some reason you need to display "first CPR they entered").
| Once you've declared that you should look elsewhere if you
| actually need the CPR, why did we even bother using it as
| the primary key?
| phkahler wrote:
| >> Why could you not join with the audit table and find
| historical billing information from the old CPRs?
|
| 1) complexity and 2) what is actually tying the CPRs
| together? We're not going to have a CPR table per-customer,
| so all the CPRs of every customer are in the same table.
| Presumably the CPR table has a unique key for the customer
| that can be used to associate multiple CPRs with that
| person, so we have come full circle - just use that unique
| key in the audit table.
| nostrademons wrote:
| You need this complexity anyway - if you had surrogate keys,
| the primary key would be an opaque identifier, and to do
| anything involving the CPR, you'd need to join against your
| CPR table (which, again, needs to be 1:many because CPRs
| themselves are not a 1:1 relationship).
|
| The first CPR in this case becomes identical to your
| surrogate key - it's an opaque ID that you use to reference
| other tables in the DB - but with the added benefit that _for
| the common case, you don 't need any additional lookups_. You
| only need to lookup CPR changes if you don't find the CPR
| that the user gave you. And then it has other added benefits
| in that you have a record of CPR changes, you can understand
| how common a case this is, the CPR change table itself has
| semantic meaning and you can query a wide variety of
| properties without joining your primary user table, etc.
| shkkmo wrote:
| > but with the added benefit that for the common case, you
| don't need any additional lookups. You only need to lookup
| CPR changes if you don't find the CPR that the user gave
| you
|
| Doing the CPR lookup ONLY when you don't find a record is
| really, really stupid. All you need is one bad cache, one
| database update that go rolled back and suddenly you've
| lost all recent updates the the user and will only see the
| bad record with no indication you've failed.
|
| Trying to use force changeable data into being a 'natural
| key' means that the number of edge cases you have to
| predict, write around and test for is going to rise
| significantly for no real benefit.
| lolinder wrote:
| > The first CPR in this case becomes identical to your
| surrogate key - it's an opaque ID that you use to reference
| other tables in the DB - but with the added benefit that
| for the common case, you don't need any additional lookups.
| You only need to lookup CPR changes if you don't find the
| CPR that the user gave you.
|
| I'm not buying that this is a meaningful gain in either
| performance or code complexity.
|
| In the world of synthetic keys I look up the CPR in the CPR
| table and join it to the user table using the synthetic ID.
| If I find a record for the CPR+user join then I'm set, if I
| don't then the customer doesn't exist.
|
| In the world of natural keys, you're advocating that I
| first query the user table directly by CPR. Then if I turn
| nothing up I run a separate query with a join on the
| original CPR. Then if I still don't turn something up the
| customer doesn't exist.
|
| The code in the second instance is obviously more
| complicated than the code in the first instance. It has
| increased risk of someone writing a bug because now there's
| a very tempting CPR field that will be right most of the
| time but wrong in some edge cases. Depending on the
| database and usage patterns, indexing on the CPR may be
| much less efficient than indexing on an autoincrementing
| integer.
|
| The only thing it has going for it is that I _might_ be
| able to avoid a single join on a very specific path where a
| user is looking up a customer by typing in the CPR. That
| seems like the wrong thing to optimize for in the face of
| all the downsides.
| trevor-e wrote:
| URLs can be tricky and have plenty of gotchas depending on what
| you're trying to do. For example, the order of query params is
| free to change but it's still the same URL. Nothing that can't
| be worked around with a little normalization.
| Karellen wrote:
| > For example, the order of query params is free to change
| but it's still the same URL.
|
| Are you sure that's guaranteed by any spec? I thought an end-
| point would be free to treat `?a=1&b=2` and `?b=2&a=1`
| differently. I mean, it would be a nightmarish
| implementation, but I don't think it would be non-conforming?
| Piskvorrr wrote:
| It's a mere convention: "a querystring of this type is an
| array of parameters, the order of which is irrelevant" -
| this is one of the tautological "valid except when it's
| not" non-rules (a surprising number of cases).
|
| Example: ?a=1&b=2&a=3 - will the server treat this to be
| equivalent as ?a=3&b=2&a=1 , or ?b=2&a=1 , ?b=2&a=3 , or
| something else entirely? You'd need to check the serverside
| parsing implementation to be sure - those could even be
| valid (and distinct) _filenames_ FWIW.
|
| (And that's before you get to caching - "?b=2&a=1 is not to
| be served as a cached version of ?a=1&b=2")
| Pengtuzi wrote:
| > URLs are another good natural key: they are defined to be
| unique
|
| No, see more info here: http://localhost:8080/
| shkkmo wrote:
| > Take for example the Danish CPR number. That's perfectly fine
| as a natural key; its definition is the first CPR number
| assigned.
|
| This is a horrible idea. You now have two different pieces of
| information that are identical in form and indistinguishable in
| the majority of cases: "first CPR number" and "current CPR
| number. Every place you enter, or use a CPR number, you now
| must track which piece of information you have. If you make a
| mistake doing this, it will be hard to catch. Now you've
| decided that first piece of information is a natural key and
| will be every single place the record is used, even if that
| spot doesn't do anything specific to the CPR number. Every
| single time a CPR number is ingested from an exterior source,
| you need to do a lookup to make sure it is the original CPR
| number and then track it. Ever place you don't do this lookup
| is a place where an error can creep in if something changes in
| your pipeline our source.
|
| Even in an context where external sources are using something
| like a CPR number as an key, I would still use a different key
| internally since I see only downsides to using a "natural key".
|
| > URLs are another good natural key: they are defined to be
| unique (otherwise your webserver won't work), they make for
| very easy lookups when you're fetching from a web request, and
| if they change, they break the web
|
| URLs are also horrible natural keys. They are not defined to be
| unique and provide no guarantee that the content has not
| changed or even that the same content is sent to different
| users.
|
| If the location for content changes, you may or may not get a
| redirect. If you do get a redirect, you'd have to now go update
| every single place that uses the key to the new value. It is
| much better to map URLs to an artificial key and update that
| mapping in a single place.
| IggleSniggle wrote:
| URLs are only really _uniform_ not necessarily _unique._ The
| most obvious case is something like
| `http://localhost/file.txt`, where the resource being located
| is almost sure to be different on every single "localhost," but
| this is true of _any server_. The pointer is neither unique nor
| has any true guarantee that the resource being pointed to can
| or should be considered unique for any given context. It is
| merely unique within its calling context, which, when dealing
| with URLs, is often _anywhere on the web where DNS responds._
|
| Even if you presume that the resource being located is in fact
| a unique resource in the general case, the "unique resource"
| may not be unique in the way that you presume. Some URLs at a
| given location are intended to be idempontent and cache-able,
| others are not, and many are time-limited forwarders. And
| there's no guarantee or even expectation that two identical
| forwarding URL's will resolve to the same location; it may well
| be network-topography dependent.
| crote wrote:
| > Take for example the Danish CPR number. That's perfectly fine
| as a natural key; its definition is the first CPR number
| assigned.
|
| The problem is that the _new_ CPR number is now the one you
| want to use for all display purposes and future interactions
| with external systems. In other words, you can 't use the
| "original CPR" field for anything _except_ as key. It 's no
| longer a CPR field, because it no longer has any relation to
| the person's CPR!
|
| And at that point it'd be better to just use a GUID or
| something as key and avoid any potential confusion between the
| "real" CPR and "fake" CPR, because when the two are the same
| 99% of the time it is _guaranteed_ to cause a shitton of bugs.
|
| The only solution is to essentially rewrite all your records
| with the new CPR as key, and leave a redirect entry at the old
| CPR. That's pretty much what happens in Sweden when you change
| your gender: your old identity _ceases to be_ and you 're
| issued a completely new one.
| stiiv wrote:
| There is still something to be said for "real world uniqueness"
| (GIS coordinates) or deferring to a third party to establish
| identity (license plate numbers).
|
| Identifiers like these aren't always available, but within many
| domains will be sufficient.
|
| The idea here is not that these keys can't be somehow "invalid,"
| but rather that it isn't our system's problem -- it belongs to
| some other authority.
| jonstewart wrote:
| TFA author cites many examples from human-oriented systems,
| which perhaps are more common and traditional domains for RDBMS
| design, where there often several layers of exceptions to the
| rules - his argument that synthetic keys work well for these
| domains lands.
|
| I personally work far more with computer-oriented systems and
| their data, and natural keys work well for me. When well-chosen
| they allow me to do an initial load of the source data for
| analysis, and then aggregate such databases together later on
| for historical analysis without fear of conflict. The data are
| often immutable in these domains, too.
| hateful wrote:
| After reading many threads here, I think the final ruling may be:
| - If your record represents a physical being or object: Use a
| surrogate key.
|
| People change.
|
| Unrelated - if you have a list of emails or SSNs or license
| plates or VINs - we can think of these as foreign keys to a
| database we don't control.
| fl0ki wrote:
| I inherited a project where a dozen different entity kinds were
| all given UUIDv4s. Users mostly see namespaced strings, but
| actually searching by those strings is janky and unreliable for
| other reasons (including the fact that they _can_ change), so
| whenever I 'm asked to debug something I insist on reducing to
| UUIDs first so at least we're definitely looking at the same
| entities.
|
| I didn't like the UUIDs at first, but it ended up being an
| unexpected boon for generic code to use the same key type for
| different entity kinds. What was less of a boon was that the
| string identifiers also have to be unique, but can change at any
| time, and depend on three-level namespacing (yes really) so
| there's far more that has to be tracked and enforced. The names
| are very important for UI use cases, but can never be the way
| that records reference other records, because that would just
| make them much harder to change with confidence.
|
| The essay seems to assume you'll have either unique natural keys
| or unique synthetic keys, but having now worked on a project that
| does both at the same time for many entities, I think it's a
| third option worthy of its own analysis. My experience was
| negative but I can't deny that the end result ticks a lot of
| functional boxes.
| thatjoeoverthr wrote:
| One may notice this has something to do with mutability. If there
| isn't a surrogate key, the record isn't mutable. The database may
| let you change it, but the new record has a new identity.
| Mutability as a concept requires a common identity across time.
| Languages permitting mutability are using a pointer or reference
| as an implicit surrogate identity. A typical database can't offer
| this, hence the need to put explicit surrogate keys into the
| schema. You cannot say "this changed" unless you can refer to
| both samples as a common "this".
| bjourne wrote:
| While that is true, for correctness appends should supplant
| updates. On HN, many users (like myself) have posted comments
| for over a decade. Suppose I changed my username from "bjourne"
| to "SpongeBob"... Should the comments I wrote in 2010 show up
| as having been authored by "SpongeBob" or "bjourne"? I'd
| strongly argue in favor of the latter since the former would
| constitute falsifying history. The "change" in username should
| be viewed as the creation of a new personae rather than a
| "change".
| theelous3 wrote:
| Another counter example is when the thing you are recording
| literally _is_ the natural key. Phone numbers are a good example
| of this. There is no situation in which some data entry error
| will break a number. That is the creation of a new distinct
| (possibly) broken number. A phone number doesn't care about any
| externality in the world. It is what it is.
| banish-m4 wrote:
| Use db-generated UUID "id" primary keys. Add a bigserial "order"
| field to tables that require stable temporal ordering, and then
| either create a view to "order by" it or always add "order by" to
| stable-required statements.
| TexanFeller wrote:
| One downside is UUIDs take twice the space of a BIGINT which is
| minor on the surface, but makes a huge difference when doing
| lots of joins on that key. I used to see 3-4x difference in
| some queries even using postgres' uuid type(not the string
| column of fools). Doubling the space used by your keys also
| means less of your indexes stay locked in RAM.
| dwheeler wrote:
| I've found that using a simple unique integer id as the key
| usually works well. Integers are easy to explain and are simpler
| to enter than uuids. There are always ttade-offs, but if there
| are no other compelling issues, simplicity is best.
| branko_d wrote:
| The author seems to suggest that the choice is between natural
| and surrogate key.
|
| In fact, the choice is between natural and natural+surrogate key.
|
| - If you have a natural key, you _have_ to enforce it, otherwise
| you risk data corruption. The question is do you _also_ need a
| surrogate key? Sometimes you do, sometimes you don't.
|
| - If you don't have an obvious natural key, then your surrogate
| becomes meaningful. You have to use something to distinguish
| between two "equal but not identical" rows, so you end-up showing
| the surrogate in the UI etc. In other words, it is no longer
| "pure" surrogate.
| JackSlateur wrote:
| He is right: natural keys are difficult to handle
|
| I'd just say this is not so right: As it turned out, though,
| whoever made that piece of software knew what they were doing,
| because the mechanic just changed the chassis number, and that
| was that.
|
| Because, yes, everybody can just reprogram your car and change
| its VIN and stuff.
|
| The manufacturer software have protection against that. But
| nothing prevents you from using another software. At the end, you
| can directly program whatever you want into the car : change the
| VIN, change the odometer etc.
| duped wrote:
| > At the end, you can directly program whatever you want into
| the car : change the VIN, change the odometer etc.
|
| And you can wind up with an undriveable car this way.
| Particularly the odometer if you roll it back and try and sell
| it, your state's DMV will probably reject the title transfer.
| mikl wrote:
| If you use foreign keys in PostgreSQL, it is actually possible to
| update natural keys with little trouble via the "ON UPDATE
| CASCADE" mechanism on such keys.
| DontchaKnowit wrote:
| Yep. My last company had unique id columns for every table in the
| DB. Everythibg worked flawlessly as far as associating data.
|
| My current company uses natural keys all over the place and its a
| total clusterfuck trying to associate data at times.
| hosh wrote:
| There's another gotcha I found out at my workplace -- for
| Postgres, tables without a primary key has trouble when you want
| to use logical replication. The times you might want to use
| logical replication includes using AWS's database migration
| service to shrink allocated storage.
| cynicalsecurity wrote:
| > From following the course curriculum, it'd be natural, however,
| to define a key for the Restaurants table as the combination of
| restaurantName, cityName, and year. The assumption is that name
| and city uniquely identifies a restaurant.
|
| That's a strange assumption.
| teknopaul wrote:
| Don't see the problem with ssn as I'd. People have two, get two
| rows in your database. Not the end of the world often. database
| it's a tool thus limitation is not an error. E.g. maybe trans
| peeps want to forget their old ID, maybe not. your own key has no
| chance of reflecting any reality in the world.
| hathawsh wrote:
| The numerous problems with SSNs were identified years ago.
|
| https://www.gao.gov/products/gao-05-1016t
|
| In plain text form:
|
| https://www.gao.gov/assets/a112177.html
|
| I remember when the school I went to changed from using SSNs
| for all student records to using no SSNs at all. They had to
| notify everyone about the changes, repeatedly. It was obviously
| very expensive. Don't use SSNs as keys.
| hot_gril wrote:
| I don't get why relational DBs are usually taught around natural
| keys. With MySQL/Postgres you almost always want bigserial PK.
___________________________________________________________________
(page generated 2024-06-05 23:02 UTC)