https://blog.ploeh.dk/2024/06/03/youll-regret-using-natural-keys/

Toggle navigation ploeh blog

  * About
  * Archive
  * Hire Me
  * Pages
  * Schedule
  * Support the blog
  * Tags

You'll regret using natural keys by Mark Seemann

Beating another dead horse.

Although I live in Copenhagen and mostly walk or ride my bicycle in
order to get around town, I do own an old car for getting around the
rest of the country. In Denmark, cars go through mandatory official
inspection every other year, and I've been through a few of these in
my life. A few years ago, the mechanic doing the inspection informed
me that my car's chassis number was incorrect.

This did make me a bit nervous, because I'd bought the car used, and
I was suddenly concerned that things weren't really as I thought. Had
I unwittingly bought a stolen car?

But the mechanic just walked over to his computer in order to correct
the error. That's when a different kind of unease hit me. When you've
programmed for some decades, you learn to foresee various typical
failure modes. Since a chassis number is an obvious candidate for a
natural key, I already predicted that changing the number would prove
to be either impossible, or have all sorts of cascading effects,
ultimately terminating in official records no longer recognizing that
the car is mine.

As it turned out, though, whoever made that piece of software knew
what they were doing, because the mechanic just changed the chassis
number, and that was that. This is now five or six years ago, and I
still own the same car, and I've never had any problems with the
official ownership records.

Uniqueness #

The reason I related this story is that I'm currently following an
undergraduate course in databases and information systems. Since this
course is aimed at students with no real-world experience, it wisely
moves forward in a pedagogical progression. In order to teach
database keys, it starts with natural keys. From a didactic
perspective, this makes sense, but the result, so far, is that the
young people I work with now propose database designs with natural
keys.

I'm not blaming anyone. You have to learn to crawl before you can
walk.

Still, this situation made me reflect on the following question: Are
natural keys ever a good idea?

Let's consider an example. For a little project we're doing, we've
created a database of the World's 50 best restaurants. My fellow
students suggest a table design like this:

CREATE TABLE Restaurants (
    year TEXT NOT NULL,
    rank TEXT NOT NULL,
    restaurantName TEXT NOT NULL,
    cityName TEXT NOT NULL
);

Granted, at this point, this table definition defines no key at all.
I'm not complaining about that. After all, a month ago, the students
probably hadn't seen a database table.

From following the course curriculum, it'd be natural, however, to
define a key for the Restaurants table as the combination of
restaurantName, cityName, and year. The assumption is that name and
city uniquely identifies a restaurant.

In this particular example, this assumption may actually turn out to
hold. So far. After all, the data set isn't that big, and it's
important for restaurants in that league to have recognizable names.
If I had to guess, I'd say that there's probably only one Nobelhart &
Schmutzig in the world.

Still, a good software architect should challenge the underlying
assumptions. Is name and city a natural key? It's easy to imagine
that it's not. What if we expand the key to include the country as
well? Okay, but what if we had a restaurant named China Wok in
Springfield, USA? Hardly unique. Add the state, you say? Probably
still not unique.

Identity #

Ensuring uniqueness is only the first of many problems with natural
keys. You may quickly reach the conclusion that for a restaurant
database, a synthetic key is probably the best choice.

But what about 'natural' natural keys, so to speak? An example may be
a car's chassis number. This is already an opaque number, and it
probably originates from a database somewhere. Or how about a
personal identification number? In Denmark we have the CPR number,
and I understand that the US Social Security Number is vaguely
analogous.

If you're designing a database that already includes such a personal
identification number, you might be tempted to use it as a natural
key. After all, it's already a key somewhere else, so it's guaranteed
to be unique, right?

Yes, the number may uniquely identify a person, but the converse may
not be true. A person may have more than one identification number.
At least when time is a factor.

As an example, for technical-historical reasons, the Danish CPR
number carries information (which keys shouldn't do), such as a
person's date of birth and sex. Since 2014 a new law enables
transsexual citizens to get a new CPR number that reflects their
perceived gender. The consequence is that the same person may have
more than one CPR number. Perhaps not more than one at the same time,
but definitely two during a lifetime.

Even if existing keys are guaranteed to be unique, you can't assume
that the uniqueness gives rise to a bijection. If you use an external
unique key, you may lose track of the entities that you're trying to
keep track of.

This is true not only for people, but cars, bicycles (which also have
chassis numbers), network cards, etc.

Clerical errors #

Finally, even if you've found a natural key that is guaranteed to be
unique and track the actual entity that you want to keep track of,
there's a final argument against using an externally defined key in
your system: Data-entry errors.

Take the story about my car's chassis number. The mechanic who
spotted the discrepancy clearly interpreted it as a clerical error.

After a few decades of programming, I've learned that sooner or
later, there will be errors in your data. Either it's a clerical
error, or the end-user mistyped, or there was a data conversion error
when importing from an external system. Or even data conversion
errors within the same system, as it goes through upgrades and
migrations.

Your system should be designed to allow corrections to data. This
includes corrections of external keys, such as chassis numbers,
government IDs, etc. This means that you can't use such keys as
database keys in your own system.

Heuristic #

Many were the times, earlier in my career, when I decided to use a
'natural key' as a key in my own database. As far as I recall, I've
regretted it every single time.

These days I follow a hard heuristic: Always use synthetic keys for
database tables.

Conclusion #

Is it ever a good idea to use natural keys in a database design? My
experience tells me that it's not. Ultimately, regardless of how
certain you can be that the natural key is stable and correctly
tracks the entity that it's supposed to keep track of, data errors
will occur. This includes errors in those natural keys.

You should be able to correct such errors without losing track of the
involved entities. You'll regret using natural keys. Use synthetic
keys.

---------------------------------------------------------------------

Comments

James Snape #

There are lots of different types of keys. I agree that using natural
keys as physical primary keys is a bad idea but you really should be
modelling your data logically with natural keys. Thinking about
uniqueness and identity is a part of your data design. Natural keys
often end up as constraints, indexes and query plans. When natural
keys are not unique enough then you need to consider additional
attributes in your design to ensure access to a specific record.

Considering natural keys during design can help elicit additional
requirements and business rules. "Does a social security number
uniquely identify a person? If not why?" In the UK they recycle them
so the natural key is a combination of national insurance number and
birth year. You have to ask questions.

2009-06-04 15:43 UTC
Thomas Castiglione #
[you-will-r]
2024-06-05 9:33 UTC
Nicholas Peterson #

I largely agree with James Snape, but wanted to throw in a few other
thoughts on top. Surrogates don't defend you from duplicate data, in
fact they facilitate it, because the routine generating the surrogate
key isn't influenced by any of the other data in the record. The
concept of being unable to correct a natural key is also odd, why
can't you? Start a transaction, insert a new record with the correct
key, update the related records to point to the new record, then
delete the old record, done. Want some crucial information about a
related record but only have the surrogate to it? I guess you have to
join it every time in order to get the columns the user actually
wants to see. A foreign key that uses a natural key often often
prevents the join entirely, because it tells the user what they
wanted to know.

I find the problem with natural keys usually comes from another
source entirely. Developers write code and don't tend to prefer using
SQL. They typically interact with databases through ORM libraries.
ORMs are complicated and rely on conventions to uniformly deal with
data. It's not uncommon for ORMs to dictate the structure of tables
to some degree, or what datatypes to prefer. It's usually easier in
an ORM to have a single datatype for keys (BIGINT?) and use it
uniformly across all the tables.

2024-06-05 12:42 UTC
---------------------------------------------------------------------

  * - Previous
  * Archive
  * Next -

---------------------------------------------------------------------

Wish to comment?

You can add a comment to this post by sending me a pull request.
Alternatively, you can discuss this post on Twitter or somewhere else
with a permalink. Ping me with the link, and I may respond.

Published

Monday, 03 June 2024 19:46:00 UTC

Tags

  * Architecture 32

---------------------------------------------------------------------
Mark Seemann
Support the blog
Buy my book Code That Fits in Your Head
Buy my book about Dependency Injection
Watch my Pluralsight courses
Watch my Clean Coders videos
Public speaking schedule
Tweet

"Our team wholeheartedly endorses Mark. His expert service provides
tremendous value."
Hire me!
Published: Monday, 03 June 2024 19:46:00 UTC

(c) Mark Seemann 2024 with help from Jekyll Bootstrap and Twitter
Bootstrap