[HN Gopher] What does First Normal Form mean?
___________________________________________________________________
What does First Normal Form mean?
Author : goto11
Score : 67 points
Date : 2021-06-17 12:24 UTC (1 days ago)
(HTM) web link (www.cargocultcode.com)
(TXT) w3m dump (www.cargocultcode.com)
| gfody wrote:
| I think he contradicts himself and confuses the data with the
| information. I don't think delimited values representing a table
| in a column can be considered normalized.
|
| > _Make no mistake, encoding multiple values in a single string
| is generally a bad design. But it has nothing to do with first
| normal form. First normal form mean a column should not allow
| relations as values. A comma-separated string is still just a
| single string from the perspective of the database type system._
| mamcx wrote:
| No, is not a contradiction.
|
| Is like say "Is wrong to store a large json in a cell in an
| array, that make it not an array".
|
| A relation is A VALUE. That is.
|
| The atomicity of the data INSIDE each "cell" is orthogonal to
| the fact the that relation is a relation.
|
| This is the point.
| blacktriangle wrote:
| For even more fun mind bending, there's absolutely nothing
| that violates the relational model in having relation-valued
| attributes. SQL doesn't support that, but that doesn't mean
| its non-relational.
| derefr wrote:
| Re: DB support for relations as values -- while no DB I
| know of supports full relations ( _sets_ of tuples) in
| values, composite-typed (single-tuple) columns are a thing,
| e.g. the on_hand example table in
| https://www.postgresql.org/docs/current/rowtypes.html
|
| Also, presuming DB support for some format like JSON, you
| can _encode_ a relation into a JSON value, stick it into a
| column and then, at query time, use a function like
| Postgres 's json_to_recordset(json) to emit a relation (set
| of tuples). Postgres doesn't know how to deal with it "as
| is" (i.e. it forces you to treat the result of a function
| as a named joined table in your query, rather than treating
| it as an concrete value) but it does understand _where_ it
| is in relation (heh) to the rest of the row, since you can
| unnest() values from this virtual table and get the right
| results out.
|
| As awful as that sounds, we're actually doing that in
| production to normalize data coming from redis_fdw (which
| gives you, basically, (key text, value jsonb) rows for
| Redis hashes) into relational shape.
| fuckf4ce wrote:
| Relation valued attributes are precisely the antithesis of
| first normal form. That's like exactly what first normal
| form is about.
| [deleted]
| brazzy wrote:
| You're not really addressing the point of the comment you are
| replying to, which is about what constitutes a _normalized_
| relation, not what constitutes a relation.
|
| And I agree, TFA is just flat out wrong.
|
| A relation Manager with a CSV-valued column SubordinateIds or
| columns named "Subordinate1Id" through "Subordinate9Id" is
| _most definitely_ not normalized because those are in effect
| a (bad) simulation of hierarchical-database-style repeating
| groups. That they have distinct names (or positions in the
| CSV list) is an artifact of the representation and not really
| part of the data model.
| mamcx wrote:
| To be pedantic, is not a contradiction. Is a sign of
| problem with the design? Probably! Most likely!
|
| But the relational model CAN'T say it. The normal forms are
| properties of the way you can do the design, but can't go
| that Deep.
|
| In other words: You can adhere 100% on SOLID OOP principles
| and your code is still conceptually wrong!
|
| If pay attention to the article, his issue is that people
| latches on the apparent definition of 1st normal form, but
| misunderstanding it and probably coming to the wrong
| conclusion about how make the tables...
| gfody wrote:
| it's all a subtle misunderstanding of who's doing the
| considering and what's being considered. the author wants to
| say 1nf is of no concern to users of rdbms, that they can't
| actually violate 1nf because the system doesn't violate 1nf -
| however when users are considering normal form it's about the
| information (the phone numbers) not the data (a string) and
| in that sense the information is not normalized due to
| violating first normal form. a different audience, say the
| rdbms designers, can't have concerns about anything beyond
| the string - they don't care what you stash in there it's
| just data and in that sense the data is normalized but the
| information is not.
|
| as an aside I think a lot of this sprouts from subtle
| ambiguity in the language. you really shouldn't just drop all
| articles and expect your english to still make sense!
| dragonwriter wrote:
| > I think he contradicts himself and confuses the data with the
| information.
|
| I would say he is confusing the physical representation with
| the logical model.
| derefr wrote:
| > A relation more or less correspond to a table
|
| "More or less", but mostly less. Views are relations. The rowsets
| returned by a SELECT query are relations. Etc.
|
| Without going into what relational algebra is about, I find it a
| lot more helpful to understand a "relation" like this:
|
| A relational database is like a Prolog interpreter: it's a thing
| that "knows" a bunch of "facts", because someone has "asserted"
| (declared to be true) those facts to it.
|
| In Prolog, facts look like this:
|
| * red(MyCar).
|
| * father(HarryPotter, JamesPotter).
|
| * succ(1, 2).
|
| Those look like functions, but what they really are, are tuples,
| and moreover, tuples that are members of a _relation_. They
| describe / assert a particular relationship. The "function name"
| at the beginning of a fact is the _label_ , an identifier for a
| particular kind of relationship known to the system; it is then
| followed by an N-tuple of values, describing the _details_ of
| this particular instance of the relationship.
|
| A mathematical "relation", then, can be understood as an
| arbitrary set containing facts/relationships of a given _type_
| (the "type" being the predicate label, the tuple's arity, and
| the domain types of the tuple's slots.)
|
| Note that what this definition is _defining_ is, essentially, an
| ADT. A relation is a container for relationships, that exposes
| certain operations (relational algebra).
|
| You can think of the set of "asserted" facts/relationships of
| each kind, as being a special relation. A global data structure.
|
| But remember that Prolog can _derive_ truth from predicate
| functions (e.g. that 'succ' predicate above can be defined by
| induction by defining a base-case fact and a function succ(N, N +
| 1).) So, while the set of _asserted_ facts is _a_ relation
| declaring which facts /relationships are "true", it's not the
| _only_ source of truth.
|
| -----
|
| Now, going back to the world of relational databases:
|
| SQL is a language for manipulating relations, just as APL is a
| language for manipulating arrays.
|
| Database tables _are_ relations! But they 're not the _only_
| relations in the DB. Any "rowset" (as Postgres terms it) that
| you create or manipulate is a relation.
|
| Every relation in an RDBMS (table, view, result rowset, etc.) has
| a type. That type is the kind of its relation, as defined above.
| It includes the arity and types of the tuple fields, yes, but
| _also the label_. -- Prolog equivalent relation
| type: foo(A, B). CREATE TABLE foo (a text, b text).
|
| When you're creating an RDBMS table with CREATE TABLE, and giving
| a structural inline row-type definition, you're implicitly first
| defining a relation type with the same _label_ as the table 's
| _name_. The table itself is now just one relation of that type.
| You can have others! (In fact, queries selecting * from that
| table, are already other relations of the same type. However, if
| you 're _projecting_ or _joining_ -- changing the shape of the
| result-set -- then your result relation is of a new temporary
| relation type.)
|
| Perhaps surprisingly, you _can_ have two RDBMS tables that have
| the same relation _type_ , which makes them both sources of truth
| about the same _facts_. (Why? Partitions, for one.)
|
| But two RDBMS tables aren't sources of truth about the same facts
| just _because_ they share a shape. They need to actually be
| defined as having the same relation type (including the label!),
| to be considered interchangeable by the DB (for purposes of e.g.
| defining stored procedures that operate on rows of a given type.)
|
| You can't do this by just creating another table with the same
| (structural inline) row-type definition. Those tables will have
| different relation types, because those types are _carrying
| around predicate labels_ , and those labels are different.
| -- BAD: relation types are fathers_us(text, text) and
| fathers_eu(text, text). Incompatible! CREATE TABLE
| fathers_us (child text, father text); CREATE TABLE
| fathers_eu (child text, father text);
|
| Instead, you must either explicitly define a relation type using
| CREATE TYPE; or reuse the type implicitly created from your first
| CREATE TABLE, to create the second. Only then are the relation
| labels the same. -- Single type: father(A, B).
| CREATE TYPE father AS (child text, father text); --
| GOOD: same type father(A, B). CREATE TABLE fathers_us OF
| father; CREATE TABLE fathers_eu OF father;
|
| Why don't DBMSes use structural typing? Well, because facts can
| have the same _structure_ but still not be facts about the same
| thing: CREATE TABLE fathers(child text, parent
| text); CREATE TABLE mothers(child text, parent text);
|
| A stored procedure that takes a fathers(text, text) row,
| shouldn't necessarily be able to accept a mothers(text, text)
| row!
| wizzwizz4 wrote:
| > _A stored procedure that takes a fathers(text, text) row,
| shouldn 't necessarily be able to accept a mothers(text, text)
| row!_
|
| Of course, in the _real_ world, you shouldn 't make such a
| distinction. People are always messier than your database
| schema.
| pjungwir wrote:
| Codd's paper is not that hard. [0] is a PDF copy. One of the pros
| to reading primary sources is you then know "the whole story":
| what if anything is missing from popular explanations, and what
| is changed.
|
| A striking bit of trivia (but OT to SQL) is that in Turing's "On
| Computable Numbers" [1] he treats a Turing Machine that halts as
| having a problem, whereas in basically every popular treatment a
| machine that halts is one that gives you an answer, and a machine
| that doesn't halt is one with a bug, like it entered an infinite
| loop. But his paper is the opposite (and doesn't use the word
| "halt" at all!): If a computing machine never
| writes down more than a finite number of symbols of the
| first kind it will be called circular. Otherwise it is said to be
| circle-free. A machine will be circular if it
| reaches a configuration from which there is no possible
| move, or if it goes on moving, and possibly printing symbols of
| the second kind, but cannot print any more symbols of the
| first kind.
|
| More relevantly, Codd's paper from ten years later [2] is even
| more interesting than the first one. Everyone focuses on the
| first half (discussing NULLs and outer joins) but the second half
| is way more interesting, and has never been implemented anywhere
| AFAIK. It sounds a lot like graph databases and queries that
| dynamically change structure based on querying the data catalog.
|
| Re relations containing relations, I was very surprised that C.J.
| Date is apparently okay with them. He is a perfect example of the
| disciple who is more zealous than the master, and has spent his
| career trying to remove NULLs and non-unique rows. But in page
| 17-18 of [3] he writes: To repeat, any type
| whatsoever---apart from the two exceptions noted in the
| section "Types" earlier---can be used as the basis for defining
| attributes of relations. In particular, relation types
| can be used for this purpose. (So too can tuple types.)
| Thus, attributes can be *relation valued*, meaning we can have
| relations with attributes whose values are relations in
| turn.
|
| Wow! Part of me wonders if he is really just compromising because
| relations-inside-relations is so useful for his goals in that
| book. I had to re-read that passage over and over to convince
| myself I wasn't misunderstanding.
|
| [0] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
|
| [1] https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf
|
| [2]
| https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...
|
| [3] C.J. Date et al, _Time and Relational Theory_
| mmarx wrote:
| > A striking bit of trivia (but OT to SQL) is that in Turing's
| "On Computable Numbers" [1] he treats a Turing Machine that
| halts as having a problem, whereas in basically every popular
| treatment a machine that halts is one that gives you an answer,
| and a machine that doesn't halt is one with a bug, like it
| entered an infinite loop. But his paper is the opposite (and
| doesn't use the word "halt" at all!):
|
| Note that his machines compute real numbers, which always have
| a an infinitely long binary representation (with possibly
| infinitely many trailing zeroes), whereas usually one considers
| Turing machines computing natural numbers, which always have a
| finite binary representation.
| pjungwir wrote:
| Yes, it makes sense. But it sure took me by surprise when I
| finally read the original paper.
| pjungwir wrote:
| I looked up Date's two exceptions. The first is uninteresting
| ---databases cannot contain pointers---but the second is sort
| of relevant:
|
| > If a relation _r_ has heading _H_ , then no attribute of _r_
| can be defined in terms of a relation or tuple type that has
| the same heading _H_ , at any level of nesting.
|
| In other words, relations can nest, but not recurse.
| crazygringo wrote:
| Sorry but I think this author is being more pedantic rather than
| helpful, and actually seems to miss the point entirely. The crux
| here is two things the author claims:
|
| > _Since SQL does not allow creating or using nested tables, most
| relational databases will be in first normal form by necessity._
|
| > _Many explanations (including Wikipedia) uses the example of a
| comma-separated string with phone numbers as an example of a 1NF
| violation... Make no mistake, encoding multiple values in a
| single string is generally a bad design. But it has nothing to do
| with first normal form. First normal form mean a column should
| not allow relations as values. A comma-separated string is still
| just a single string from the perspective of the database type
| system._
|
| I think it's rather obvious that the example here is nothing to
| do with _literal_ commas. But that storing multiple comma-
| delimited phone numbers in a single field _is precisely_ creating
| a nested table _in spirit_.
|
| It has nothing to do with SQL support for nested tables or the
| database type system -- it's not about _formal_ definitions. It
| 's about whether you're _conceptually_ putting multiple pieces of
| data into a single field. Which has _everything_ to do with first
| normal form.
|
| Because if you put comma-delimited phone numbers into a field,
| you _lose the ability to relate and join those phone numbers
| individually to another table_. Which is bad, and the whole
| concept of "1NF" exists _to give a name to this badness_.
|
| And this isn't a minor point. It's actually extremely important,
| because the common practice of throwing JSON data into a string
| field destroys the ability to operate relationally on the data
| within it.
|
| Now of course, plenty of databases these days actually _do_
| provide ways to index and join on data using a new "JSON" type
| column -- which _is_ yet another form of the "nested table" the
| author claims doesn't exist. Which is great for combining ease of
| use with performance, but certainly _isn 't_ 1NF, and basically
| explodes (or at least dramatically complicates) the fundamental
| concept of relational databases as tables of rows and columns.
| zozbot234 wrote:
| > Because if you put comma-delimited phone numbers into a
| field, you lose the ability to relate and join those phone
| numbers individually to another table. Which is bad
|
| It's bad _if you expect that ability_ in the first place! If
| the data you 're storing (e.g. a list of phone numbers) will
| only ever be treated as a unified blob without any finer
| senantics, storing it as a single attribute is not a 1NF
| violation.
| mannykannot wrote:
| Among the principal tenets of Codd's database model (which is
| _not_ intended to be a generalized model of relations) are
| that the database scheme should not introduce either
| accidental relationships or accidental constraints.
|
| If we represent a one-to-many relationship by making a CSV of
| the elements of the "many" side, then, for a given X standing
| in this relationship to A, B and C:
|
| 1) We have introduced an ordering that may not be relevant -
| is "A,B,C" the same as "B,A,C"?
|
| 2) We cannot find out whether X has this relationship to B
| without parsing the CSV, or which entities B stands in this
| relationship to.
|
| 3) We cannot add or remove instances of this relationship to
| X without updating the whole relationship.
|
| ...and so on.
|
| All these are accidental features of the data representation
| that are not inherent to the semantic domain.
|
| So, when you write "It's bad if you expect that ability in
| the first place!", you are essentially saying "if we put
| aside the principles of Codd's model and accept the very
| problems it was expressly intended to mitigate..."
|
| You can certainly use a relational database to store
| information in ways that preserve the sorts of problems they
| were intended to solve - and there will be cases where it
| does not matter. The normal forms are guidelines for using
| relational databases as intended, to minimize these problems,
| when it might matter.
| crazygringo wrote:
| Technically yes, but this is where you have to use common
| sense.
|
| The assumption that something "will only ever be treated as a
| unified blob" can turn out to be a real footgun -- suddenly
| you need to locate all users who share a phone number and,
| whoops, you can't.
|
| Nobody's saying strings should be split into individual
| characters, or small JPEG thumbnail blobs ought to be
| decomposed into fields for each header.
|
| But if you're dealing with obvious common-sense pieces of
| data like phone numbers? That experience ought to tell you
| have a reasonable chance of needing to be joined relationally
| at some point? Then the smart thing to do is to interpret
| that as a 1NF violation.
| painchoc wrote:
| What about NULLs ? Is it allowed or not ? I recently discovered
| on Wikipedia https://en.wikipedia.org/wiki/First_normal_form that
| it should not be.
| hobs wrote:
| A clear and concise explanation - well done and rare for the
| topic.
| whatshisface wrote:
| I have never understood why RDBs have such general concepts of
| primary keys, where you can for example let date of birth be a
| primary key, when in every schema I have ever designed or seen,
| all rows get unique integer IDs anyway.
| projektfu wrote:
| In one sense, the unique integer is often not real data. It's
| synthetic and has no meaning outside the application.
|
| I find it easier to understand when I think that Codd and other
| database pioneers were looking at creating universal databases,
| not the application-specific ones we tend to use.
|
| Furthermore, the use of auto numbering tends to make the
| queries easier to write but the tables less meaningful. For
| example, INVOICE might have a sequential invoice id that is
| real information, as well as a reference to customer, billing
| address, shipping address, etc. INVOICE_ITEM will have a
| foreign key to an invoice, unique number, position, stock item
| id (second foreign key), description, etc. PACKING_LIST might
| have a foreign key to INVOICE_ITEM. In my opinion, I prefer to
| see INVOICE_ID as part of the primary key of INVOICE_ITEM. I
| like being able to see the relationship in foreign keys and
| things like that.
|
| These things can be a heated topic and other designers probably
| think I also like to murder kittens. But the data layout is, I
| think, what Codd was going for.
| hknapp wrote:
| I worked at a place that used decimals that auto incremented by
| 2
| mamcx wrote:
| > when in every schema I have ever designed or seen, all rows
| get unique integer IDs anyway.
|
| This is artifact of 4 things:
|
| - Indexed Integer search are fast in RDBMS
|
| - Having an non-bussines related value for identify a row could
| save you from making triggers for updated in other relations if
| _it change_. This is the part a lot of folks missed: If you
| already have a natural PK and it not change, is wasteful add
| ANOTHER (is another index btw)
|
| - Is enforced by ORMS that are badly designed.
|
| - Most people have not idea how model a database, and this one
| little thing is one of the easier "fix" you can think of!
|
| I have done a lot "non integer primary keys" before (and not
| GUIds!) and is useful that RDBMS are not like mongo with a
| fixed schema (yes! Mongo is a single fixed-schema for all your
| data!).
|
| For reporting, store aggregates, pre-compute values, etc is
| very valuable!
|
| For example, in accounting you do reports per day/mont/year.
|
| You then do granularity per day, and the another per month, per
| year, per semestre, etc. When the data volume is high, making
| tables "days, months, years" make a lot of sense.
|
| Cool guys call this a "time-series database".
|
| But the rdbms CAN model it!
|
| And can model
|
| - "document database"
|
| - "key-value database"
|
| - "olap database"
|
| - "columnar database"
|
| etc.
|
| (you go for specialized backend for performance but honestly?
| Most of the time is a mistake if them are not relational.
|
| "Relational" don't means "is backed by b-trees, is row oriented
| and use SQL")
| whatshisface wrote:
| A time series database isn't keyed by day/month/year, it's
| typically keyed by monotonic timestamps.
| mamcx wrote:
| Yes, that is even a _more narrowing_ detail. (people that
| use a "time-series" rarely need that high of granularity!)
|
| But the question was about wondering "why exist the support
| for PK that are not integers/guis, like dates"?.
|
| To be able to model a lot of things for what in the past
| (before RDBMS) and today (for upstarts engines) you will be
| tempted to add another data store (that rarely is that
| necessary).
|
| I know by a fact that some times is because people not
| realize you DON'T need to constrain your db to things like
| "only integers pks!"!
|
| P.D: I work in the enterprise sector. Is fun when something
| is composed of a cluster of disparate things purely by
| misunderstanding you could have solved it with a simple db,
| well done. Contrary to belief, a lot of my customers, their
| largest databases are 1-10GB each. A rdbms can happily go
| much much larger than that, if you give them a little
| love...
| jrumbut wrote:
| It's wonderful what you can gain in performance, correctness,
| and storage efficiency from finding the real primary key from
| the data model itself.
|
| Maybe not worth the intensive design work for many tables,
| and it can be a risk to get it wrong, but it's cool when it
| does work.
| brazzy wrote:
| Yeah, I'm currently facing exactly that dilemma. I have to
| model a database of machine parts.
|
| There is a natural PK in the form of the supplier's parts
| number, but the legacy system (which uses that as PK) notes
| that it is specifically the parts number _as used in the
| supplier 's ERP system_, but suppliers often use
| differently formatted parts numers in publications such as
| catalogs (adding or removing spaces, slashes or dashes). So
| the legacy system also has the "published part number" as
| one or more separate fields.
|
| This inconsistency makes me lean very strongly towards
| using a synthetic PK instead.
| bob1029 wrote:
| We deal with a lot of crap where a certain business
| entity has several different biz keys that can reasonably
| be used to uniquely identify the thing.
|
| I would concur with adding a synthetic primary key, and
| maintaining all of the relevant business/domain keys on
| the same type.
|
| The advantage of this approach is that you can still keep
| 100% correct relations between your internal types, even
| if the domain keying scheme is fucked up or otherwise
| broken per your original assumptions. It's really easy to
| add a new column to refine the additional domain keys.
| Fixing a bad PK and all the things that talk to it is
| much harder.
| turbocon wrote:
| It's rare in practice but occasionally rows have natural unique
| identifiers that are useable.
|
| The textbook case would be student IDs in a university database
| where each row corresponds to a single student. However some
| universities view the student id as sensitive information (kind
| of like a ssn in the US) and so in that scenario it should be
| aliased to some other key to prevent the student id from being
| present in multiple tables as a foreign key.
| dragonwriter wrote:
| > The textbook case would be student IDs in a university
| database where each row corresponds to a single student
|
| "Student IDs" are an example of surrogate keys, though they
| may be surrogate keys in a system predating and outside of
| the DB.
|
| (And if the mapping between them and actual students are
| managed by an error-prone process outside of the DB, they
| probably aren't good primary keys for a table of students.)
| TeMPOraL wrote:
| > _However some universities view the student id as sensitive
| information (kind of like a ssn in the US) and so in that
| scenario it should be aliased to some other key to prevent
| the student id from being present in multiple tables as a
| foreign key._
|
| It is ironic, given that the whole reason to have a student
| ID is for it to be a primary key in some database somewhere.
| whatshisface wrote:
| Databases are vulnerable to a particular type of folkloric
| magic:
|
| https://en.m.wikipedia.org/wiki/True_name
| jameshart wrote:
| Student IDs are certainly not 'natural' keys.
|
| I am almost entirely convinced that there is actually no such
| thing as a natural key.
| dragonwriter wrote:
| Dates are natural keys in a calendar table. The composite
| of the foreign key values is a natural key in a join table
| expressing a unique relationship between entities in two or
| more other tables, perhaps with additional attributes, even
| if those values are surrogate keys in their own tables.
| mulmen wrote:
| Natural keys are everywhere. They are just less interesting
| than the relationships. My name is a natural key. It
| uniquely identifies my name. The wheels come off if you
| expect it to uniquely identify _me_. But with natural keys
| that 's never actually necessary.
|
| When I buy a car they record my name, my date of birth, the
| time of the sale and some details about the car, such as
| the VIN, which is its name. That provides enough
| information to form a natural key to identify the _sale_.
|
| At no point does a database actually have a concept of who
| I am really, only the relevant data points to satisfy its
| own relational model.
|
| Any database that means to track information about people
| would be able to construct a natural key from my name, my
| mothers name and some information about my birth, such as
| the doctor, hospital and time.
| CharlesW wrote:
| > _I am almost entirely convinced that there is actually no
| such thing as a natural key._
|
| Yes! Or if it is _now_ , there's a future scenario where
| that will have proven to have been a poor decision.
| wolfgang42 wrote:
| a "natural key" is frequently just a _really_ foreign key
| in a database you and your org don 't manage. -- 'wweston,
| https://news.ycombinator.com/item?id=27349246
| wombatpm wrote:
| In the 1980's your student ID was most likely your SSN
| dnlhg wrote:
| Codd was pretty adamant in his original papers that keys should
| be made up from data rather than a unique integer, and his
| example of converting from a hierarchical database schemas to a
| relational database schema shows how compound, natural keys are
| important to represent relationships. But we don't usually do
| it that way today, partly because it isn't that practical with
| our current databases and ORMs. It has some theoretical
| advantages and disadvantages, but using IDs seems to work well
| enough that we're not investing in a new paradigm.
| ARandumGuy wrote:
| In practice, using a unique IDs is often not ideal, but it
| also rarely the wrong choice. I've personally never heard of
| a scenario where a unique ID caused difficult to fix issues
| (although I'm sure it's possible).
|
| In contrast, a poorly though out natural key can cause a lot
| of headaches. This mostly occurs when your natural key isn't
| as unique as you thought, or if you have to change what you
| use as a natural key. The real world does not like to conform
| to our database constraints.
| wwweston wrote:
| Yeah, I think the early proponents of relational models had a
| lot to do with setting the culture here.
|
| I suspect that some of this has to do with an insight I
| recently discovered: a "natural" key is in some sense a key
| that's so foreign it's in a "database" you don't control. And
| if you establish a culture that converges on natural keys,
| you may end up with more easily cross-operable
| databases/collections as an asset.
|
| And this is why I think it's probably dead. If the boring
| practicality of generated identifiers didn't do it, laws like
| the CCPA which are going to, because they recognized the
| potential of broad cross-operable data to be a liability and
| a hazard vs an asset.
| whatshisface wrote:
| Those "natural" databases as a rule are going to be full of
| duplicates and sometimes even inconsistencies, hence the
| need for IDs that only exist for the purpose of expressing
| relationships within your own database. Basically,
| pointers.
| mmcgaha wrote:
| I completely agree with you, yet I design most of my schemas
| with integer keys anyway. The only thing I do different than
| most folks is that I make a unique constraint/index for
| candidate keys (the proper primary key) because otherwise we
| have lost the ability to ensure data consistency.
|
| As for why I use integer keys, it has always been the
| convention everywhere I go so I don't want to break convention.
|
| Now the only convention that I would rather see would be to use
| GUIDs instead of integers because I hate that I can join an
| item table to an employee table and get results back.
| dylan604 wrote:
| How would DoB work as a primary key? A primary key is useless
| if not unique, right? And I know many people born on the same
| day that aren't even twins. Incrementing integers are nice and
| easy for humans to consume. Using GUIDs is far more portable
| than unique integers, but damn near impossible for humans to
| deal.
| whatshisface wrote:
| They wouldn't, so my point is why do RDBMSs support so many
| kinds of primary keys?
| dylan604 wrote:
| Keys can span multiple columns, so DoB could be a
| differentiator if First, Last are also the same. I
| personally would never try to build a key like that though,
| but that's the only way I could see justifying its use.
|
| If you can't see a why for uses of a field that is truly
| more unique, then I'd suggest you just need some more
| imagination. Strictly not allowing it seems to me to be
| short sighted just because someone had the ill conceived
| idea to use DoB as a primary key
| whatshisface wrote:
| Can you think of any good use case for a non-arbitrary
| primary key? You are right that first name + last name +
| dob is not a good way to design the system. For example
| someone could differentiate their twin children only by
| their middle name. So, what's an example of a non-integer
| non-arbitrary primary key?
| dylan604 wrote:
| Not sure exactly what you mean by non-arbitrary primary
| key, but...
|
| GUID/UUID
|
| FeatureName/ReleaseYear or Series/Season/Episode
|
| email
|
| phone-number
|
| SSN
|
| driverslicense#/State
| fuckf4ce wrote:
| Avoiding "natural keys" has little to do with relational
| databases or the model. Consider that the Social Security
| Number far predates relational databases.
|
| The reason for using surrogate keys is not some great
| conspiracy or mass incompetence. Simply, there are very few
| cases where natural keys work well.
|
| Good keys are generally stable, globally unique, not too big,
| and it turns out that these qualities are extremely rare in a
| natural key in practice. Furthermore, composite keys over more
| than 2 or 3 attributes become terribly unwieldy (and
| intuitively, composite key size and stability are inversely
| related).
|
| What is a good natural key for an individual? There are none!
| The SSA figured this out in the 20s, this was not a modern tech
| problem.
|
| Even social security numbers can't be stable, but they are far
| better than things like names or addresses because they change
| for only a handful of reasons.
|
| That said, a lot of database designs should use composite
| primary keys. You can in part thank early MySQL which started
| the ball rolling on generally shitty database design in the
| late 90s.
| adrianmonk wrote:
| What you're talking about is a _natural key_ [1], which can be
| contrasted with a _surrogate key_ [2]. (Surrogate keys are also
| called "synthetic keys" or "factless keys".)
|
| So the question is whether natural keys are ever useful.
|
| I think the answer is yes, they are useful if the real world
| _already_ imposes a key on the stuff you are modeling in the
| database. In that case, it is already unique (as it needs to
| be), and generating your own key doesn 't accomplish anything
| (but is redundant).
|
| For example, suppose I'm creating a history database of who
| owned what internet domain names at which times. If my table is
| one-to-one with domains, I can use the domain name itself as
| the primary key.
|
| If not, I can perhaps use the domain name plus some other
| column as a composite key. For a table of ownership intervals,
| my primary key could be this: domain name + start date + end
| date.
|
| ---
|
| [1] https://en.wikipedia.org/wiki/Natural_key
|
| [2] https://en.wikipedia.org/wiki/Surrogate_key
| whatshisface wrote:
| > _For example, suppose I 'm creating a history database of
| who owned what internet domain names at which times. If my
| table is one-to-one with domains, I can use the domain name
| itself as the primary key._
|
| In your example, wouldn't that not work, because the same
| domain would be in the table twice with two different owners?
| adrianmonk wrote:
| You might have more than one table in the database. For
| example, you could have a table of current owners for quick
| lookup and a table of all owners for full history.
|
| For the current owners table, you'd use just the domain as
| a key. For all the full history table, you'd use domain +
| start date + end date as the key.
|
| So yeah, that makes it a weird example, but the point is
| that in neither table did you make up a synthetic key. In
| one table, you used a real-world fact (domain) as the key,
| and in the other table you used three real-world facts
| (domain plus two dates) together as a key.
| juped wrote:
| Synthetic primary keys like a unique integer ID assigned to a
| row are worse in theory - the id column isn't real data. They
| exist as a very common practical compromise for good reasons,
| but you should be aware you're compromising theory for
| practice.
| dragonwriter wrote:
| > I have never understood why RDBs have such general concepts
| of primary keys, where you can for example let date of birth be
| a primary key, when in every schema I have ever designed or
| seen, all rows get unique integer IDs anyway.
|
| Because sequential integer surrogate keys aren't always a good
| idea, e.g.:
|
| 1. Certain logically-unique join tables, where the foreign keys
| to the joined tables form an obvious composite primary key.
|
| 2. Cases with natural primary keys for basic data; A date might
| not be a good primary key if its the DoB in a table of people,
| but its an excellent primary key in a calendar table.
|
| 3. Even when you need surrogate keys, sometimes you need them
| to be able to be generated in a distributed manner, so you want
| something like UUID/ULID instead of sequential integers.
| brazzy wrote:
| Since when is "date of birth" a distinct type?
|
| And "date" definitely _does_ make sense as a primary key when
| you have a table that has conceptually at most one entry per
| day. For example, a travel journal. Or a time sheet
| application.
| cyberbanjo wrote:
| What about people with the same birthdays?
| wfriesen wrote:
| Business/requirements/users will often tell you that no two
| users will have the same birthday/SSN/whatever, but so often
| that eventually proves incorrect. Just using an auto
| increment ID as the primary key can prevent a lot of pain
| down the road.
| lifepillar wrote:
| So, when someone comes to you and identifies herself by
| providing her SSN, name, date of birth, how do you know
| whether she corresponds to the record with id M or to the
| record with id N? Or, more likely, to one of the N records,
| with N>2, with that SSN, name, date of birth?
| jraph wrote:
| What about them? Do they really exist, practically speaking?
|
| At a quick glance on my databases, I see nothing of the sort.
| Kranar wrote:
| Are we seriously discussing whether in practice, two people
| can be born on the same day?
| jraph wrote:
| This was an absurd joke. A person looking at their
| database using birth dates as a primary key can obviously
| only see people with different birth dates.
| lkuty wrote:
| https://en.wikipedia.org/wiki/Birthday_problem
| jraph wrote:
| Is there anything comparable to the birthday paradox,
| which includes years, by the way?
|
| I would guess that you would need a big group of random
| people such that there is a high chance of having two
| people born the very same day.
| whatshisface wrote:
| The birthday paradox threshold if years are included
| would be 23 people per year, rather than 23 people in
| your database. What that translates to in your database
| depends a lot the age demography of your userbase. If
| people live to be 100, and if I don't worry about the
| fact that some years will have more people than others
| even if they are randomly distributed (something that
| should go in to the model), that's about 2,300 people
| before you start expecting to see duplicate birthdays.
| cyberbanjo wrote:
| I don't think the birthday problem accounts for year, but
| yet it's clear some people share the same birthdate,
| twins for example.
| jcranmer wrote:
| I share the same name and birthdate as another individual.
| It has actually caused me to be held up at the border
| before.
| jraph wrote:
| What gave proof that you were not this individual in the
| end?
| ARandomerDude wrote:
| I think this was the parent's point. Since, as you point out,
| we we almost never use the data itself (e.g. date of birth)
| as the primary key, the parent is asking why is this a
| feature at all? Why not just mandate a uuid, number, etc. as
| the primary key?
| uDontKnowMe wrote:
| Well it's easy enough to think of a situation where you
| would use a date as a primary key, for example time series
| data.
| flint wrote:
| Why does he use the form: "A relation more or less correspond to"
| and "A relation have"? This doesn't seem to me to be unintended.
| tyingq wrote:
| He's somewhat consistent with that. "A relation...correspond",
| "relation refer to the", "an attribute correspond to".
|
| But diverges on that pattern for "domain": "a
| domain...corresponds".
| wcarss wrote:
| may also be english as a second+ language or just a frequent
| brainfart around pluralization, see the intro:
|
| > This post explains what 1NF actually means and is useful
| for, and debunk a number of the misunderstandings.
| ehw3 wrote:
| The Wikipedia entry on First Normal Form also uses this
| kind of construction, which I suspect means it was written
| by the same person.
|
| I have never seen terms like domain and attribute used as
| if they were plurals (the way "data" is), but even were
| that correct, "a domain have" would be contradictory, so I
| think it is just a linguistic quirk of the writer.
| whatshisface wrote:
| Level 0 data science: "We have several datapoints."
|
| Level 1 data science: "We have several data."
|
| Level 100 data science: "A relation have a fixed number of
| attributes (columns) and each column have a domain [...]."
| transitorykris wrote:
| The key, the whole key, and nothing but the key
| jbigelow76 wrote:
| So help me Codd.
___________________________________________________________________
(page generated 2021-06-18 23:01 UTC)