hngopher.com

       [HN Gopher] What does First Normal Form mean?
       ___________________________________________________________________
        
       What does First Normal Form mean?
        
       Author : goto11
       Score  : 67 points
       Date   : 2021-06-17 12:24 UTC (1 days ago)
        
 (HTM) web link (www.cargocultcode.com)
 (TXT) w3m dump (www.cargocultcode.com)
        
       | gfody wrote:
       | I think he contradicts himself and confuses the data with the
       | information. I don't think delimited values representing a table
       | in a column can be considered normalized.
       | 
       | > _Make no mistake, encoding multiple values in a single string
       | is generally a bad design. But it has nothing to do with first
       | normal form. First normal form mean a column should not allow
       | relations as values. A comma-separated string is still just a
       | single string from the perspective of the database type system._
        
         | mamcx wrote:
         | No, is not a contradiction.
         | 
         | Is like say "Is wrong to store a large json in a cell in an
         | array, that make it not an array".
         | 
         | A relation is A VALUE. That is.
         | 
         | The atomicity of the data INSIDE each "cell" is orthogonal to
         | the fact the that relation is a relation.
         | 
         | This is the point.
        
           | blacktriangle wrote:
           | For even more fun mind bending, there's absolutely nothing
           | that violates the relational model in having relation-valued
           | attributes. SQL doesn't support that, but that doesn't mean
           | its non-relational.
        
             | derefr wrote:
             | Re: DB support for relations as values -- while no DB I
             | know of supports full relations ( _sets_ of tuples) in
             | values, composite-typed (single-tuple) columns are a thing,
             | e.g. the on_hand example table in
             | https://www.postgresql.org/docs/current/rowtypes.html
             | 
             | Also, presuming DB support for some format like JSON, you
             | can _encode_ a relation into a JSON value, stick it into a
             | column and then, at query time, use a function like
             | Postgres 's json_to_recordset(json) to emit a relation (set
             | of tuples). Postgres doesn't know how to deal with it "as
             | is" (i.e. it forces you to treat the result of a function
             | as a named joined table in your query, rather than treating
             | it as an concrete value) but it does understand _where_ it
             | is in relation (heh) to the rest of the row, since you can
             | unnest() values from this virtual table and get the right
             | results out.
             | 
             | As awful as that sounds, we're actually doing that in
             | production to normalize data coming from redis_fdw (which
             | gives you, basically, (key text, value jsonb) rows for
             | Redis hashes) into relational shape.
        
             | fuckf4ce wrote:
             | Relation valued attributes are precisely the antithesis of
             | first normal form. That's like exactly what first normal
             | form is about.
        
           | [deleted]
        
           | brazzy wrote:
           | You're not really addressing the point of the comment you are
           | replying to, which is about what constitutes a _normalized_
           | relation, not what constitutes a relation.
           | 
           | And I agree, TFA is just flat out wrong.
           | 
           | A relation Manager with a CSV-valued column SubordinateIds or
           | columns named "Subordinate1Id" through "Subordinate9Id" is
           | _most definitely_ not normalized because those are in effect
           | a (bad) simulation of hierarchical-database-style repeating
           | groups. That they have distinct names (or positions in the
           | CSV list) is an artifact of the representation and not really
           | part of the data model.
        
             | mamcx wrote:
             | To be pedantic, is not a contradiction. Is a sign of
             | problem with the design? Probably! Most likely!
             | 
             | But the relational model CAN'T say it. The normal forms are
             | properties of the way you can do the design, but can't go
             | that Deep.
             | 
             | In other words: You can adhere 100% on SOLID OOP principles
             | and your code is still conceptually wrong!
             | 
             | If pay attention to the article, his issue is that people
             | latches on the apparent definition of 1st normal form, but
             | misunderstanding it and probably coming to the wrong
             | conclusion about how make the tables...
        
           | gfody wrote:
           | it's all a subtle misunderstanding of who's doing the
           | considering and what's being considered. the author wants to
           | say 1nf is of no concern to users of rdbms, that they can't
           | actually violate 1nf because the system doesn't violate 1nf -
           | however when users are considering normal form it's about the
           | information (the phone numbers) not the data (a string) and
           | in that sense the information is not normalized due to
           | violating first normal form. a different audience, say the
           | rdbms designers, can't have concerns about anything beyond
           | the string - they don't care what you stash in there it's
           | just data and in that sense the data is normalized but the
           | information is not.
           | 
           | as an aside I think a lot of this sprouts from subtle
           | ambiguity in the language. you really shouldn't just drop all
           | articles and expect your english to still make sense!
        
         | dragonwriter wrote:
         | > I think he contradicts himself and confuses the data with the
         | information.
         | 
         | I would say he is confusing the physical representation with
         | the logical model.
        
       | derefr wrote:
       | > A relation more or less correspond to a table
       | 
       | "More or less", but mostly less. Views are relations. The rowsets
       | returned by a SELECT query are relations. Etc.
       | 
       | Without going into what relational algebra is about, I find it a
       | lot more helpful to understand a "relation" like this:
       | 
       | A relational database is like a Prolog interpreter: it's a thing
       | that "knows" a bunch of "facts", because someone has "asserted"
       | (declared to be true) those facts to it.
       | 
       | In Prolog, facts look like this:
       | 
       | * red(MyCar).
       | 
       | * father(HarryPotter, JamesPotter).
       | 
       | * succ(1, 2).
       | 
       | Those look like functions, but what they really are, are tuples,
       | and moreover, tuples that are members of a _relation_. They
       | describe  / assert a particular relationship. The "function name"
       | at the beginning of a fact is the _label_ , an identifier for a
       | particular kind of relationship known to the system; it is then
       | followed by an N-tuple of values, describing the _details_ of
       | this particular instance of the relationship.
       | 
       | A mathematical "relation", then, can be understood as an
       | arbitrary set containing facts/relationships of a given _type_
       | (the  "type" being the predicate label, the tuple's arity, and
       | the domain types of the tuple's slots.)
       | 
       | Note that what this definition is _defining_ is, essentially, an
       | ADT. A relation is a container for relationships, that exposes
       | certain operations (relational algebra).
       | 
       | You can think of the set of "asserted" facts/relationships of
       | each kind, as being a special relation. A global data structure.
       | 
       | But remember that Prolog can _derive_ truth from predicate
       | functions (e.g. that  'succ' predicate above can be defined by
       | induction by defining a base-case fact and a function succ(N, N +
       | 1).) So, while the set of _asserted_ facts is _a_ relation
       | declaring which facts /relationships are "true", it's not the
       | _only_ source of truth.
       | 
       | -----
       | 
       | Now, going back to the world of relational databases:
       | 
       | SQL is a language for manipulating relations, just as APL is a
       | language for manipulating arrays.
       | 
       | Database tables _are_ relations! But they 're not the _only_
       | relations in the DB. Any  "rowset" (as Postgres terms it) that
       | you create or manipulate is a relation.
       | 
       | Every relation in an RDBMS (table, view, result rowset, etc.) has
       | a type. That type is the kind of its relation, as defined above.
       | It includes the arity and types of the tuple fields, yes, but
       | _also the label_.                   -- Prolog equivalent relation
       | type: foo(A, B).         CREATE TABLE foo (a text, b text).
       | 
       | When you're creating an RDBMS table with CREATE TABLE, and giving
       | a structural inline row-type definition, you're implicitly first
       | defining a relation type with the same _label_ as the table 's
       | _name_. The table itself is now just one relation of that type.
       | You can have others! (In fact, queries selecting * from that
       | table, are already other relations of the same type. However, if
       | you 're _projecting_ or _joining_ -- changing the shape of the
       | result-set -- then your result relation is of a new temporary
       | relation type.)
       | 
       | Perhaps surprisingly, you _can_ have two RDBMS tables that have
       | the same relation _type_ , which makes them both sources of truth
       | about the same _facts_. (Why? Partitions, for one.)
       | 
       | But two RDBMS tables aren't sources of truth about the same facts
       | just _because_ they share a shape. They need to actually be
       | defined as having the same relation type (including the label!),
       | to be considered interchangeable by the DB (for purposes of e.g.
       | defining stored procedures that operate on rows of a given type.)
       | 
       | You can't do this by just creating another table with the same
       | (structural inline) row-type definition. Those tables will have
       | different relation types, because those types are _carrying
       | around predicate labels_ , and those labels are different.
       | -- BAD: relation types are fathers_us(text, text) and
       | fathers_eu(text, text). Incompatible!         CREATE TABLE
       | fathers_us (child text, father text);         CREATE TABLE
       | fathers_eu (child text, father text);
       | 
       | Instead, you must either explicitly define a relation type using
       | CREATE TYPE; or reuse the type implicitly created from your first
       | CREATE TABLE, to create the second. Only then are the relation
       | labels the same.                   -- Single type: father(A, B).
       | CREATE TYPE father AS (child text, father text);              --
       | GOOD: same type father(A, B).         CREATE TABLE fathers_us OF
       | father;         CREATE TABLE fathers_eu OF father;
       | 
       | Why don't DBMSes use structural typing? Well, because facts can
       | have the same _structure_ but still not be facts about the same
       | thing:                   CREATE TABLE fathers(child text, parent
       | text);         CREATE TABLE mothers(child text, parent text);
       | 
       | A stored procedure that takes a fathers(text, text) row,
       | shouldn't necessarily be able to accept a mothers(text, text)
       | row!
        
         | wizzwizz4 wrote:
         | > _A stored procedure that takes a fathers(text, text) row,
         | shouldn 't necessarily be able to accept a mothers(text, text)
         | row!_
         | 
         | Of course, in the _real_ world, you shouldn 't make such a
         | distinction. People are always messier than your database
         | schema.
        
       | pjungwir wrote:
       | Codd's paper is not that hard. [0] is a PDF copy. One of the pros
       | to reading primary sources is you then know "the whole story":
       | what if anything is missing from popular explanations, and what
       | is changed.
       | 
       | A striking bit of trivia (but OT to SQL) is that in Turing's "On
       | Computable Numbers" [1] he treats a Turing Machine that halts as
       | having a problem, whereas in basically every popular treatment a
       | machine that halts is one that gives you an answer, and a machine
       | that doesn't halt is one with a bug, like it entered an infinite
       | loop. But his paper is the opposite (and doesn't use the word
       | "halt" at all!):                   If a computing machine never
       | writes down more than a finite number of symbols         of the
       | first kind it will be called circular. Otherwise it is said to be
       | circle-free.              A machine will be circular if it
       | reaches a configuration from which there is no possible
       | move, or if it goes on moving, and possibly printing symbols of
       | the second kind,         but cannot print any more symbols of the
       | first kind.
       | 
       | More relevantly, Codd's paper from ten years later [2] is even
       | more interesting than the first one. Everyone focuses on the
       | first half (discussing NULLs and outer joins) but the second half
       | is way more interesting, and has never been implemented anywhere
       | AFAIK. It sounds a lot like graph databases and queries that
       | dynamically change structure based on querying the data catalog.
       | 
       | Re relations containing relations, I was very surprised that C.J.
       | Date is apparently okay with them. He is a perfect example of the
       | disciple who is more zealous than the master, and has spent his
       | career trying to remove NULLs and non-unique rows. But in page
       | 17-18 of [3] he writes:                   To repeat, any type
       | whatsoever---apart from the two exceptions noted in the
       | section "Types" earlier---can be used as the basis for defining
       | attributes of         relations. In particular, relation types
       | can be used for this purpose. (So too can         tuple types.)
       | Thus, attributes can be *relation valued*, meaning we can have
       | relations         with attributes whose values are relations in
       | turn.
       | 
       | Wow! Part of me wonders if he is really just compromising because
       | relations-inside-relations is so useful for his goals in that
       | book. I had to re-read that passage over and over to convince
       | myself I wasn't misunderstanding.
       | 
       | [0] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
       | 
       | [1] https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf
       | 
       | [2]
       | https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...
       | 
       | [3] C.J. Date et al, _Time and Relational Theory_
        
         | mmarx wrote:
         | > A striking bit of trivia (but OT to SQL) is that in Turing's
         | "On Computable Numbers" [1] he treats a Turing Machine that
         | halts as having a problem, whereas in basically every popular
         | treatment a machine that halts is one that gives you an answer,
         | and a machine that doesn't halt is one with a bug, like it
         | entered an infinite loop. But his paper is the opposite (and
         | doesn't use the word "halt" at all!):
         | 
         | Note that his machines compute real numbers, which always have
         | a an infinitely long binary representation (with possibly
         | infinitely many trailing zeroes), whereas usually one considers
         | Turing machines computing natural numbers, which always have a
         | finite binary representation.
        
           | pjungwir wrote:
           | Yes, it makes sense. But it sure took me by surprise when I
           | finally read the original paper.
        
         | pjungwir wrote:
         | I looked up Date's two exceptions. The first is uninteresting
         | ---databases cannot contain pointers---but the second is sort
         | of relevant:
         | 
         | > If a relation _r_ has heading _H_ , then no attribute of _r_
         | can be defined in terms of a relation or tuple type that has
         | the same heading _H_ , at any level of nesting.
         | 
         | In other words, relations can nest, but not recurse.
        
       | crazygringo wrote:
       | Sorry but I think this author is being more pedantic rather than
       | helpful, and actually seems to miss the point entirely. The crux
       | here is two things the author claims:
       | 
       | > _Since SQL does not allow creating or using nested tables, most
       | relational databases will be in first normal form by necessity._
       | 
       | > _Many explanations (including Wikipedia) uses the example of a
       | comma-separated string with phone numbers as an example of a 1NF
       | violation... Make no mistake, encoding multiple values in a
       | single string is generally a bad design. But it has nothing to do
       | with first normal form. First normal form mean a column should
       | not allow relations as values. A comma-separated string is still
       | just a single string from the perspective of the database type
       | system._
       | 
       | I think it's rather obvious that the example here is nothing to
       | do with _literal_ commas. But that storing multiple comma-
       | delimited phone numbers in a single field _is precisely_ creating
       | a nested table _in spirit_.
       | 
       | It has nothing to do with SQL support for nested tables or the
       | database type system -- it's not about _formal_ definitions. It
       | 's about whether you're _conceptually_ putting multiple pieces of
       | data into a single field. Which has _everything_ to do with first
       | normal form.
       | 
       | Because if you put comma-delimited phone numbers into a field,
       | you _lose the ability to relate and join those phone numbers
       | individually to another table_. Which is bad, and the whole
       | concept of  "1NF" exists _to give a name to this badness_.
       | 
       | And this isn't a minor point. It's actually extremely important,
       | because the common practice of throwing JSON data into a string
       | field destroys the ability to operate relationally on the data
       | within it.
       | 
       | Now of course, plenty of databases these days actually _do_
       | provide ways to index and join on data using a new  "JSON" type
       | column -- which _is_ yet another form of the  "nested table" the
       | author claims doesn't exist. Which is great for combining ease of
       | use with performance, but certainly _isn 't_ 1NF, and basically
       | explodes (or at least dramatically complicates) the fundamental
       | concept of relational databases as tables of rows and columns.
        
         | zozbot234 wrote:
         | > Because if you put comma-delimited phone numbers into a
         | field, you lose the ability to relate and join those phone
         | numbers individually to another table. Which is bad
         | 
         | It's bad _if you expect that ability_ in the first place! If
         | the data you 're storing (e.g. a list of phone numbers) will
         | only ever be treated as a unified blob without any finer
         | senantics, storing it as a single attribute is not a 1NF
         | violation.
        
           | mannykannot wrote:
           | Among the principal tenets of Codd's database model (which is
           | _not_ intended to be a generalized model of relations) are
           | that the database scheme should not introduce either
           | accidental relationships or accidental constraints.
           | 
           | If we represent a one-to-many relationship by making a CSV of
           | the elements of the "many" side, then, for a given X standing
           | in this relationship to A, B and C:
           | 
           | 1) We have introduced an ordering that may not be relevant -
           | is "A,B,C" the same as "B,A,C"?
           | 
           | 2) We cannot find out whether X has this relationship to B
           | without parsing the CSV, or which entities B stands in this
           | relationship to.
           | 
           | 3) We cannot add or remove instances of this relationship to
           | X without updating the whole relationship.
           | 
           | ...and so on.
           | 
           | All these are accidental features of the data representation
           | that are not inherent to the semantic domain.
           | 
           | So, when you write "It's bad if you expect that ability in
           | the first place!", you are essentially saying "if we put
           | aside the principles of Codd's model and accept the very
           | problems it was expressly intended to mitigate..."
           | 
           | You can certainly use a relational database to store
           | information in ways that preserve the sorts of problems they
           | were intended to solve - and there will be cases where it
           | does not matter. The normal forms are guidelines for using
           | relational databases as intended, to minimize these problems,
           | when it might matter.
        
           | crazygringo wrote:
           | Technically yes, but this is where you have to use common
           | sense.
           | 
           | The assumption that something "will only ever be treated as a
           | unified blob" can turn out to be a real footgun -- suddenly
           | you need to locate all users who share a phone number and,
           | whoops, you can't.
           | 
           | Nobody's saying strings should be split into individual
           | characters, or small JPEG thumbnail blobs ought to be
           | decomposed into fields for each header.
           | 
           | But if you're dealing with obvious common-sense pieces of
           | data like phone numbers? That experience ought to tell you
           | have a reasonable chance of needing to be joined relationally
           | at some point? Then the smart thing to do is to interpret
           | that as a 1NF violation.
        
       | painchoc wrote:
       | What about NULLs ? Is it allowed or not ? I recently discovered
       | on Wikipedia https://en.wikipedia.org/wiki/First_normal_form that
       | it should not be.
        
       | hobs wrote:
       | A clear and concise explanation - well done and rare for the
       | topic.
        
       | whatshisface wrote:
       | I have never understood why RDBs have such general concepts of
       | primary keys, where you can for example let date of birth be a
       | primary key, when in every schema I have ever designed or seen,
       | all rows get unique integer IDs anyway.
        
         | projektfu wrote:
         | In one sense, the unique integer is often not real data. It's
         | synthetic and has no meaning outside the application.
         | 
         | I find it easier to understand when I think that Codd and other
         | database pioneers were looking at creating universal databases,
         | not the application-specific ones we tend to use.
         | 
         | Furthermore, the use of auto numbering tends to make the
         | queries easier to write but the tables less meaningful. For
         | example, INVOICE might have a sequential invoice id that is
         | real information, as well as a reference to customer, billing
         | address, shipping address, etc. INVOICE_ITEM will have a
         | foreign key to an invoice, unique number, position, stock item
         | id (second foreign key), description, etc. PACKING_LIST might
         | have a foreign key to INVOICE_ITEM. In my opinion, I prefer to
         | see INVOICE_ID as part of the primary key of INVOICE_ITEM. I
         | like being able to see the relationship in foreign keys and
         | things like that.
         | 
         | These things can be a heated topic and other designers probably
         | think I also like to murder kittens. But the data layout is, I
         | think, what Codd was going for.
        
         | hknapp wrote:
         | I worked at a place that used decimals that auto incremented by
         | 2
        
         | mamcx wrote:
         | > when in every schema I have ever designed or seen, all rows
         | get unique integer IDs anyway.
         | 
         | This is artifact of 4 things:
         | 
         | - Indexed Integer search are fast in RDBMS
         | 
         | - Having an non-bussines related value for identify a row could
         | save you from making triggers for updated in other relations if
         | _it change_. This is the part a lot of folks missed: If you
         | already have a natural PK and it not change, is wasteful add
         | ANOTHER (is another index btw)
         | 
         | - Is enforced by ORMS that are badly designed.
         | 
         | - Most people have not idea how model a database, and this one
         | little thing is one of the easier "fix" you can think of!
         | 
         | I have done a lot "non integer primary keys" before (and not
         | GUIds!) and is useful that RDBMS are not like mongo with a
         | fixed schema (yes! Mongo is a single fixed-schema for all your
         | data!).
         | 
         | For reporting, store aggregates, pre-compute values, etc is
         | very valuable!
         | 
         | For example, in accounting you do reports per day/mont/year.
         | 
         | You then do granularity per day, and the another per month, per
         | year, per semestre, etc. When the data volume is high, making
         | tables "days, months, years" make a lot of sense.
         | 
         | Cool guys call this a "time-series database".
         | 
         | But the rdbms CAN model it!
         | 
         | And can model
         | 
         | - "document database"
         | 
         | - "key-value database"
         | 
         | - "olap database"
         | 
         | - "columnar database"
         | 
         | etc.
         | 
         | (you go for specialized backend for performance but honestly?
         | Most of the time is a mistake if them are not relational.
         | 
         | "Relational" don't means "is backed by b-trees, is row oriented
         | and use SQL")
        
           | whatshisface wrote:
           | A time series database isn't keyed by day/month/year, it's
           | typically keyed by monotonic timestamps.
        
             | mamcx wrote:
             | Yes, that is even a _more narrowing_ detail. (people that
             | use a  "time-series" rarely need that high of granularity!)
             | 
             | But the question was about wondering "why exist the support
             | for PK that are not integers/guis, like dates"?.
             | 
             | To be able to model a lot of things for what in the past
             | (before RDBMS) and today (for upstarts engines) you will be
             | tempted to add another data store (that rarely is that
             | necessary).
             | 
             | I know by a fact that some times is because people not
             | realize you DON'T need to constrain your db to things like
             | "only integers pks!"!
             | 
             | P.D: I work in the enterprise sector. Is fun when something
             | is composed of a cluster of disparate things purely by
             | misunderstanding you could have solved it with a simple db,
             | well done. Contrary to belief, a lot of my customers, their
             | largest databases are 1-10GB each. A rdbms can happily go
             | much much larger than that, if you give them a little
             | love...
        
           | jrumbut wrote:
           | It's wonderful what you can gain in performance, correctness,
           | and storage efficiency from finding the real primary key from
           | the data model itself.
           | 
           | Maybe not worth the intensive design work for many tables,
           | and it can be a risk to get it wrong, but it's cool when it
           | does work.
        
             | brazzy wrote:
             | Yeah, I'm currently facing exactly that dilemma. I have to
             | model a database of machine parts.
             | 
             | There is a natural PK in the form of the supplier's parts
             | number, but the legacy system (which uses that as PK) notes
             | that it is specifically the parts number _as used in the
             | supplier 's ERP system_, but suppliers often use
             | differently formatted parts numers in publications such as
             | catalogs (adding or removing spaces, slashes or dashes). So
             | the legacy system also has the "published part number" as
             | one or more separate fields.
             | 
             | This inconsistency makes me lean very strongly towards
             | using a synthetic PK instead.
        
               | bob1029 wrote:
               | We deal with a lot of crap where a certain business
               | entity has several different biz keys that can reasonably
               | be used to uniquely identify the thing.
               | 
               | I would concur with adding a synthetic primary key, and
               | maintaining all of the relevant business/domain keys on
               | the same type.
               | 
               | The advantage of this approach is that you can still keep
               | 100% correct relations between your internal types, even
               | if the domain keying scheme is fucked up or otherwise
               | broken per your original assumptions. It's really easy to
               | add a new column to refine the additional domain keys.
               | Fixing a bad PK and all the things that talk to it is
               | much harder.
        
         | turbocon wrote:
         | It's rare in practice but occasionally rows have natural unique
         | identifiers that are useable.
         | 
         | The textbook case would be student IDs in a university database
         | where each row corresponds to a single student. However some
         | universities view the student id as sensitive information (kind
         | of like a ssn in the US) and so in that scenario it should be
         | aliased to some other key to prevent the student id from being
         | present in multiple tables as a foreign key.
        
           | dragonwriter wrote:
           | > The textbook case would be student IDs in a university
           | database where each row corresponds to a single student
           | 
           | "Student IDs" are an example of surrogate keys, though they
           | may be surrogate keys in a system predating and outside of
           | the DB.
           | 
           | (And if the mapping between them and actual students are
           | managed by an error-prone process outside of the DB, they
           | probably aren't good primary keys for a table of students.)
        
           | TeMPOraL wrote:
           | > _However some universities view the student id as sensitive
           | information (kind of like a ssn in the US) and so in that
           | scenario it should be aliased to some other key to prevent
           | the student id from being present in multiple tables as a
           | foreign key._
           | 
           | It is ironic, given that the whole reason to have a student
           | ID is for it to be a primary key in some database somewhere.
        
             | whatshisface wrote:
             | Databases are vulnerable to a particular type of folkloric
             | magic:
             | 
             | https://en.m.wikipedia.org/wiki/True_name
        
           | jameshart wrote:
           | Student IDs are certainly not 'natural' keys.
           | 
           | I am almost entirely convinced that there is actually no such
           | thing as a natural key.
        
             | dragonwriter wrote:
             | Dates are natural keys in a calendar table. The composite
             | of the foreign key values is a natural key in a join table
             | expressing a unique relationship between entities in two or
             | more other tables, perhaps with additional attributes, even
             | if those values are surrogate keys in their own tables.
        
             | mulmen wrote:
             | Natural keys are everywhere. They are just less interesting
             | than the relationships. My name is a natural key. It
             | uniquely identifies my name. The wheels come off if you
             | expect it to uniquely identify _me_. But with natural keys
             | that 's never actually necessary.
             | 
             | When I buy a car they record my name, my date of birth, the
             | time of the sale and some details about the car, such as
             | the VIN, which is its name. That provides enough
             | information to form a natural key to identify the _sale_.
             | 
             | At no point does a database actually have a concept of who
             | I am really, only the relevant data points to satisfy its
             | own relational model.
             | 
             | Any database that means to track information about people
             | would be able to construct a natural key from my name, my
             | mothers name and some information about my birth, such as
             | the doctor, hospital and time.
        
             | CharlesW wrote:
             | > _I am almost entirely convinced that there is actually no
             | such thing as a natural key._
             | 
             | Yes! Or if it is _now_ , there's a future scenario where
             | that will have proven to have been a poor decision.
        
             | wolfgang42 wrote:
             | a "natural key" is frequently just a _really_ foreign key
             | in a database you and your org don 't manage. -- 'wweston,
             | https://news.ycombinator.com/item?id=27349246
        
           | wombatpm wrote:
           | In the 1980's your student ID was most likely your SSN
        
         | dnlhg wrote:
         | Codd was pretty adamant in his original papers that keys should
         | be made up from data rather than a unique integer, and his
         | example of converting from a hierarchical database schemas to a
         | relational database schema shows how compound, natural keys are
         | important to represent relationships. But we don't usually do
         | it that way today, partly because it isn't that practical with
         | our current databases and ORMs. It has some theoretical
         | advantages and disadvantages, but using IDs seems to work well
         | enough that we're not investing in a new paradigm.
        
           | ARandumGuy wrote:
           | In practice, using a unique IDs is often not ideal, but it
           | also rarely the wrong choice. I've personally never heard of
           | a scenario where a unique ID caused difficult to fix issues
           | (although I'm sure it's possible).
           | 
           | In contrast, a poorly though out natural key can cause a lot
           | of headaches. This mostly occurs when your natural key isn't
           | as unique as you thought, or if you have to change what you
           | use as a natural key. The real world does not like to conform
           | to our database constraints.
        
           | wwweston wrote:
           | Yeah, I think the early proponents of relational models had a
           | lot to do with setting the culture here.
           | 
           | I suspect that some of this has to do with an insight I
           | recently discovered: a "natural" key is in some sense a key
           | that's so foreign it's in a "database" you don't control. And
           | if you establish a culture that converges on natural keys,
           | you may end up with more easily cross-operable
           | databases/collections as an asset.
           | 
           | And this is why I think it's probably dead. If the boring
           | practicality of generated identifiers didn't do it, laws like
           | the CCPA which are going to, because they recognized the
           | potential of broad cross-operable data to be a liability and
           | a hazard vs an asset.
        
             | whatshisface wrote:
             | Those "natural" databases as a rule are going to be full of
             | duplicates and sometimes even inconsistencies, hence the
             | need for IDs that only exist for the purpose of expressing
             | relationships within your own database. Basically,
             | pointers.
        
         | mmcgaha wrote:
         | I completely agree with you, yet I design most of my schemas
         | with integer keys anyway. The only thing I do different than
         | most folks is that I make a unique constraint/index for
         | candidate keys (the proper primary key) because otherwise we
         | have lost the ability to ensure data consistency.
         | 
         | As for why I use integer keys, it has always been the
         | convention everywhere I go so I don't want to break convention.
         | 
         | Now the only convention that I would rather see would be to use
         | GUIDs instead of integers because I hate that I can join an
         | item table to an employee table and get results back.
        
         | dylan604 wrote:
         | How would DoB work as a primary key? A primary key is useless
         | if not unique, right? And I know many people born on the same
         | day that aren't even twins. Incrementing integers are nice and
         | easy for humans to consume. Using GUIDs is far more portable
         | than unique integers, but damn near impossible for humans to
         | deal.
        
           | whatshisface wrote:
           | They wouldn't, so my point is why do RDBMSs support so many
           | kinds of primary keys?
        
             | dylan604 wrote:
             | Keys can span multiple columns, so DoB could be a
             | differentiator if First, Last are also the same. I
             | personally would never try to build a key like that though,
             | but that's the only way I could see justifying its use.
             | 
             | If you can't see a why for uses of a field that is truly
             | more unique, then I'd suggest you just need some more
             | imagination. Strictly not allowing it seems to me to be
             | short sighted just because someone had the ill conceived
             | idea to use DoB as a primary key
        
               | whatshisface wrote:
               | Can you think of any good use case for a non-arbitrary
               | primary key? You are right that first name + last name +
               | dob is not a good way to design the system. For example
               | someone could differentiate their twin children only by
               | their middle name. So, what's an example of a non-integer
               | non-arbitrary primary key?
        
               | dylan604 wrote:
               | Not sure exactly what you mean by non-arbitrary primary
               | key, but...
               | 
               | GUID/UUID
               | 
               | FeatureName/ReleaseYear or Series/Season/Episode
               | 
               | email
               | 
               | phone-number
               | 
               | SSN
               | 
               | driverslicense#/State
        
         | fuckf4ce wrote:
         | Avoiding "natural keys" has little to do with relational
         | databases or the model. Consider that the Social Security
         | Number far predates relational databases.
         | 
         | The reason for using surrogate keys is not some great
         | conspiracy or mass incompetence. Simply, there are very few
         | cases where natural keys work well.
         | 
         | Good keys are generally stable, globally unique, not too big,
         | and it turns out that these qualities are extremely rare in a
         | natural key in practice. Furthermore, composite keys over more
         | than 2 or 3 attributes become terribly unwieldy (and
         | intuitively, composite key size and stability are inversely
         | related).
         | 
         | What is a good natural key for an individual? There are none!
         | The SSA figured this out in the 20s, this was not a modern tech
         | problem.
         | 
         | Even social security numbers can't be stable, but they are far
         | better than things like names or addresses because they change
         | for only a handful of reasons.
         | 
         | That said, a lot of database designs should use composite
         | primary keys. You can in part thank early MySQL which started
         | the ball rolling on generally shitty database design in the
         | late 90s.
        
         | adrianmonk wrote:
         | What you're talking about is a _natural key_ [1], which can be
         | contrasted with a _surrogate key_ [2]. (Surrogate keys are also
         | called "synthetic keys" or "factless keys".)
         | 
         | So the question is whether natural keys are ever useful.
         | 
         | I think the answer is yes, they are useful if the real world
         | _already_ imposes a key on the stuff you are modeling in the
         | database. In that case, it is already unique (as it needs to
         | be), and generating your own key doesn 't accomplish anything
         | (but is redundant).
         | 
         | For example, suppose I'm creating a history database of who
         | owned what internet domain names at which times. If my table is
         | one-to-one with domains, I can use the domain name itself as
         | the primary key.
         | 
         | If not, I can perhaps use the domain name plus some other
         | column as a composite key. For a table of ownership intervals,
         | my primary key could be this: domain name + start date + end
         | date.
         | 
         | ---
         | 
         | [1] https://en.wikipedia.org/wiki/Natural_key
         | 
         | [2] https://en.wikipedia.org/wiki/Surrogate_key
        
           | whatshisface wrote:
           | > _For example, suppose I 'm creating a history database of
           | who owned what internet domain names at which times. If my
           | table is one-to-one with domains, I can use the domain name
           | itself as the primary key._
           | 
           | In your example, wouldn't that not work, because the same
           | domain would be in the table twice with two different owners?
        
             | adrianmonk wrote:
             | You might have more than one table in the database. For
             | example, you could have a table of current owners for quick
             | lookup and a table of all owners for full history.
             | 
             | For the current owners table, you'd use just the domain as
             | a key. For all the full history table, you'd use domain +
             | start date + end date as the key.
             | 
             | So yeah, that makes it a weird example, but the point is
             | that in neither table did you make up a synthetic key. In
             | one table, you used a real-world fact (domain) as the key,
             | and in the other table you used three real-world facts
             | (domain plus two dates) together as a key.
        
         | juped wrote:
         | Synthetic primary keys like a unique integer ID assigned to a
         | row are worse in theory - the id column isn't real data. They
         | exist as a very common practical compromise for good reasons,
         | but you should be aware you're compromising theory for
         | practice.
        
         | dragonwriter wrote:
         | > I have never understood why RDBs have such general concepts
         | of primary keys, where you can for example let date of birth be
         | a primary key, when in every schema I have ever designed or
         | seen, all rows get unique integer IDs anyway.
         | 
         | Because sequential integer surrogate keys aren't always a good
         | idea, e.g.:
         | 
         | 1. Certain logically-unique join tables, where the foreign keys
         | to the joined tables form an obvious composite primary key.
         | 
         | 2. Cases with natural primary keys for basic data; A date might
         | not be a good primary key if its the DoB in a table of people,
         | but its an excellent primary key in a calendar table.
         | 
         | 3. Even when you need surrogate keys, sometimes you need them
         | to be able to be generated in a distributed manner, so you want
         | something like UUID/ULID instead of sequential integers.
        
         | brazzy wrote:
         | Since when is "date of birth" a distinct type?
         | 
         | And "date" definitely _does_ make sense as a primary key when
         | you have a table that has conceptually at most one entry per
         | day. For example, a travel journal. Or a time sheet
         | application.
        
         | cyberbanjo wrote:
         | What about people with the same birthdays?
        
           | wfriesen wrote:
           | Business/requirements/users will often tell you that no two
           | users will have the same birthday/SSN/whatever, but so often
           | that eventually proves incorrect. Just using an auto
           | increment ID as the primary key can prevent a lot of pain
           | down the road.
        
             | lifepillar wrote:
             | So, when someone comes to you and identifies herself by
             | providing her SSN, name, date of birth, how do you know
             | whether she corresponds to the record with id M or to the
             | record with id N? Or, more likely, to one of the N records,
             | with N>2, with that SSN, name, date of birth?
        
           | jraph wrote:
           | What about them? Do they really exist, practically speaking?
           | 
           | At a quick glance on my databases, I see nothing of the sort.
        
             | Kranar wrote:
             | Are we seriously discussing whether in practice, two people
             | can be born on the same day?
        
               | jraph wrote:
               | This was an absurd joke. A person looking at their
               | database using birth dates as a primary key can obviously
               | only see people with different birth dates.
        
             | lkuty wrote:
             | https://en.wikipedia.org/wiki/Birthday_problem
        
               | jraph wrote:
               | Is there anything comparable to the birthday paradox,
               | which includes years, by the way?
               | 
               | I would guess that you would need a big group of random
               | people such that there is a high chance of having two
               | people born the very same day.
        
               | whatshisface wrote:
               | The birthday paradox threshold if years are included
               | would be 23 people per year, rather than 23 people in
               | your database. What that translates to in your database
               | depends a lot the age demography of your userbase. If
               | people live to be 100, and if I don't worry about the
               | fact that some years will have more people than others
               | even if they are randomly distributed (something that
               | should go in to the model), that's about 2,300 people
               | before you start expecting to see duplicate birthdays.
        
               | cyberbanjo wrote:
               | I don't think the birthday problem accounts for year, but
               | yet it's clear some people share the same birthdate,
               | twins for example.
        
             | jcranmer wrote:
             | I share the same name and birthdate as another individual.
             | It has actually caused me to be held up at the border
             | before.
        
               | jraph wrote:
               | What gave proof that you were not this individual in the
               | end?
        
           | ARandomerDude wrote:
           | I think this was the parent's point. Since, as you point out,
           | we we almost never use the data itself (e.g. date of birth)
           | as the primary key, the parent is asking why is this a
           | feature at all? Why not just mandate a uuid, number, etc. as
           | the primary key?
        
             | uDontKnowMe wrote:
             | Well it's easy enough to think of a situation where you
             | would use a date as a primary key, for example time series
             | data.
        
       | flint wrote:
       | Why does he use the form: "A relation more or less correspond to"
       | and "A relation have"? This doesn't seem to me to be unintended.
        
         | tyingq wrote:
         | He's somewhat consistent with that. "A relation...correspond",
         | "relation refer to the", "an attribute correspond to".
         | 
         | But diverges on that pattern for "domain": "a
         | domain...corresponds".
        
           | wcarss wrote:
           | may also be english as a second+ language or just a frequent
           | brainfart around pluralization, see the intro:
           | 
           | > This post explains what 1NF actually means and is useful
           | for, and debunk a number of the misunderstandings.
        
             | ehw3 wrote:
             | The Wikipedia entry on First Normal Form also uses this
             | kind of construction, which I suspect means it was written
             | by the same person.
             | 
             | I have never seen terms like domain and attribute used as
             | if they were plurals (the way "data" is), but even were
             | that correct, "a domain have" would be contradictory, so I
             | think it is just a linguistic quirk of the writer.
        
           | whatshisface wrote:
           | Level 0 data science: "We have several datapoints."
           | 
           | Level 1 data science: "We have several data."
           | 
           | Level 100 data science: "A relation have a fixed number of
           | attributes (columns) and each column have a domain [...]."
        
       | transitorykris wrote:
       | The key, the whole key, and nothing but the key
        
         | jbigelow76 wrote:
         | So help me Codd.
        
       ___________________________________________________________________
       (page generated 2021-06-18 23:01 UTC)