[HN Gopher] Unnatural Keys - Nature doesn't come with identifiers
___________________________________________________________________
Unnatural Keys - Nature doesn't come with identifiers
Author : thunderbong
Score : 163 points
Date : 2023-05-28 05:54 UTC (17 hours ago)
(HTM) web link (matt-schellhas.medium.com)
(TXT) w3m dump (matt-schellhas.medium.com)
| 36097082 wrote:
| [flagged]
| cubefox wrote:
| "To keep reading this story, get the free app or log in."
| praptak wrote:
| https://en.m.wikipedia.org/wiki/Identity_(philosophy)
|
| For when you want to think real deep about what the identifier
| identifies.
| fatneckbeard wrote:
| or whether it identifies
|
| https://en.wikipedia.org/wiki/Philosophical_skepticism#Epist...
| jeff-davis wrote:
| "There are no real world natural keys -- only someone else's
| surrogate key... SSN, VIN, email address, UPC, tax numbers,
| country codes, language codes, URIs... they're all just someone
| else's made up identifier."
|
| Every time this topic comes up, there are major definitional
| problems, and it seems like either surrogate keys or natural keys
| get defined out of existence.
|
| The author is drawing the line such that natural keys are
| surrogate keys that are out of your control, and surrogate keys
| are in your control. That doesn't seem to match the definition
| from wikipedia: https://en.wikipedia.org/wiki/Surrogate_key
|
| I agree with the author that control over the ID is an important
| distinction, but if you're going to make up a new ID, why not
| just say so? "Surrogate key" just introduces confusion -- it
| makes it sound like something only programmers and database
| admins care about, which is definitely not true. If customers are
| seeing it, everyone in the organization should care that it's in
| a recognizable pattern (so you don't mix up different IDs from
| different places) and memorable and easy to understand when
| spoken.
| swayvil wrote:
| one might even say that all form is interpretation. Or that all
| facts are assertions.
| tingletech wrote:
| If you are mapping to a series of bits, then a checksum of some
| sort seems like a fine identifier to a specific file.
|
| But something like "song", minting ARKs[0] for FRBR[1]-esq
| Work/Expression/Manifestation records would be my first instinct,
| not a natural key. The Manifestation record could have the
| checksums of the files.
|
| [0] https://arks.org
|
| [1]
| https://en.wikipedia.org/wiki/Functional_Requirements_for_Bi...
| [1]
| 8organicbits wrote:
| FRBR makes a lot of sense. There's some really good insights
| from library sciences that folks in computer science aught to
| spend some time studying.
|
| [edit] removed a statement about an ARK URL not loading, it
| loads fine now. Temp issue on my end?
| blooalien wrote:
| I agree with the substance of your comment re; FRBR, but ...
| typo maybe? [1]
|
| [1] https://grammarist.com/homophones/ought-vs-aught/
| karmakaze wrote:
| I deal with inventory systems and there is a difference between
| your system's key and a 'natural' one--the natural one existed
| first. The good answer is to support both and use either as
| appropriate. There are many ways of ID'ing an item, usually a SKU
| (literally stock keeping unit) is used. The owner of the items
| can choose to use an existing one (manufacturer, UPC, etc) but
| more commonly assigns their own so that interchangeable items
| with different manufacturer codes can share a SKU. Ideally a
| system should support as many methods of ID'ing as the user deems
| useful.
|
| If you do use surrogate keys everywhere, avoid exposing a one if
| a pair of existing ones will do. e.g. rows of an association
| table, unless of course that association has attributes and is a
| well-known thing to itself.
|
| Making every user of your systems always use system generated
| identifiers (where external ones already in use commonly exist)
| is kind of like a file-system that only uses tmp-generated names
| and the user has to do external book-keeping to remember which is
| which.
| fwlr wrote:
| Nature doesn't come with identifiers; that is why Apollo gave us
| UUIDs. And always remember, the goal of a database is _not_ to
| model the domain - it is to model your knowledge of the domain.
| karmakaze wrote:
| Ironically DNA is an identifier and nature does use identifiers
| like facial recognition, etc--it only happens not to be text-
| based nor 100% unique/accurate.
| fanf2 wrote:
| DNA is not an identifier
| https://en.wikipedia.org/wiki/Chimera_%28genetics%29
| Anduia wrote:
| https://archive.ph/vuXrb
| praptak wrote:
| Slap a synthetic key on everything, you probably overestimate how
| bad it is to have multiple records for "the same thing".
|
| Even if it is indeed bad, it's probably unrealistic to expect to
| never let it slip on entry (however good the natural keys might
| be). Maybe it makes more sense to incorporate deduplication into
| the design right from the start.
| jack_squat wrote:
| When you use a surrogate key, you should still put a unique
| constraint on an alternate key. Multiple records for the same
| thing is a fundamental problem in a database.
| ur-whale wrote:
| > we want a database of all of the songs in the world so that we
| can properly identify unknown songs and provide attribution so
| that folks can get paid appropriately. It is a noble goal
|
| That it is a noble goal is this internet rando's own opinion, and
| there certainly is no consensus on that.
|
| Inflicting it as a fait-accompli to his readership does not bode
| well for the rest of the article
| TeMPOraL wrote:
| This. I'm actually surprised the linked Wikipedia article on
| "Natural key" only mentions one potential disadvantage, that
| _somewhen in the future, maybe, at some point, somehow_ , the
| other system may change, and then your keys are bust. They even
| give US SSNs as an example of something that may stop working as
| unique key _in the future_.
|
| There are no "natural keys". They don't exist in nature. They
| don't exist as a thing _in our physical reality_. And when it
| comes to human-assigned identifiers, it should be emphasized much
| more strongly that whatever external thing you _think_ is a good
| unique ID, it isn 't.
|
| Names? Addresses? Dates? Nope - have you heard of the "Falsehoods
| programmers believe about..."?
|
| Government IDs? Nope. US SSN was _not designed_ to be used as UID
| from the start, but even when you have an identifier that was -
| like e.g. PESEL number in Poland - it 's _still_ not an unique
| ID, because _people and computers make mistakes_. Multiple people
| having the same PESEL is something that 's been known to happen.
|
| (Side note: "identification number" really means "opaque string".
| IDs are not numbers. If adding or multiplying them together
| doesn't make sense, they're not numbers.)
|
| Anything business adjacent? Don't make me laugh. The article
| covered a lot of the reasons why any kind of tracking IDs are
| broken for the purpose - most of them dealing with the messy
| nature of reality - but businesses are businesses. They cut
| corners. They lie and cheat. That's why you see the same EAN code
| covering multiple products a store might even try and price
| differently. That's why SKUs get reused. That's why these days
| you often have multiple distinct products under the same SKU -
| first, quality batch, released to get early positive reviews, and
| then the main batch made with much cheaper and lower-quality
| parts. Product names? That's even worse - you'll have variations
| such as "extra batch of super-low-quality, reduced functionality,
| made specifically for Black Friday, so it can be sold under the
| same name as the OG product, seemingly at big discount".
|
| This is not to say "never use other people's IDs". There's a
| trade-off here, but making a good choice _starts_ with assuming
| that the external "unique ID" is neither unique, nor a good ID,
| and you'll have to deal with it at some point.
| sabas123 wrote:
| > (Side note: "identification number" really means "opaque
| string". IDs are not numbers. If adding or multiplying them
| together doesn't make sense, they're not numbers.)
|
| Eeehhh I'm not sure if this necessarily true. A telephone
| number is definitely a number but addition doesn't make all too
| much sense on them.
|
| If you want to specifically talk about objects that require
| addition and multiplication to be properly defined on them then
| you can always use Groups and Rings.
| kweingar wrote:
| A phone number is a number in a colloquial sense (a string of
| digits) but I would hate to use any computer system where
| phone numbers are stored or reasoned about as actual
| mathematical numbers.
| JadedBlueEyes wrote:
| A phone number is definitely not a number - for example, it
| can often be written as starting with a +. Not only that, but
| one number can have multiple valid expressions (eg. a number
| with and without a country code). That's ignoring the
| different ways people like to format them (spaces, dashes,
| letters, etc.)
| fanf2 wrote:
| Yes, and leading zeroes are significant in phone numbers.
| OJFord wrote:
| You could construct a telephone numbering system where the
| sum of two 'real' numbers could be dialed to start a three-
| way conference call. That would be.. amusing, to some subset
| of users.
| kroltan wrote:
| Give everyone prime _numbers_ and a N-way conference can be
| done with multiplication.
| Throw10987 wrote:
| I find myself coming back to Data and Reality on a periodic
| basis to remind me that the model is just the model.
|
| https://www.bkent.net/Doc/darxrp.htm
| bruce511 wrote:
| If I could go back 25 years and give myself advice, it would be
| this. A Lot of pain over my career would have been avoided.
|
| I would also tell myself not to use sequential numbers for
| surrogate keys. They're massively convenient, but a real pain
| later on (data security, and distributed data being just two
| cases.)
|
| I know there are folks who like natural keys, but frankly IMO
| they're a straight-up bug. YMMV.
| EdwardDiego wrote:
| I was about to suggest just hashing our DNA, but then,
| remembered identical twins+ exist.
| bruce511 wrote:
| Oh it's worse than that. There are lots of edge cases with
| DNA.
|
| Chimera exist [1], plus there are issues with transplant
| recipients and so on. I expect there are more edge cases I
| don't know about.
|
| [1] https://www.scientificamerican.com/article/3-human-
| chimeras-....
| drivers99 wrote:
| and gene therapy
| amelius wrote:
| Whatever you do, always let some manager decide about the
| actual key, then you can always blame them for their mistake :)
| genuine_smiles wrote:
| > There are no "natural keys". They don't exist in nature. They
| don't exist as a thing in our physical reality. And when it
| comes to human-assigned identifiers, it should be emphasized
| much more strongly that whatever external thing you think is a
| good unique ID, it isn't.
|
| It seems to me that nature has plenty of natural keys, it's
| human created constructs that don't.
|
| For example, the periodic table has natural keys.
| dale_glass wrote:
| Yeah, but you don't want to use that as a primary key for
| your stock database.
|
| It'll work fine until you find you want to keep track of
| U-235 and U-238 separately.
|
| What constitutes identity is context-dependent. Yeah, on one
| level they're the same thing. On another they aren't.
| rzzzt wrote:
| Make it a composite key of atomic and mass numbers, then.
| xmcqdpt2 wrote:
| But then you need to have a stored procedure to maintain
| integrity every time an atom undergoes radioactive decay,
| that's got to be bad for performance!
| c00lio wrote:
| But then you'd miss on crystal structure to distinguish
| your diamonds from your coal.
| OJFord wrote:
| Which single element are they then?
|
| Surely such a system would have a products table with
| something like nominal element fk, purity, mass columns
| anyway. (And stock table with fk to products.)
|
| Then nothing wrong with 'natural' key on the elements,
| but sure, products are the thing you've (personally) made
| up, so of course there's no 'someone else's' key to use.
| TRiG_Ireland wrote:
| Diamonds and coal are both pure carbon (mostly
| carbon-12). So is graphite, and buckminsterfullerene, and
| a number of other things. Crystal structure matters a
| lot.
| OJFord wrote:
| They're not _pure_ carbon was my point. I wasn 't saying
| crystal structure not important, I missed that but it'd
| be in my 'product derived from element x' table.
| MyThoughts100 wrote:
| The periodic table is a human construct. So is the concept of
| an identifier.
| jameshart wrote:
| I'd wager that nonhuman intelligences out there in the
| universe have analogous tables to the periodic table.
|
| And I suspect it would be possible for us to JOIN our
| periodic table onto theirs by matching on the atomic number
| field to create a mapping between, say, our names for the
| elements and theirs.
|
| That seems unlikely to be something we would be able to do
| with literally any other dataset that an alien civilization
| has. Their star catalog will contain many of the same
| objects ours does - but there won't be an obvious key that
| lets us join them up.
|
| I guess there's a few mathematical objects that would work
| - the list of Platonic solids maybe, or the symmetry
| groups... but in terms of real world physical things?
| Atomic numbers might be the only truly natural key there
| is.
| dkarl wrote:
| The periodic table groups atoms by the number of protons
| they have and ignores other differences. Uranium 235 and
| Uranium 238 have the same "natural key" according to the
| periodic table. The periodic table is useful for solving
| a lot of problems, and for other problems where it
| doesn't work to treat U-235 and U-238 as the same, we
| (and other intelligences) use other ways of categorizing
| atoms.
| jameshart wrote:
| Right, but there's a reason we call them 'isotopes of
| Uranium' rather than saying U-238, Np-238 and Pu-238 are
| isotopes in the 'family of atoms with atomic mass 238'.
|
| Things with 92 protons in them behave chemically in
| similar ways. They form the same crystals and molecules.
|
| If you were building a table of isotopes you'd put a
| column for 'proton count', and you'd find lots of the
| properties of the isotopes depend on the proton count not
| the atomic mass. So you could normalize that table
| structure out and create another table with the atomic
| number as the key, and put all the properties that are
| common to all isotopes of a given element in that table.
|
| You would have made a table of elements and it would have
| the same primary key as the periodic table.
|
| That's what a natural key is. One that emerges naturally
| when you normalize data you have collected.
|
| I suppose similarly 'electrons in outer shell' might
| emerge as a natural key by that process.
| rocqua wrote:
| The concept of an identifier comes with the concept of a
| bijection (or at least a injection). That is a truly basic
| concept in Mathematics.
|
| You could argue Mathematics is a human construction. But at
| that point, isn't every concept a human construction? In
| that case it is hardly helpful, or meaningful to say a
| concept is a human construct, since all concepts are.
| imtringued wrote:
| The problem isn't the concept of a bijective function, it
| is picking the right function and that is a social
| problem.
| xmcqdpt2 wrote:
| Most elements are stable, especially most of the ones we
| deal with, and they fall in clear periods and rows based on
| chemical valence. This is true for main group elements and
| mostly true for transition metals.
|
| I think the periodic table was discovered, not invented. My
| guess is that whenever we communicate with some advanced
| alien species, the periodic table is going to be one of
| those things we share.
| TeMPOraL wrote:
| Periodic table is recording _multiple_ different patterns
| in atoms. Some better than others. By nature, its a
| dimensional reduction. There are many alternative
| renderings of the periodic table (or spiral, or cube, or
| ...) emphasizing different regularities. The one you
| refer to is just the most popular rendering.
| rini17 wrote:
| Most of the time you don't interact directly with elements,
| but with molecules. Oh, and ions. And radicals. And and....
| TeMPOraL wrote:
| And isotopes. And transient, unstable configurations that
| may exist for a brief moment before decaying into something
| that slots into the periodic table better.
|
| And also elements aren't a thing, quantum fields are. This
| is why you sometimes see things that just don't fit the
| idea of distinct atoms. But then quantum fields also aren't
| a thing. We don't know what exactly is _the_ thing. We only
| have increasingly sophisticated working models.
| rini17 wrote:
| No need to go to such philosophical depths. We're only
| poor engineers trying to come up with workable model. The
| periodical table of elements was a true boon and it is
| unlikely these "foreign keys" will change, but still - it
| alone isn't fully suitable as data model for chemistry.
| jack_squat wrote:
| I also tend toward surrogate keys, and would also point out that
| the overhead concerns frequently weigh in favor of them as well
| (though not always).
|
| If your natural key is bigger than the appropriately sized
| surrogate key, and it will be referenced across the database in
| other tables, the overhead of referencing it everywhere (and
| indexing it in those places -- potentially in much larger tables)
| can easily outweigh the overhead of the extra bytes on the
| original table.
|
| Additionally, surrogate keys can realize much better performance
| WRT cache locality.
| efitz wrote:
| Slightly OT but related: one of the problems with identifiers
| that I have come to accept is that the things underlying the
| identifiers change over time. Things come and go, businesses
| launch and fail or are acquired/absorbed into other businesses,
| humans are born and die, humans change their names or genders,
| people buy and sell and subdivide or combine real property,
| people paint things different colors- the only constant is
| change. And it generally wreaks havoc because people who thought
| up identifiers for things usually assumed immutability of a
| mutable thing, and often don't have workable procedures for
| keeping up with change.
|
| I've come to the conclusion that identifiers should be structured
| in such a way that they have no relationship to what they
| identify - for instance, never ever ever use a name as an
| identifier, no matter how you normalize it. All properties of
| things should be mutable descriptors. Identifiers should be
| numeric-ish. Identifiers are not user interface, either - don't
| try to format them for readability - the audience for identifiers
| should be automation, not humans. You should give humans a way to
| serialize them, such as a URL or UUID or whatever, but the point
| is just to port the thing unambiguously between systems, not to
| be human readable. You can build UX that looks up the ID and
| shows you what the descriptive properties are.
| hdudhsyw wrote:
| [flagged]
| xigoi wrote:
| Medium has had paywalls for a long time.
| karmakaze wrote:
| > DALL-E -- "A photograph of meat in the shape of a key on a
| solid background"
|
| I feel like I'm now reading Matrix code where we can do away with
| the image itself and just keep the prompt.
| eshnil wrote:
| I ran into the same problem while building
| https://learnawesome.org/ . Forget the broad class of "learning
| resources", even the "books" category doesn't have a usable
| unique ID. Not everything gets an ISBN for example. There's also
| the ambiguity between a "work" and an "edition" of a work.
|
| This is probably why OpenLibrary supports mapping of books with
| 40+ identifiers:
| https://github.com/internetarchive/openlibrary/blob/master/o...
| makeitdouble wrote:
| ISBN is an interesting case as the publisher has leeway on how
| it's handled accross the lifetime of a book. In particular some
| books will get multiple ISBN for the same version, and some
| other books can keep the same ISBN through multiple minor
| revisions.
| TazeTSchnitzel wrote:
| In Sweden, almost everyone is assigned a number at birth, a
| _personnummer_ , with the format YYMMDD-NNNN (or YYYYMMDDNNNN in
| computing, to avoid having a Y2K issue every time someone turns
| 100).
|
| A lot of computer systems here seem to treat it as a natural key,
| which is a terrible idea, because:
|
| * Not everyone has one (notably, asylum seekers and temporary
| residents don't), so if you rely on them, you're excluding those
| people.
|
| * They can change. They won't for 99.9% of people, but not
| everyone is so lucky. If you use these as your _primary_ key, you
| 're going to create a headache for you and your users for that
| 0.1% of cases.
| schlowmo wrote:
| Not so long ago I had to deal with the Austrian version of a
| Social Security Number and it turns out that using birthdates
| inside such a number isn't a good idea, since
|
| * Some people doesn't have a birthdate with which the Austrian
| state agrees
|
| * For those people the Austrian state at least assumes a
| birthyear with a birthday at January 1st or July 1st
|
| * This leads to overflowing the available numbers for those two
| dates
|
| * If a date overflows, they get SSN with month "13". The 13th
| month can have as many days as needed (so > 31 is possible).
|
| * Now you have SSNs with a month part of e.g 2023-13-32.
|
| Why even bother with birthdates then?
| BlueTemplar wrote:
| I'm curious, was there any pushback to its introduction ?
|
| In France, when the government tried to introduce something
| like this in 1973, it (eventually) caused a scandal (after a
| scathing Le Monde article), which resulted in the whole project
| being scrapped, and the French data protection authority CNIL
| being created :
|
| https://www.lemonde.fr/blog/bugbrother/2010/12/23/safari-et-...
| (fr)
|
| (CNIL was later defanged in 2004.)
| Phelinofist wrote:
| In what cases can this number change?
| meithecatte wrote:
| Poland has a very similar numbering system, the PESEL. It can
| change if the date of birth was recorded inaccurately and you
| need to correct it, or, because the PESEL encodes gender,
| when you get your gender marker legally changed.
| playingalong wrote:
| PESEL system was bought from Sweden.
| Philpax wrote:
| The personnummer encodes gender in the last digit (which,
| IMO, is very silly) - so when that changes, the number has to
| change too.
| hgsgm wrote:
| That explains all the fuss of recent years. Transgenderism
| is an affront against data schematization.
| Pamar wrote:
| Personally I think this is very bad idea especially because it
| would be be very _tempting_ to use the id to - for example -
| trivially "decide" if you can, for example, smoke, drink
| alcohol, watch porn.
|
| But this would also make some tricks to extend the scheme to
| corner cases (e.g.: refugees will get YYMMDD-NNNN where the
| date part is when they were registered at entry) fail
| miserably.
|
| At the very least, a single letter code in the mix would
| mitigate the problems a bit.
| sebastialonso wrote:
| Any scheme that covers 99.9% of cases is a success, something
| quite hard to achieve. Calling it a terrible idea is a bit
| melodramatic, no?
| ttepasse wrote:
| Should't there be discussion about changing this number scheme?
|
| According to Wikipedia the NNNN part is really NNGC. C is a
| checksum, giving NNG as a serial number with G even/odd for
| gender. Sweden seems to have ca. 115000 births per year, making
| 315 births per day. But those may not be evenly distributed.
| The available space or serial numbers seems awfully close.
| jackjeff wrote:
| I think that's true for Social Security numbers in France/US,
| and National Insurance numbers in UK.
|
| About 15 years ago, after working in Germany I came back to
| France for a couple of years, the country where I was born /
| assigned a number at birth / following a similar pattern with
| the year in the code.
|
| My health insurance from Germany did not transfer according to
| EU rules or something, at least that's what the regional French
| National Security told me. The way they got me out of the
| Kafkaesque situation was to give me a "temporary" social
| security number, like they would give a foreign immigrant, and
| have it revert back to my "normal" number after 3 months. At
| some point I had "two" national security numbers. It was a real
| mess to fix the whole thing afterwards as you can imagine, and
| considered I left the country only after a couple of years, I
| wonder why I even bothered.
| tough wrote:
| French bureaucracy will kill your soul slowly but surely
| c00lio wrote:
| Yes, but what would be the alternative?
|
| Of course you can use artificial keys like a random UUID for
| each of your db objects. Which will work quite well if you only
| interact with your own system.
|
| But as soon as one of your db objects needs to be linked to
| some other system, you will need some common ground for a
| correlation, and something like a personnummer will help
| immensely and solve 99.9% of your problems. The 0.1% of
| problematic cases is far less than the headaches that the
| absence of some common ID scheme will cause you.
|
| Imagine e.g. having to correlate on the usual "Name, First
| Name, Birth Name, Place of Birth, Date of Birth" dance: My
| place of birth used to be called Lower Unxton at the time of my
| birth, now it has been reformed with Upper Unxton and
| Exampleville into Greater Unxton. Which one do you want? Did I
| always give the same answer to that question? My parents also
| divorced and my birth name legally changed. Do you want the old
| or the new birth name? Btw., my legal date of birth is actually
| in a non-gregorian calendar. How do I input that into your form
| field, it doesn't seem to like Showa 50 as a year, not even to
| speak about the proper characters?
|
| I'd claim any imperfect personnummer is still far superior to
| all the problems one would have without it.
| cpfohl wrote:
| They're not suggesting you don't store it... I don't think
| you're addressing the real concerns here.
|
| How I read the concerns: External connections to your data
| belong in a field or a separate table to ensure that your
| data is not so tightly coupled to the key you've chosen that
| you limit your data model's expressive power. Don't make an
| external identifier the primary key in your database.
| c00lio wrote:
| Ah. Thanks for the explanation. With that I agree
| completely.
| Groxx wrote:
| You deal with it the same way as you deal with any value that
| may change and may not be unique: you don't make it your
| primary key.
|
| Just index it separately, and resolve your links at insertion
| time to your internal actually-stable-and-unique keys. Then
| when your system runs into one of these duplicates _you can
| tell what is happening_ , and you have a chance to fix it,
| rather than blindly plowing ahead and ruining even more data.
|
| None of this has anything to do with how you represent your
| API to other systems. Do whatever you need there. And if
| system X is confused and thinks public-ID-Y is actually Z,
| well now you can handle it - that's just another
| discriminator, but you select this one based on the querier.
| vaylian wrote:
| > Yes, but what would be the alternative?
|
| Not have a nation-wide ID scheme. The problem is that people
| will rely too much on it being always correct and a lot of
| infrastructure will grow on top of it so that it is really
| hard to fix things.
|
| A better approach is to keep the human in the loop. If a
| government or a company wants to create a connection between
| databases, let the human whose data is requested, provide the
| external keys/IDs upon request. That's less automation, but
| it allows for more flexibility and better personal data
| protection.
| wolfgang42 wrote:
| We tried that in the US, and we ended up with an ad-hoc one
| anyway. It seems to be far too useful of a concept in
| modern society to avoid this happening whether you want it
| to or not.
| emodendroket wrote:
| Yeah the US system of hijacking numbers that were never
| meant for this purpose and treating them as sufficient to
| take out credit in your name, and then having fifty
| redundant DMVs handling identification cards makes much
| more sense.
| dvdkon wrote:
| Consistency in one database is important and "natural keys" can
| help, it's just a bad idea to apply them to messy nature. They're
| great for derived or linked data, though.
| xipho wrote:
| Nature + Identifiers is an issue (e.g. see
| https://github.com/tdwg/tag/issues/36). I've routinely mentioned
| to others in my field to look at other industries for ways
| forward, particularly ones like the music industry, so its
| interesting, if unsurprising to see all the same problems arise
| there. For those who know, in our field, when Identifiers comes
| up in conversation at conferences etc., we tiptoe away, somehow
| people can't learn from the past issues.
|
| There are two issues that keep coming up in my mind: 1) People
| want Identifiers to _do_ something (like resolve), rather than
| just be identifiers and 2) People think that there are such
| things as "unique" identifiers (one identifier per "thing").
| Neither, in my mind, are the purpose of identifiers. Identifiers
| should do one thing, localize you to some concept. By localize I
| mean that if you can find the digital space (or in physical
| collections where identifiers are used the physical "printed"
| identifier) that "contains" the identifier then you should have a
| reasonable probability of finding the thing/concept that
| identifier is for. That's all. No certainty, no uniqueness. It's
| very akin to what we do when we cite something in a publication,
| we are giving the researcher who reads it a reasonable chance of
| finding the origin. This isn't to say that we shouldn't try to
| keep identifiers unique though, it's to say that when it comes
| down to crunch time we should never assume 1) that they are
| unique, and 2) that their special properties (e.g. that they
| resolve) actually work.
|
| I've seen numerous identifier schemes come and go. We have
| specifically designed a 1-many for things-to-identifiers in our
| systems (sitting on top our internal IDs, yay, another ID). DOIs?
| They must be unique, right? Nope. Institutional CODENs? Nope
| (though the botanists have done it pretty well through community
| peer-pressure). People IDs? Do you mean wikidata or ORCiD or?
|
| As others have noted, identifiers really are just labels, though
| things like UUIDs have the game-changing property of reducing the
| probability that you're looking at a homonymous label.
| IanCal wrote:
| The really key thing with this is to identify what your
| database/Id is solving _and what it is not_. What guarantees are
| you giving and what problems should you raise to people. You can
| rarely solve all of them.
|
| I helped make GRID which was a database of research institutes in
| the world (morphed into ROR). But what is one institute? Is it
| the legal entity? Is a lab an institute? What about a research
| boat? A hospital? Which is the parent, the hospital or the
| university?
|
| The answer to lots of that is "maybe, it depends". We had to
| decide what use cases we were solving, and often it came down to
| some human fuzzy definition, how do people use this in reality?
|
| We built it because others had made very strict sets of rules or
| were very open and then were just not _useful_. A curated list of
| a 1000 institutes set up as people generally used them solved
| most issues. Scaling it up another 10x solved most others. Not
| all, not for everyone.
|
| A recording of background noise or silence isn't a song. Unless
| of course it's 4'33".
|
| > They're all arbitrary, unnatural, human inventions
|
| A really great summary.
| marcosdumay wrote:
| > The really key thing with this is to identify what your
| database/Id is solving and what it is not.
|
| On this context, it is very simple and straight-forward. It
| also generalizes almost completely.
|
| Your database id is solving the problem of creating/managing
| the relationships within your data. This generalizes to
| basically everybody because "database id" is a concept created
| within a mathematical theory exactly and exclusively for that
| purpose.
|
| If you try to add more purposes for it, you _will_ get
| conflicting requirements.
| narag wrote:
| _The really key thing with this is to identify what your
| database /Id is solving and what it is not._
|
| That's the money quote. If the indirection of long, composite,
| maybe changing foreign keys could be managed by the system,
| natural keys would be the way to go. But that doesn't happen.
|
| _> They're all arbitrary, unnatural, human inventions_
|
| That part from the article is the weaker point. Semantics.
| "Natural" only means that it comes from the problem domain, it
| doesn't really matter.
| fauigerzigerk wrote:
| Which goes to show that we often conflate identification and
| classification because it's convenient. It's not always the
| best decision longer term but it may be good enough.
|
| For your particular task, you could have chosen to identify
| things of interest and assign them an ID. Separately, you could
| have made (and revised) a decision on whether a particular
| thing is a lab or a research institute and/or some other kind
| of entity.
|
| In my experience, organisations are some of the most difficult
| things to model. The ways in which they can merge, split, be
| part of one another, be the same and not the same in different
| jurisdictions or according to different sets of rules are
| basically infinite.
|
| So when you say "The really key thing with this is to identify
| what your database/Id is solving and what it is not" I totally
| agree. But I would add that identification and classification
| solve overlapping problems that are often deceptively similar
| but not quite the same.
| IanCal wrote:
| > For your particular task, you could have chosen to identify
| things of interest and assign them an ID. Separately, you
| could have made (and revised) a decision on whether a
| particular thing is a lab or a research institute and/or some
| other kind of entity.
|
| I get the point but we couldn't. The very concept of what
| _one_ thing is, is a fuzzy decision in itself. There 's also
| not a single hierarchy that makes sense for different uses so
| you can't just group and then classify - legal ownership,
| financing and operational control can all be distinct
| structures. Then how people actually work can be another.
|
| Even _then_ , fundamentally what we needed was to get away
| from trying to model the world then work out how to use it.
| Because then you have one Id that maps one way for one user
| and another for another user and now their work is
| incompatible - but in a subtle and hidden way.
|
| Edit - this doesn't take away from your point, classification
| is a distinct problem.
| cwillu wrote:
| 4'33" isn't a song either. If you call a tail a leg, a dog
| still has 4 legs.
| drivers99 wrote:
| So in OP's example of tracking every song, if I want/need
| 4'33" in my database of all songs, but I need to keep track
| of whether cwillu think's it's a song for some reason (they
| seem pretty authoritative, better make note of it), I could
| create a foreign key for that and set it to values like
| "cwillu said it's a song", "cwillu said it's NOT a song", and
| "cwillu has not specified if it's a song yet". That's why
| natural keys don't exist.
| dgellow wrote:
| In real life dogs can exist with more or less than 4 legs.
| dpierce9 wrote:
| Yes! The existence of a five-legged dog makes "all dogs
| have four legs" false but it doesn't make "dogs have four
| legs" false. Similarly, the existence of a two-legged dog
| makes "some dogs have only two legs" true while "dogs have
| two legs" is still not (unless you just mean the former).
|
| Dog is a categorical abstraction over individuals. The
| properties of a category tend to have a complex
| relationship to the members of the category. This is true
| even if it is indisputable who the members are and what
| properties the members have.
| IanCal wrote:
| For a whole bunch of useful definitions and use cases, it's a
| song. Arguing about the precise delineations of song/not song
| is missing my point.
|
| > If you call a tail a leg, a dog still has 4 legs.
|
| I can't think of a useful definition of legs that includes
| the tail. I can for calling a recording of a musical piece a
| song.
| [deleted]
| [deleted]
| stefncb wrote:
| Since we're going into this meta-philosophical thing, your
| statement is false. If I call a tail a leg, dogs definitely
| have five legs. Unless they were of course born without a
| leg, or have otherwise lost one. There is no "reality" from a
| human perspective, it's all sensory input that goes through a
| lot of very subjective processing.
|
| You can't _really_ categorize things. You can mostly sort of
| do it, but it never actually works out, because categories
| are human stuff, and human stuff is subjective and thus
| different depending on who you ask. Sometimes unclear even to
| the same person.
|
| I can say dogs have five legs and the only reason most people
| will think I'm wrong is that society has come to categorize
| limbs based on function instead of length.
| tough wrote:
| Getting a little pedantic 2. a large branch of a tree. "the
| bare limbs of a high tree" Similar: branch
|
| limb (n.2) late 14c., "edge of a quadrant or other
| instrument," from Latin limbus "ornamental border, hem,
| fringe, edge," a word of uncertain origin.
|
| Just thought of the limbic system and how that might be
| related, doesn't seem to be about function but about
| positioning against a core system
| stefncb wrote:
| My point was that categorizing limbs (differentiating
| them from one another) is done by function. Now that I
| think about it limb isn't what I had in mind anyway, what
| I meant to say was that we categorize parts of the body
| in terms of function.
|
| But you're right, limbs are not defined by function.
| wolfgang42 wrote:
| a "natural key" is frequently just a _really_ foreign key in a
| database you and your org don 't manage. -- 'wwweston,
| https://news.ycombinator.com/item?id=27349246
| atoav wrote:
| As someone who has been thinking about a system to organize my
| improvisation band's rehearsal recordings I have made many of the
| same realizations.
|
| Only with us some of the points are even wilder, because for us
| "a song" is the abstract concept or idea for an improvisation
| rather than a fixed set of notes played in a fixed way in a fixed
| order. Sometimes that idea can be a word, sometimes it can be
| very precise sound settings, high-level structural ideas, a set
| of lyrics, rhythmic patterns, melodic themes etc.
|
| So any given recording can contain within itself any number of
| songs (including zero). Those songs can exist or not. And my
| conclusion for our case was that there is no automated way of
| deciding which is which, as at times it could be hard even for us
| ourselves.
| chefandy wrote:
| I hate to throw OT shade, but as a commercial artist, I'm gonna-
| the hero photo in this article shows the importance of my
| profession. Creating images is a lot easier than deciding how
| best to visually communicate an idea and conceptually manipulate
| the most relevant components. I probably could have made a vector
| drawing in 10 minutes, or a photorealistic image in photoshop in
| about 15 that would have done the job a million times better, and
| considering how much cheaper artists and designers are than
| developers, it probably would have been about the same labor cost
| if money was involved. I get it... It's a blog post... But it's
| seriously like the clipart revolution all over again, just
| without the guard rails provided by pre-made imagery.
| ak39 wrote:
| Always use surrogate primary keys. Use alternate keys (AK) for
| your "natural keys". Modify the AK business rules as and when
| needed. There are several advantages to using meaningless random
| integers, sequential integers or UUIDs as your table's PK. The
| most important one is that if the PK participates (or is expected
| to participate) as a foreign key in other tables, it's crucial to
| us a design where you never ever have to change the PK!
| bruce511 wrote:
| All this, but I'd steer away from sequential numbers. They
| suffer from security problems, data-merging problems, data-
| distribution and replication problems, and more.
|
| Over a long career I've come to favor UUID, the cost (space and
| performance) are well worth the advantages.
| scrollaway wrote:
| UUIDv7 has all the advantages of UUID, plus sortability, plus
| an embedded timestamp which can replace your "created_at"
| thus reducing the cost aspect. It's also soon becoming an
| official part of the UUID spec.
|
| In fact, if your created_at is indexed and you can get rid of
| it via uuid7, there is practically no difference in terms of
| table size. The size cost is solely on foreign keys.
| evntdrvn wrote:
| I really hope that I can find someone on the MS SQL Server
| team to champion implementing native UUIDV7 support, from
| what I've heard they are really hesitant to add more data
| types but a person's gotta try :)
| karol wrote:
| I think in the future the idea of labelling every passing moment
| and every sounds that anything ever makes will lose prominence.
| Wondering what will replace them. I think we will either drop
| this need altogether or the "natural key" will come from "the
| state of reality".
| SnowHill9902 wrote:
| This is not true. Given two songs A, B, an informed human can
| tell whether A and B are the same song or not. Now you've created
| a UNIQUE constraint on your data, so you have a unique
| identifier.
| quesera wrote:
| This is not true. The article gives the great examples of a
| song through its lifecycle: covered, remixed, remastered,
| compiled on a greatest hits record, included on a soundtrack,
| etc. Same song, sometimes. Same recording, sometimes. Different
| database entries, always.
|
| To the extent that "an informed human" and "the same song" have
| meaningful definitions of "informed" and "same", you've just
| recreated the problem.
___________________________________________________________________
(page generated 2023-05-28 23:02 UTC)