[HN Gopher] Dublin Core's dirty little secret (2010)
       ___________________________________________________________________
        
       Dublin Core's dirty little secret (2010)
        
       Author : dcminter
       Score  : 97 points
       Date   : 2021-09-30 19:11 UTC (2 days ago)
        
 (HTM) web link (reprog.wordpress.com)
 (TXT) w3m dump (reprog.wordpress.com)
        
       | PaulHoule wrote:
       | Dublin Core was always trying too hard to not be hard to use. It
       | always struck me as a metadata standard for the library at an
       | elementary school.
       | 
       | I was amused at how Adobe's XMP format, which amended DC, used
       | features of RDF that most people were scared to use (ordered
       | collections) to solve the simple problem of keeping track of
       | the order of authors' names on a paper.
       | 
       | The big trouble I see with semweb standards is that they stop
       | long before getting to the goal. RDFS and OWL, for instance,
       | solve some problems of mashing up data from different vocabs,
       | but they can't do the math to convert inches to meters, which
       | is absolutely essential.
       | 
       | I was not so amused to see that Adobe thoroughly nerfed XMP, so
       | it no longer sucks in EXIF metadata from images. I was hoping I
       | could find the XMP packet, put it through an RDF/XML parser and
       | get all the metadata from my images, but even in Lightroom
       | images the XMP packet is a nothingburger these days. I guess
       | people complained about the GIF files that had 2k of pixel data
       | and 80k of XMP, and that Adobe was co-opting other standards.
       | Heck, they even deleted most of the content from the standard,
       | including how to express EXIF in RDF, so you can't stick XMP in
       | your database and SPARQL it.
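       | 
       | For illustration, a minimal sketch of that kind of pipeline,
       | assuming an image whose XMP packet is still intact RDF/XML. It
       | uses Python's rdflib; the filename and the packet-scanning
       | regex are illustrative placeholders, not Adobe's spec:
       | 
       |   # Sketch: pull the XMP packet out of an image file and query
       |   # it with SPARQL. Assumes the packet sits in the file bytes
       |   # between the usual <x:xmpmeta ...> ... </x:xmpmeta> markers;
       |   # rdflib parses it as RDF/XML.
       |   import re
       |   from rdflib import Graph
       | 
       |   def extract_xmp(path):
       |       """Return the first XMP packet in the file, or None."""
       |       data = open(path, "rb").read()
       |       m = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data,
       |                     re.DOTALL)
       |       if m is None:
       |           return None
       |       return m.group(0).decode("utf-8", errors="replace")
       | 
       |   packet = extract_xmp("photo.jpg")   # illustrative filename
       |   if packet:
       |       g = Graph()
       |       g.parse(data=packet, format="xml")  # XMP is RDF/XML
       |       # Dump every Dublin Core triple. Note that XMP stores
       |       # most dc: values as rdf:Seq/Bag/Alt containers, so some
       |       # objects will be blank nodes you then follow.
       |       q = """
       |           SELECT ?s ?p ?o WHERE {
       |               ?s ?p ?o .
       |               FILTER(STRSTARTS(STR(?p),
       |                      "http://purl.org/dc/elements/1.1/"))
       |           }
       |       """
       |       for s, p, o in g.query(q):
       |           print(s, p, o)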
        
         | blipmusic wrote:
         | Depends on what device generates the image, perhaps? DJI
         | drones put a lot of info in the EXIF XMP. I just parse that
         | as XML.
         | 
         | EDIT: It's not 80k, but good enough as a GPS log.
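         | 
         | Roughly, "parse it as XML" can look like the sketch below:
         | xml.etree on the raw packet, with no DJI tag names hard-coded
         | since those sit in a vendor namespace I'd rather not guess
         | at. The filename is a placeholder.
         | 
         |   # Treat the embedded XMP packet as ordinary XML and dump
         |   # every attribute it finds; on DJI files the GPS values
         |   # typically appear as attributes of rdf:Description.
         |   import re
         |   import xml.etree.ElementTree as ET
         | 
         |   data = open("dji_frame.jpg", "rb").read()
         |   m = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data,
         |                 re.DOTALL)
         |   if m:
         |       xml_text = m.group(0).decode("utf-8", errors="replace")
         |       root = ET.fromstring(xml_text)
         |       for elem in root.iter():
         |           for name, value in elem.attrib.items():
         |               print(name, "=", value)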
        
           | PaulHoule wrote:
           | The GIF in the story was generated by Adobe tools in the
           | bad old days, circa 2005. I noticed the nerfing happened in
           | the last 5 years, when I looked closely at what Lightroom
           | does.
        
             | blipmusic wrote:
             | In terms of wasted space, only 1/8th of that DJI EXIF
             | chunk is actual XMP; the rest is padding (for 13 MB JPEGs
             | in our case). 8k total, so about 1k of actual XMP data,
             | much of which is structural. I've only tried with
             | scripts, verifying with exiftool and a hex editor every
             | now and then.
        
       | sgt101 wrote:
       | I was involved with a project, once upon a time, that spent
       | six months developing ontologies for event ticketing; not me,
       | thank god, but another team of about eight.
       | 
       | The delivered ontology got used twice or maybe three times, and
       | it didn't cover all the corner cases _we knew about_.
        
         | smitty1e wrote:
         | Heard a bloke introduce himself at the onboarding as a
         | "Semantic Ontologist".
         | 
         | His sobriety was only exceeded by his earnestness.
         | 
         | I prayed he'd not become attached to my project, as he looked
         | like the sort who could love a problem for weeks without ever
         | reaching a conclusion.
        
           | Vinnl wrote:
           | Oh wow, having interacted with the SemWeb community quite a
           | lot in the past couple of years,
           | 
           | > the sort who could love a problem for weeks without ever
           | reaching a conclusion.
           | 
           | is such an accurate description of a number of the people
           | in it. Possibly the root cause of all the problems that led
           | to it never taking off at the scale they'd envisioned.
        
             | tasogare wrote:
             | The semantic web is a non-solution in eternal search of
             | non-problems.
        
           | jdixnneij wrote:
           | Sounds like the eternal student; they just don't work in
           | business. Although seemingly I keep getting lumped in with
           | the useless, persistent question-askers. Sometimes I feel
           | like I'm discussing shit with David Attenborough. Just get
           | on with the job. _ends rant_
        
             | jdixnneij wrote:
             | Truth hurts. Good luck on your third PhD.
        
       | tuukkah wrote:
       | These days, you can use Wikidata and the related tooling such as
       | Scholia. Here's a starting point for their ontology, including
       | the difficult details such as the "page(s)" qualifier for the
       | property "published in":
       | https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_Me...
       | 
       | Scholia: https://www.wikidata.org/wiki/Wikidata:Scholia
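       | 
       | As a concrete taste, a small query against Wikidata's public
       | SPARQL endpoint (in Python, via requests). The IDs used are
       | Q13442814 = scholarly article, P31 = instance of, P1433 =
       | published in, P304 = page(s), quoted from memory, so verify
       | them; and while the project page describes page(s) as a
       | qualifier on "published in", the sketch uses it as a direct
       | statement for simplicity:
       | 
       |   # Fetch a few scholarly articles with venue and page numbers
       |   # from the Wikidata Query Service.
       |   import requests
       | 
       |   QUERY = """
       |   SELECT ?article ?articleLabel ?venueLabel ?pages WHERE {
       |     ?article wdt:P31 wd:Q13442814 ;   # scholarly article
       |              wdt:P1433 ?venue ;       # published in
       |              wdt:P304 ?pages .        # page(s)
       |     SERVICE wikibase:label {
       |       bd:serviceParam wikibase:language "en".
       |     }
       |   }
       |   LIMIT 5
       |   """
       | 
       |   resp = requests.get(
       |       "https://query.wikidata.org/sparql",
       |       params={"query": QUERY, "format": "json"},
       |       headers={"User-Agent": "metadata-example/0.1"},
       |   )
       |   for row in resp.json()["results"]["bindings"]:
       |       print(row["articleLabel"]["value"], "|",
       |             row["venueLabel"]["value"], "|",
       |             row["pages"]["value"])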
        
       | wheybags wrote:
       | Was 90% sure this was going to be an article about homelessness
       | in Dublin city centre (the real Dublin, not some imitator in Ohio
       | :p)
        
       | thinkingemote wrote:
       | I think one of the issues is that librarians work with things
       | that have been made and do not change. The catalogues also
       | traditionally did not change, but there's no reason now why
       | this should be the case.
       | 
       | If we look at Wikidata or OpenStreetMap, for example, the way
       | things are categorised changes and evolves: some things stay
       | the same, new things get added, some things get deprecated,
       | bots or humans can update existing records, and machines can
       | handle any deprecations.
       | 
       | Far better, then, to start simple, get adoption and update
       | records as you go, I think.
        
         | dredmorbius wrote:
         | One of the particularly notable things about libraries and
         | catalogues is how much _does_ change.
         | 
         | There are on the order of 1--5 million new books published
         | in English each year (about 300k as "traditional"
         | publications, a fairly constant level since the 1950s), the
         | largest collections have tens of millions of volumes, the
         | cataloguing technology and standards have changed
         | tremendously over the past 50 years, and the role and usage
         | of libraries themselves is in constant flux. At the same
         | time, budgets and staff are small relative to the total
         | record store.
         | 
         | The challenge of organising, classifying, and _making
         | accessible_ all human knowledge has occupied numerous great
         | minds for centuries. We're familiar with Google's take,
         | marrying full-text search with a document-ranking model. That
         | worked relatively well for about 20 years. It's now showing
         | extreme signs of stress.
         | 
         | At the same time, much of the past (and present) traditionally-
         | published media is finding its way online, whether the original
         | publishers and established libraries like it or not.
         | 
         | LibGen and SciHub may have far more to say about the future of
         | cataloguing standards than the OCLC.
        
           | cratermoon wrote:
           | Oh and how those categories change and are subject to human
           | error and bias.
           | 
           | Library of Congress Class D: WORLD HISTORY AND HISTORY OF
           | EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
           | 
           | Subclass DB: History of Austria. Liechtenstein. Hungary.
           | Czechoslovakia
           | 
           | https://www.loc.gov/aba/cataloging/classification/lcco/lcco_.
           | ..
        
             | dredmorbius wrote:
             | Right.
             | 
             | If you dig into the LoC Classification, what you'll find
             | are sections, some large, some small, that have been
             | entirely retired or superseded.
             | 
             | Geography is only one of those. Psychology has seen huge
             | changes, notably as regards human sexuality and gender.
             | Technology and business, in which change and obsolescence
             | are quite rapid, are another. (Both have significantly
             | co-evolved with the development of cataloguing systems
             | and library science itself.) Emerging domains,
             | particularly computer science and information systems,
             | have struggled to find where they fit within the
             | classification. And different domains are both
             | idiosyncratic and specific to the Library's interests.
             | For the LoCCS, US history and geography are
             | overrepresented relative to the rest of the world. And
             | the entire Law classification, much of it reasonably well
             | developed, has a markedly different style and character
             | from the rest of the classification.
        
               | cratermoon wrote:
               | > US history and geography are overrepresented relative
               | to the rest of the world.
               | 
               | Yeah, in the document I linked, all of Africa gets
               | lumped into the single subclass DT, but Subclass DX is
               | "Romanies". It was formerly "History of Gypsies", btw:
               | https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...
        
               | dredmorbius wrote:
               | On "gypsies", one of numerous nomenclature changes.
        
               | dredmorbius wrote:
               | ... which is to say: what _mature_ cataloguing practices
               | have developed _is a change management process_. A
               | recognition that boundaries shift (metaphorical,
               | physical, and political), knowledge evolves, and
               | terminology may come to be outdated or even prejudicial.
               | 
               | LoCCS isn't perfect. But it has developed the self-
               | awareness, capabilities, institutions, and processes to
               | change and adapt without destroying itself in the
               | process.
        
               | cratermoon wrote:
               | But why do the Romani get an _entire subclass_, when
               | Africa (ALL of Africa) is also just one subclass?
        
       | amichal wrote:
       | Dublin Core has always been rough and strange, and I'm SURE
       | actual librarians can tell me all sorts of stories about
       | complex things, but schema.org has a fine model at first
       | glance: https://schema.org/ScholarlyArticle. Check out the MLA
       | example at the end, and how DOIs and articles spanning multiple
       | issues are handled, etc.
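       | 
       | For a rough idea of the shape of that model, here is a sketch
       | of a ScholarlyArticle serialized as JSON-LD from Python. The
       | property names (pagination, isPartOf, PublicationIssue,
       | PublicationVolume, sameAs for the DOI) are schema.org's, but
       | the bibliographic details are invented and this is not the MLA
       | example from the page:
       | 
       |   # Sketch of a schema.org ScholarlyArticle, nested inside a
       |   # PublicationIssue / PublicationVolume / Periodical chain.
       |   import json
       | 
       |   article = {
       |       "@context": "https://schema.org",
       |       "@type": "ScholarlyArticle",
       |       "name": "An Example Article",
       |       "author": [
       |           {"@type": "Person", "name": "Jane Doe"},
       |           {"@type": "Person", "name": "John Roe"},
       |       ],
       |       "sameAs": "https://doi.org/10.0000/example",  # the DOI
       |       "pagination": "123-145",
       |       "isPartOf": {
       |           "@type": "PublicationIssue",
       |           "issueNumber": "4",
       |           "isPartOf": {
       |               "@type": "PublicationVolume",
       |               "volumeNumber": "17",
       |               "isPartOf": {
       |                   "@type": "Periodical",
       |                   "name": "Journal of Examples",
       |               },
       |           },
       |       },
       |   }
       | 
       |   print(json.dumps(article, indent=2))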
        
       | blipmusic wrote:
       | There are also modular approaches like CMDI [0] out there that
       | allow for constructing a metadata profile with a corresponding
       | schema [1]. A single metadata standard won't be able to cover
       | all disciplines regardless: too many free-text fields that
       | can't be properly validated. E.g. if you work with documenting
       | minority languages, language ISO codes are vital (assuming one
       | has been created for the language in question).
       | 
       |   [0]: https://www.clarin.eu/content/component-metadata
       |   [1]: https://www.clarin.eu/content/component-registry-documentation
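       | 
       | A trivial illustration of why coded fields matter: a free-text
       | language name can't be validated, but an ISO 639-3 code can.
       | The sketch leans on the pycountry package for the code list,
       | which is my assumption here, not part of CMDI:
       | 
       |   # Validate a "language" field against ISO 639 via pycountry;
       |   # lookup() raises LookupError for anything it doesn't know.
       |   import pycountry
       | 
       |   def check_language(value):
       |       """Return the ISO 639 record for value, or None."""
       |       try:
       |           return pycountry.languages.lookup(value)
       |       except LookupError:
       |           return None
       | 
       |   for value in ["mhr", "fin", "unspecified local variety"]:
       |       match = check_language(value)
       |       if match:
       |           print(f"{value!r}: valid -> {match.name}")
       |       else:
       |           print(f"{value!r}: not an ISO 639 code, free text")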
        
       | cratermoon wrote:
       | Apropos: Ontology is Overrated -- Categories, Links, and Tags
       | http://digitalcuration.umaine.edu/resources/shirky_ontology_...
        
       | mikewarot wrote:
       | Back when I was an IT Guy at a small marketing company, I did
       | whatever was asked of me (I'm getting paid, so why not).
       | 
       | I got pretty good at data entry, and it's fairly enjoyable.
       | You're not worried about bigger issues than reading scribbles on
       | paper, and trying to get them entered correctly.
       | 
       | It is from this experience that I think they should have done
       | the following:
       | 
       | Sit everyone on the committee down at terminals, and make them
       | do card catalog data entry for a week, just typing into an
       | unformatted text file.
       | 
       | After those dues are paid, then you get to participate for the
       | year. I think contact with the data at hand would have rapidly
       | led to a far better understanding of the required structures.
        
         | cratermoon wrote:
         | Librarians spend years studying just cataloging to be able to
         | take a publication a library acquires, put a Dewey or LoC
         | number on it, and shelve it, and do it all in a way that
         | makes it possible to find it again later when someone wants
         | it.
         | Cataloging is an entire sub-specialty of Library and
         | Information Science. A week of data entry would just scratch
         | the surface.
        
           | mikewarot wrote:
           | I'm not suggesting it would give an encyclopedic knowledge,
           | but it would at least acquaint them with the actual data and
           | categorization systems in use in a very visceral way.
           | 
           | In a related way, I find myself watching many of the old
           | IBM videos about mainframes and the punch card systems that
           | preceded them, to get a better feel for what came before.
        
       | jiggawatts wrote:
       | Dublin Core reminds me of SAML. Designed as a bag of standards
       | parts, not as a complete standard.
        
       | mark_l_watson wrote:
       | I am not sure if I agree with this article. The semantic web
       | and linked data are built using layers of schemas and
       | ontologies, and changing old stuff is a bad idea because links
       | get broken. Additive systems, instead of creating new versions,
       | will always seem a bit messy.
       | 
       | If you look at valuable public KGs like DBpedia and Wikidata
       | you will see these layers of old and older schemas. My argument
       | that this is not such a bad thing does have a powerful
       | refutation: the slow uptake of linked data tech and KGs.
        
       | xwkd wrote:
       | I worked for OCLC for about five years, so I have a lot of
       | sympathy for the pains that archivists go through when trying to
       | standardize content metadata.
       | 
       | Can't be _that_ hard of a problem to solve, right? I mean,
       | we're only talking about coming up with a good solution for
       | cataloging the sum total of all human knowledge.
       | 
       | (Not a small undertaking.)
       | 
       | Libraries are caught between a rock and a hard place. On the
       | one hand, they have to be stewards of this gigantic and ever-
       | growing mound of paper, and on the other hand, they have to
       | deal with the lofty ideas of the W3C and Tim Berners-Lee,
       | trying to connect all of that paper to his early-2000s vision
       | of the ultimate knowledge graph / the semantic web. No wonder
       | we're sticking to MARC 21.
       | 
       | The only people using that web for research are universities. The
       | rest of us are using WHATWG's spawn-of-satan hacked together
       | platforms because the technology always moves faster than
       | catalogued content. Hell, at this point, GPT-3 is probably a
       | better approach to knowledge processing than trying to piece
       | together something actionable from a half baked information graph
       | born of old programmers' utopian fever dreams.
        
         | smitty1e wrote:
         | Sayre's Law: "In any dispute the intensity of feeling is
         | inversely proportional to the value of the issues at stake."
         | 
         | https://en.m.wikipedia.org/wiki/Sayre%27s_law
         | 
         | Was there ever a less essential fiefdom than citation formats?
        
           | dsr_ wrote:
           | Have you noticed that search engines suck?
           | 
           | Citation formats are an attempt to end up with a search
           | engine that doesn't suck. (They can do lots of other things
           | along the way.)
           | 
           | Options:
           | 
           | 1. Accept any format and try to turn it into your own
           | internal representation. Problems: (a) your own
           | representation is a citation format; (b) you need to write an
           | infinite number of converters and hire an infinite number of
           | trained people to work out the edge cases.
           | 
           | 2. Accept a limited number of common formats. Problem: your
           | search engine will not be useful for a majority of the
           | corpus.
           | 
           | 3. Convince everyone that your new citation format is
           | unstoppable. Problems: (a) convincing everyone; (b) actually
           | having that citation format cover, say, 99% of the cases; (c)
           | XKCD#927 (Standards).
           | 
           | Dublin Core is/was a terrible attempt at a type 3 solution.
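           | 
           | Option 1 in miniature: the "own internal representation"
           | is itself just another citation format, and every source
           | format needs its own converter. A sketch, with field names
           | and the converter invented purely for illustration:
           | 
           |   # A normalized citation record plus one of the "infinite
           |   # number of converters" (a BibTeX-ish dict as input).
           |   from dataclasses import dataclass, field
           |   from typing import List, Optional
           | 
           |   @dataclass
           |   class CitationRecord:
           |       title: str
           |       authors: List[str] = field(default_factory=list)
           |       venue: Optional[str] = None   # journal, proceedings
           |       year: Optional[int] = None
           |       pages: Optional[str] = None   # "123-145", "e0123"...
           |       doi: Optional[str] = None
           | 
           |   def from_bibtexish(entry: dict) -> CitationRecord:
           |       """Convert one source format; edge cases omitted."""
           |       year = entry.get("year", "")
           |       return CitationRecord(
           |           title=entry.get("title", ""),
           |           authors=[a.strip()
           |                    for a in entry.get("author", "").split(" and ")
           |                    if a.strip()],
           |           venue=entry.get("journal") or entry.get("booktitle"),
           |           year=int(year) if year.isdigit() else None,
           |           pages=entry.get("pages"),
           |           doi=entry.get("doi"),
           |       )
           | 
           |   print(from_bibtexish({"title": "An example article",
           |                         "author": "Doe, Jane and Roe, John",
           |                         "year": "2010"}))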
        
           | karaterobot wrote:
           | Some people catalog information every day, so it matters to
           | them for practical reasons. Same with people who rely on
           | those resources being accurately cataloged.
        
             | smitty1e wrote:
             | I'm ok with obsessing about the data.
             | 
             | Presentation, an order of magnitude less.
             | 
             | Substance >>> style.
        
               | karaterobot wrote:
               | Cataloging format is to style as database modeling is
               | to... well, style. It's got nothing to do with
               | aesthetics, it's about describing the data in a way that
               | makes it useful later.
        
               | brazzy wrote:
               | You're missing the point. Citation formats are not a
               | matter of style here. It's about making research results
               | easier to find, which directly affects the quality of new
               | research.
        
         | lyaa wrote:
         | The problem with models like GPT-3 is that they are unable to
         | differentiate between information sources with different
         | "trustworthiness." They learn conspiracies and wrong claims and
         | repeat them.
         | 
         | It's possible to feed GPT-3 prompts that encourage it to
         | respond with conspiracies (i.e. "who really caused 9/11?") but
         | it also randomly responds to normal prompts with
         | conspiracies/misinformation.
         | 
         | A recent paper[0] has looked into building a testing dataset
         | for language models' ability to distinguish truth from
         | falsehood.
         | 
         | [0] https://arxiv.org/abs/2109.07958
        
           | pjc50 wrote:
           | > A recent paper[0] has looked into building a testing
           | dataset for language models' ability to distinguish truth
           | from falsehood.
           | 
           | Isn't this a massive category error? Truth or falsehood does
           | not reside _within_ any symbol stream but in the interaction
           | of that stream with observable reality. Does nobody in the AI
           | world know Baudrillard?
        
             | rendall wrote:
             | It should be at least theoretically possible for an AI to
             | identify contradictions, incoherence and inconsistency in a
             | set of assertions. So, not identifying falsehood
             | _per se_, but assigning a fairly accurate likelihood
             | score based solely on the internal logic of the symbol
             | stream. In other words, a bullshit detector.
        
               | nimish wrote:
               | > It should be at least theoretically possible for an AI
               | to identify contradictions, incoherence and inconsistency
               | in a set of assertions
               | 
               | Not in the slightest. Likelihood of veracity is opinion
               | -- laundering it as fact to make some people feel
               | better doesn't make it any less subjective, or any more
               | authoritative.
        
               | rendall wrote:
               | I think we're not disagreeing, exactly.
               | 
               | As a simple example, here is a set of assertions:
               | 
               | * The moon is made of green cheese
               | 
               | * The moon is crystalline rock surrounding an iron core
               | 
               | It wouldn't take an AI to see that both of these can't be
               | true, even if we weren't clear about what a moon is made
               | of, exactly.
               | 
               | Some of our common understanding could contain more
               | complicated internal contradictions that might be harder
               | for a human to tease apart, that an AI might be able to
               | identify.
        
               | akiselev wrote:
               | How would the AI know that "made of green cheese" isn't
               | just another way of saying "crystalline rock surrounding
               | an iron core"? To find a contradiction in a statement
               | like "X is A. X is B" it'd first have to be intelligent
               | enough to know when A != B. In your example, that's not
               | as simple as 1 != 0.
        
             | lyaa wrote:
             | Well, yeah, these models cannot interact with or observe
             | the world to test the veracity of claims. They are
             | language models and the target for them is text
             | production. No one expects them to understand the
             | universe.
             | 
             | My comment was in response to > GPT-3 is probably a better
             | approach to knowledge processing
             | 
             | and the paper is relevant in that it shows the limitation
             | of current language models in terms of logical consistency
             | or measures of the quality of text sources. GPT-3 and other
             | models are not trained for this and obviously they fail at
             | the task. This is evidence against them being a "better
             | approach to knowledge processing."
             | 
             | Even if we trained future models preferentially on the
             | latest and most cited scientific papers, we would still
             | have issues with conflicting claims and
             | incorrect/fabricated results.
             | 
             | However, that does not mean that it would not be
             | practically useful to figure out a way to include some
             | checks or confidence estimates of truthfulness of model
             | training data and responses. Perhaps just training the
             | models to answer that they don't know when the training
             | data is too conflicted would be useful enough.
        
           | zeckalpha wrote:
           | Isn't there the same problem with structured data?
        
           | Zababa wrote:
           | > The problem with models like GPT-3 is that they are unable
           | to differentiate between information sources with different
           | "trustworthiness." They learn conspiracies and wrong claims
           | and repeat them.
           | 
           | Impressive, AI has already reached human levels of
           | understanding.
        
             | lyaa wrote:
             | ...it saddens me how true that is given the last few years.
             | I still like to think most people are capable of a deeper
             | level of understanding.
        
               | Zababa wrote:
               | I don't agree with "given the last few years". Conspiracy
               | theories have always been a thing, now information just
               | flows faster.
        
         | ghukill wrote:
         | >> "Hell, at this point, GPT-3 is probably a better approach to
         | knowledge processing than trying to piece together something
         | actionable from a half baked information graph born of old
         | programmers' utopian fever dreams."
         | 
         | Greatest thing I've read on HN. As a librarian and developer,
         | can confirm. At least in most cases...(slipping back into fever
         | dream)....
        
         | xg15 wrote:
         | > _Hell, at this point, GPT-3 is probably a better approach to
         | knowledge processing than trying to piece together something
         | actionable from a half baked information graph born of old
         | programmers ' utopian fever dreams._
         | 
         | I mean, at this point, wouldn't it be a lot simpler to go back
         | to the middle ages (or earlier) and have a few humans memorize
         | all that stuff? It's not as if GPT3 would give you any more
         | insight than that approach...
        
         | sswaner wrote:
         | "old programmers' utopian fever dreams" - accurate description
         | of me when I first found Dublin Core. Never made it to
         | production. It was unnecessary overhead for describing internal
         | content.
        
         | wvh wrote:
         | Having worked with DC, Marc21 and some of its precursors, I
         | feel that one major problem with the approach is that
         | cataloguers try to infinitely cut up metadata into ever
         | smaller pieces until you end up with an impossibly large
         | collection of fields with an overly high level of specificity
         | and a low level of applicability.
         | 
         | For example, the middle name of a married name of a person that
         | contributed to one part of one edition of the work in a certain
         | capacity etc.
         | 
         | You end up with so much unwieldy, hyper-specific metadata
         | that consistent cataloguing and detailed searching become
         | nigh impossible.
         | 
         | Ever since search engines arrived, the world has moved to
         | mostly flat search rather than highly specific faceted
         | search over deep fields. Of course, one would want something a
         | bit smarter than full text search, but the real, human world is
         | so complex that trying to categorise any data point quickly
         | becomes an exercise in futility.
        
       ___________________________________________________________________
       (page generated 2021-10-02 23:02 UTC)