[HN Gopher] Dublin Core's dirty little secret (2010)
___________________________________________________________________
Dublin Core's dirty little secret (2010)
Author : dcminter
Score : 97 points
Date : 2021-09-30 19:11 UTC (2 days ago)
(HTM) web link (reprog.wordpress.com)
(TXT) w3m dump (reprog.wordpress.com)
| PaulHoule wrote:
| Dublin Core was always trying too hard to not be hard to use. It
| always struck me as a metadata standard for the library at an
| elementary school.
|
 | I was amused at how Adobe's XMP format, which amended DC, used
 | features of RDF that most people were scared to use (ordered
 | collections) to solve the simple problem of keeping track of the
 | order of authors' names on a paper.
|
| The big trouble I see with semweb standards is that they stop
 | long before getting to the goal. RDFS and OWL, for instance, solve
 | some problems of mashing up data from different vocabs but can't
 | do the math to convert inches to meters, which is absolutely
 | essential.
|
 | I was not so amused to see that Adobe thoroughly nerfed XMP and it
 | no longer sucks in EXIF metadata from images. I was hoping I could
 | find the XMP packet, put it through an RDF/XML parser, and get
 | all the metadata from my images, but even in Lightroom images the
 | XMP packet is a nothingburger these days. I guess people
 | complained about the GIF files that had 2k of pixel data and 80k
 | of XMP, and that Adobe was co-opting other standards. Heck, they
 | even deleted most of the content from the standard, including how
 | to express EXIF in RDF, so you can't stick XMP in your database
 | and SPARQL it.
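 |
 | A minimal sketch of that workflow in Python (assuming rdflib is
 | installed, the file still carries an XMP packet, and "photo.jpg"
 | is just a placeholder name):
 |
 |     import re
 |     import rdflib
 |
 |     # Pull the embedded RDF/XML block out of the raw image bytes;
 |     # an XMP packet is plain XML wrapped in xpacket markers.
 |     raw = open("photo.jpg", "rb").read()
 |     m = re.search(rb"<rdf:RDF.*?</rdf:RDF>", raw, re.DOTALL)
 |     if m:
 |         g = rdflib.Graph()
 |         g.parse(data=m.group(0).decode("utf-8", "replace"),
 |                 format="xml")
 |         # In XMP, dc:creator is an rdf:Seq, so author order is
 |         # kept. Dump the triples; from here they can go into any
 |         # store and be queried with SPARQL.
 |         for s, p, o in g:
 |             print(s, p, o)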
| blipmusic wrote:
| Depends on what device generates the image perhaps? DJI Drones
| have a lot of info as EXIF XMP. I just parse that as XML.
|
| EDIT: It's not 80k, but good enough as a GPS log.
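 |
 | A minimal sketch of that kind of parse (xml.etree from the
 | standard library; the filename is a placeholder and the DJI
 | attribute layout is from memory, so treat it as approximate):
 |
 |     import re
 |     import xml.etree.ElementTree as ET
 |
 |     raw = open("DJI_0001.JPG", "rb").read()
 |     m = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", raw, re.DOTALL)
 |     if m:
 |         xml_text = m.group(0).decode("utf-8", "replace")
 |         root = ET.fromstring(xml_text)
 |         rdf_ns = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
 |         # DJI writes GPS and gimbal values as plain attributes
 |         # on rdf:Description, so just walk them.
 |         for desc in root.iter(rdf_ns + "Description"):
 |             for name, value in desc.attrib.items():
 |                 print(name, value)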
| PaulHoule wrote:
| The gif in the story was generated by adobe tools in the bad
| old days, circa 2005. I noticed the nerfing happened in the
| last 5 years when I looked closely at what Lightroom does.
| blipmusic wrote:
 | In terms of wasted space, only 1/8th of that DJI EXIF chunk is
 | actual XMP; the rest is padding (for 13MB JPEGs in our case).
 | The chunk is 8k, so about 1k of actual XMP data, of which a lot
 | is structural. I've only tried with scripts, verifying with
 | exiftool and a hex editor every now and then.
| sgt101 wrote:
 | Was involved with a project once upon a time that spent six
 | months developing ontologies for event ticketing. Not me, thank
 | god, but another team of about eight.
|
| The delivered ontology got used twice or maybe three times, and
| it didn't cover all the corner cases _we knew about_.
| smitty1e wrote:
 | Heard a bloke introduce himself at the onboarding as a "Semantic
 | Ontologist".
|
| His sobriety was only exceeded by his earnestness.
|
| I prayed he'd not become attached to my project, as he looked
| like the sort who could love a problem for weeks without ever
| reaching a conclusion.
| Vinnl wrote:
| Oh wow, having interacted with the SemWeb community quite a
| lot in the past couple of years,
|
| > the sort who could love a problem for weeks without ever
| reaching a conclusion.
|
 | is such an accurate description of a number of people who are
 | part of it. Possibly the root cause of all the problems that led
 | to it never taking off at the scale they'd envisioned.
| tasogare wrote:
| The semantic web is a non-solution in eternal search of
| non-problems.
| jdixnneij wrote:
 | Sounds like the eternal student; they just don't work in
 | business. Although, seemingly, I keep getting lumped in with the
 | useless, persistent question-askers. Sometimes I feel like I'm
 | discussing shit with David Attenborough. Just get on with the
 | job. _ends rant_
| jdixnneij wrote:
| Truth hurts. Good luck on your third phd
| tuukkah wrote:
| These days, you can use Wikidata and the related tooling such as
| Scholia. Here's a starting point to their ontology, including the
| difficult details such as the "page(s)" qualifier for the
| property "published in":
| https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_Me...
|
| Scholia: https://www.wikidata.org/wiki/Wikidata:Scholia
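 |
 | A minimal sketch of querying that model over the public SPARQL
 | endpoint (Python with requests; the property IDs here are from
 | memory, so double-check them against the project page):
 |
 |     import requests
 |
 |     # "published in" (P1433) as a full statement, so the
 |     # "page(s)" (P304) qualifier is reachable; P356 is the DOI.
 |     query = """
 |     SELECT ?article ?venueLabel ?pages ?doi WHERE {
 |       ?article p:P1433 ?stmt .
 |       ?stmt ps:P1433 ?venue .
 |       OPTIONAL { ?stmt pq:P304 ?pages . }
 |       OPTIONAL { ?article wdt:P356 ?doi . }
 |       SERVICE wikibase:label {
 |         bd:serviceParam wikibase:language "en" .
 |       }
 |     } LIMIT 10
 |     """
 |     r = requests.get(
 |         "https://query.wikidata.org/sparql",
 |         params={"query": query, "format": "json"},
 |         headers={"User-Agent": "metadata-sketch/0.1"},
 |     )
 |     for b in r.json()["results"]["bindings"]:
 |         print(b["venueLabel"]["value"],
 |               b.get("pages", {}).get("value"),
 |               b.get("doi", {}).get("value"))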
| wheybags wrote:
| Was 90% sure this was going to be an article about homelessness
| in Dublin city centre (the real Dublin, not some imitator in Ohio
| :p)
| thinkingemote wrote:
 | I think one of the issues is that librarians work with things
| that have been made and do not change. The catalogues also
| traditionally did not change, but there's no reason now why this
| should be the case.
|
 | If we look at Wikidata or OpenStreetMap, for example, the way
 | things are categorised changes and evolves: some things stay the
 | same, new things get added, some things get deprecated, and bots
 | or humans can update existing records while machines handle any
 | deprecations.
 |
 | Far better, then, to start simple, get adoption, and update
 | records as you go, I think.
| dredmorbius wrote:
| One of the particularly notable things about libraries and
| catalogues is how much _does_ change.
|
 | There are on the order of 1--5 million new books published in
 | English each year (about 300k as "traditional" publications, a
 | fairly constant level since the 1950s), the largest collections
 | have tens of millions of volumes, the cataloguing technology
 | and standards have changed tremendously over the past 50 years,
 | and the role and usage of libraries themselves is in
 | constant flux. At the same time, budgets and staff are small
 | relative to the total record store.
|
| The challenge of organising, classifying, and _making
| accessible_ all human knowledge has occupied numerous great
 | minds for centuries. We're familiar with Google's take,
| marrying full-text search with a document-ranking model. That
| worked relatively well for about 20 years. It's now showing
| extreme signs of stress.
|
| At the same time, much of the past (and present) traditionally-
| published media is finding its way online, whether the original
| publishers and established libraries like it or not.
|
| LibGen and SciHub may have far more to say about the future of
| cataloguing standards than the OCLC.
| cratermoon wrote:
| Oh and how those categories change and are subject to human
| error and bias.
|
| Library of Congress Class D: WORLD HISTORY AND HISTORY OF
| EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
|
| Subclass DB: History of Austria. Liechtenstein. Hungary.
| Czechoslovakia
|
| https://www.loc.gov/aba/cataloging/classification/lcco/lcco_.
| ..
| dredmorbius wrote:
| Right.
|
 | If you dig into the LoC Classification, what you'll find are
 | sections, some large, some small, that have been entirely
 | retired or superseded.
|
| Geography is only one of those. Psychology has been huge,
| notably as regards human sexuality and gender. Technology
| and business, in which change and obsolescence are quite
| rapid, another. (Both have significantly co-evolved with
| the development of cataloguing systems and library science
| itself.) Emerging domains, particularly computer science
| and information systems, have struggled to find where they
| fit within the classification. And different domains are
| both idiosyncratic and specific to the Library's interests.
| For the LoCCS, US history and geography are overrepresented
| relative to the rest of the world. And the entire Law
| classification, much of it relatively reasonably developed,
 | has a markedly different style and character than the rest
| of the classification.
| cratermoon wrote:
| > US history and geography are overrepresented relative
| to the rest of the world.
|
 | Yeah, in the document I linked, all of Africa gets lumped
 | into the single subclass DT. But Subclass DX is
 | "Romanies"; it was formerly "History of Gypsies", btw:
 | https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...
| dredmorbius wrote:
| On "gypsies", one of numerous nomenclature changes.
| dredmorbius wrote:
| ... which is to say: what _mature_ cataloguing practices
| have developed _is a change management process_. A
| recognition that boundaries shift (metaphorical,
| physical, and political), knowledge evolves, and
| terminology may come to be outdated or even prejudicial.
|
| LoCCS isn't perfect. But it has developed the self-
| awareness, capabilities, institutions, and processes to
| change and adapt without destroying itself in the
| process.
| cratermoon wrote:
 | But why do the Romani get an _entire subclass_, when
 | Africa, ALL of Africa, is also just one subclass?
| amichal wrote:
 | Dublin Core has always been rough and strange, and I'm SURE actual
 | librarians can tell me all sorts of stories about complex things,
 | but schema.org has a fine model at first glance:
 | https://schema.org/ScholarlyArticle. Check out the MLA example at
 | the end, and how DOIs and articles spanning multiple issues are
 | handled, etc.
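 |
 | A minimal sketch of that markup as JSON-LD, built in Python (the
 | values are invented; property choices follow the schema.org docs
 | but are illustrative rather than canonical):
 |
 |     import json
 |
 |     article = {
 |         "@context": "https://schema.org",
 |         "@type": "ScholarlyArticle",
 |         "name": "An Example Paper",
 |         # A plain list, so author order is kept.
 |         "author": [
 |             {"@type": "Person", "name": "Ada Lovelace"},
 |             {"@type": "Person", "name": "Charles Babbage"},
 |         ],
 |         "isPartOf": {
 |             "@type": "PublicationIssue",
 |             "issueNumber": "4",
 |             "isPartOf": {
 |                 "@type": "PublicationVolume",
 |                 "volumeNumber": "12",
 |                 "isPartOf": {
 |                     "@type": "Periodical",
 |                     "name": "Example Journal",
 |                 },
 |             },
 |         },
 |         "pageStart": "101",
 |         "pageEnd": "118",
 |         "sameAs": "https://doi.org/10.1234/example",  # DOI as URL
 |     }
 |     print(json.dumps(article, indent=2))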
| blipmusic wrote:
 | There are also modular approaches like CMDI [0] out there, which
 | allow for constructing a metadata profile with a corresponding
 | schema [1]. A single metadata standard won't be able to cover all
 | disciplines regardless: too many free-text fields that can't be
 | properly validated. E.g. if you work on documenting minority
 | languages, language ISO codes are vital (assuming one has been
 | created for the language in question).
 |
 | [0]: https://www.clarin.eu/content/component-metadata
 | [1]: https://www.clarin.eu/content/component-registry-documentation
| cratermoon wrote:
| Apropos: Ontology is Overrated -- Categories, Links, and Tags
| http://digitalcuration.umaine.edu/resources/shirky_ontology_...
| mikewarot wrote:
| Back when I was an IT Guy at a small marketing company, I did
| whatever was asked of me (I'm getting paid, so why not).
|
| I got pretty good at data entry, and it's fairly enjoyable.
| You're not worried about bigger issues than reading scribbles on
| paper, and trying to get them entered correctly.
|
| It is from this experience, that I think they should have done
| the following:
|
 | Sit everyone on the committee down at terminals, and make
 | them do card catalog data entry for a week, just typing into an
 | unformatted text file.
 |
 | After those dues are paid, then you get to participate for the
 | year. I think contact with the data at hand would have rapidly
 | led to a far better understanding of the required structures.
| cratermoon wrote:
| Librarians spend years studying just cataloging to be able to
| take a publication a library acquires, put a Dewey or LoC
 | number on it, and shelve it, and do it all in a way that makes
 | it possible to find it again later when someone wants it.
| Cataloging is an entire sub-specialty of Library and
| Information Science. A week of data entry would just scratch
| the surface.
| mikewarot wrote:
| I'm not suggesting it would give an encyclopedic knowledge,
| but it would at least acquaint them with the actual data and
| categorization systems in use in a very visceral way.
|
| In a related way, I find myself watching many of the old IBM
| videos about mainframes and the punch card systems that
 | preceded them, to have a better feel for what came before.
| jiggawatts wrote:
| Dublin Core reminds me of SAML. Designed as a bag of standards
| parts, not as a complete standard.
| mark_l_watson wrote:
| I am not sure if I agree with this article. The semantic web and
 | linked data are built using layers of schemas and ontologies, and
| changing old stuff is a bad idea because links get broken.
| Additive systems, instead of creating new versions, will always
| seem a bit messy.
|
| If you look at valuable public KGs like DBPedia and WikiData you
| will see these layers of old and older schemas. My argument that
| this is not such a bad thing does have a powerful refutation: the
| slow uptake of linked data tech and KGs.
| xwkd wrote:
| I worked for OCLC for about five years, so I have a lot of
| sympathy for the pains that archivists go through when trying to
| standardize content metadata.
|
 | Can't be _that_ hard of a problem to solve, right? I mean, we're
| only talking about coming up with a good solution for cataloging
| the sum total of all human knowledge.
|
| (Not a small undertaking.)
|
| Libraries are caught between a rock and hard place. On the one
| hand, they have to be stewards of this gigantic and ever growing
| mound of paper, and on the other hand, they have to deal with the
| lofty ideas of the W3C and Tim Berners-Lee, trying to connect all
| of that paper to his early 2000s vision of the ultimate knowledge
| graph / the semantic web. No wonder we're sticking to MARC 21.
|
| The only people using that web for research are universities. The
| rest of us are using WHATWG's spawn-of-satan hacked together
| platforms because the technology always moves faster than
| catalogued content. Hell, at this point, GPT-3 is probably a
| better approach to knowledge processing than trying to piece
| together something actionable from a half baked information graph
| born of old programmers' utopian fever dreams.
| smitty1e wrote:
| Sayre's Law: "In any dispute the intensity of feeling is
| inversely proportional to the value of the issues at stake."
|
| https://en.m.wikipedia.org/wiki/Sayre%27s_law
|
| Was there ever a less essential fiefdom than citation formats?
| dsr_ wrote:
| Have you noticed that search engines suck?
|
| Citation formats are an attempt to end up with a search
| engine that doesn't suck. (They can do lots of other things
| along the way.)
|
| Options:
|
| 1. Accept any format and try to turn it into your own
| internal representation. Problems: (a) your own
| representation is a citation format; (b) you need to write an
| infinite number of converters and hire an infinite number of
| trained people to work out the edge cases.
|
| 2. Accept a limited number of common formats. Problem: your
| search engine will not be useful for a majority of the
| corpus.
|
| 3. Convince everyone that your new citation format is
| unstoppable. Problems: (a) convincing everyone; (b) actually
| having that citation format cover, say, 99% of the cases; (c)
| XKCD#927 (Standards).
|
| Dublin Core is/was a terrible attempt at a type 3 solution.
| karaterobot wrote:
| Some people catalog information every day, so it matters to
| them for practical reasons. Same with people who rely on
| those resources being accurately cataloged.
| smitty1e wrote:
| I'm ok with obsessing about the data.
|
| Presentation, an order of magnitude less.
|
| Substance >>> style.
| karaterobot wrote:
| Cataloging format is to style as database modeling is
| to... well, style. It's got nothing to do with
| aesthetics, it's about describing the data in a way that
| makes it useful later.
| brazzy wrote:
| You're missing the point. Citation formats are not a
| matter of style here. It's about making research results
| easier to find, which directly affects the quality of new
| research.
| lyaa wrote:
| The problem with models like GPT-3 is that they are unable to
| differentiate between information sources with different
| "trustworthiness." They learn conspiracies and wrong claims and
| repeat them.
|
| It's possible to feed GPT-3 prompts that encourage it to
| respond with conspiracies (i.e. "who really caused 9/11?") but
| it also randomly responds to normal prompts with
| conspiracies/misinformation.
|
| A recent paper[0] has looked into building a testing dataset
 | for language models' ability to distinguish truth from
| falsehood.
|
| [0] https://arxiv.org/abs/2109.07958
| pjc50 wrote:
| > A recent paper[0] has looked into building a testing
 | dataset for language models' ability to distinguish truth from
| falsehood.
|
| Isn't this a massive category error? Truth or falsehood does
| not reside _within_ any symbol stream but in the interaction
| of that stream with observable reality. Does nobody in the AI
| world know Baudrillard?
| rendall wrote:
| It should be at least theoretically possible for an AI to
| identify contradictions, incoherence and inconsistency in a
| set of assertions. So, not identifying falsehood _per se_ ,
| but assigning a fairly accurate likelihood score based
| solely on the internal logic of the symbol stream. In other
| words, a bullshit detector.
| nimish wrote:
| > It should be at least theoretically possible for an AI
| to identify contradictions, incoherence and inconsistency
| in a set of assertions
|
| Not in the slightest. Likelihood of veracity is opinion
| -- laundering it as fact to make some people feel better
 | doesn't make it any less subjective, or any more authoritative.
| rendall wrote:
| I think we're not disagreeing, exactly.
|
| As a simple example, here is a set of assertions:
|
| * The moon is made of green cheese
|
| * The moon is crystalline rock surrounding an iron core
|
| It wouldn't take an AI to see that both of these can't be
| true, even if we weren't clear about what a moon is made
| of, exactly.
|
| Some of our common understanding could contain more
| complicated internal contradictions that might be harder
| for a human to tease apart, that an AI might be able to
| identify.
| akiselev wrote:
| How would the AI know that "made of green cheese" isn't
| just another way of saying "crystalline rock surrounding
| an iron core"? To find a contradiction in a statement
| like "X is A. X is B" it'd first have to be intelligent
| enough to know when A != B. In your example, that's not
| as simple as 1 != 0.
| lyaa wrote:
 | Well, yeah, these models cannot interact with and observe the
 | world to test the veracity of claims. They are language
| models and the target for them is text production. No one
| expects them to understand the universe.
|
| My comment was in response to > GPT-3 is probably a better
| approach to knowledge processing
|
| and the paper is relevant in that it shows the limitation
| of current language models in terms of logical consistency
| or measures of the quality of text sources. GPT-3 and other
| models are not trained for this and obviously they fail at
| the task. This is evidence against them being a "better
| approach to knowledge processing."
|
| Even if we trained future models preferentially on the
 | latest and most cited scientific papers, we would still have
| issues with conflicting claims and incorrect/fabricated
| results.
|
| However, that does not mean that it would not be
| practically useful to figure out a way to include some
| checks or confidence estimates of truthfulness of model
| training data and responses. Perhaps just training the
| models to answer that they don't know when the training
| data is too conflicted would be useful enough.
| zeckalpha wrote:
| Isn't there the same problem with structured data?
| Zababa wrote:
| > The problem with models like GPT-3 is that they are unable
| to differentiate between information sources with different
| "trustworthiness." They learn conspiracies and wrong claims
| and repeat them.
|
| Impressive, AI has already reached human levels of
| understanding.
| lyaa wrote:
| ...it saddens me how true that is given the last few years.
| I still like to think most people are capable of a deeper
| level of understanding.
| Zababa wrote:
| I don't agree with "given the last few years". Conspiracy
| theories have always been a thing, now information just
| flows faster.
| ghukill wrote:
| >> "Hell, at this point, GPT-3 is probably a better approach to
| knowledge processing than trying to piece together something
| actionable from a half baked information graph born of old
| programmers' utopian fever dreams."
|
| Greatest thing I've read on HN. As a librarian and developer,
| can confirm. At least in most cases...(slipping back into fever
| dream)....
| xg15 wrote:
| > _Hell, at this point, GPT-3 is probably a better approach to
| knowledge processing than trying to piece together something
| actionable from a half baked information graph born of old
 | programmers' utopian fever dreams._
|
| I mean, at this point, wouldn't it be a lot simpler to go back
| to the middle ages (or earlier) and have a few humans memorize
| all that stuff? It's not as if GPT3 would give you any more
| insight than that approach...
| sswaner wrote:
| "old programmers' utopian fever dreams" - accurate description
| of me when I first found Dublin Core. Never made it to
| production. It was unnecessary overhead for describing internal
| content.
| wvh wrote:
| Having worked with DC, Marc21 and some of its precursors, I
| feel that one major problem with the approach is that
| cataloguers try to infinitely cut up metadata into ever smaller
 | pieces until you end up with an impossibly large collection of
 | fields with an overly high level of specificity and a low level
 | of applicability.
|
| For example, the middle name of a married name of a person that
| contributed to one part of one edition of the work in a certain
| capacity etc.
|
 | You end up with so much unwieldy, overly specific metadata that
| consistent cataloguing and detailed searching become nigh
| impossible.
|
 | Ever since search engines arrived, the world has moved to using
| mostly flat search engines rather than a highly specific facet
| search for deep fields. Of course, one would want something a
| bit smarter than full text search, but the real, human world is
| so complex that trying to categorise any data point quickly
| becomes an exercise in futility.
___________________________________________________________________
(page generated 2021-10-02 23:02 UTC)