https://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/

The Reinvigorated Programmer

Bibliographic data, part 2: Dublin Core's dirty little secret

Posted on September 3, 2010

[This is part two in a series -- you should read part 1 first for context and then you might go on to part 3.]

The Dublin Core -- metadata made dumb

Just when librarians were in despair of ever getting their data out to the world in a form it could understand, along came the Dublin Core (DC for short) -- a simple set of fifteen metadata elements (contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type) that could be used to describe "document-like objects" such as books, journal articles and web pages.

Everyone in the library world got really excited about the Dublin Core for about three weeks in 1999, before realising that you can't actually do anything with those elements beyond expressing author (called "creator"), title and date. Everything else was too vague to be of any use -- coverage, anyone? Relation? Format?

If you don't believe me, try translating a reference to a journal article into DC -- for example, this one that we used in the previous article:

Taylor, Michael P. and Darren Naish. 2007. An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. Palaeontology 50 (6): 1547-1564.
doi:10.1111/j.1475-4983.2007.00728.x

I can easily see how to map the author, date and articleTitle, but not the journalTitle, volume, issue, startPage, endPage or DOI. So only three of the nine elements, one third, are representable.

The Dublin Core people quickly realised that while the fifteen core elements are OK for describing web-pages (which is frankly what they were designed for, despite all the "cross-domain" rhetoric), they were not much use for describing, well, anything else. Not beyond the absolute basics, anyway.

[Oh, and by the way: there was, and is, no standard XML format for Dublin Core, merely guidelines for how to roll your own. Just in case you were wondering. There are standard element names to use but no standard wrapper element to represent the record as a whole.]

Qualified Dublin Core -- metadata made slightly less dumb

The solution to the paucity of Dublin Core elements was this thing called "qualified Dublin Core" (although that term doesn't seem to be used much any more), in which the fifteen core elements are qualified to make them more specific -- for example, dateAccepted, dateAvailable and dateCopyrighted are refinements of the core element date. According to the Dublin Core's own dumb down principle, "a client should be able to ignore any qualifier and use the value as if it were unqualified [...] Qualification is therefore supposed only to refine, not extend the semantic scope of an Element."

Sounds good, right? Except:

* There is still no canonical XML representation for Dublin Core records, only canonical XML element names for Dublin Core elements.
* The XML representation of dateAccepted is not, as you might expect, the core date element with a qualifier attached, but a separate element with a name of its own, which means you can't implement the dumb down principle just by discarding qualifiers: you need to encode specific knowledge of how to map "qualified" to core DC elements in your application. In other words, "qualified Dublin Core" is not qualified at all.
* The dcterms namespace has its own instances of the fifteen core elements, so when you want to add a contributor, you have to choose (how?) between dc:contributor and dcterms:contributor.

All of this, inexplicable though it may appear, would perhaps be tolerable. Were it not for the core incompetence of the Dublin Core model. And here at last we come to the promised Dirty Little Secret ...

Even qualified Dublin Core can't describe a journal article

When I first heard this, I flatly refused to believe it. It seemed impossible that anyone could design a metadata element set for describing documents and have it not able to describe a journal article. But it is, amazingly, quite true. When I made my best effort to render the reference above into Qualified Dublin Core, I found that I was able to represent only one additional field (the DOI, and that not very well) beyond the three basic elements (authors, date, title) that basic Dublin Core allowed me.

Critics, with the exception of Oscar Wilde, seem mostly to agree that the death of Little Nell (in Dickens's Old Curiosity Shop) is one of the saddest passages in literature. Personally, I lean more towards the separation of Rose and the Doctor at the end of Doomsday (you know, before the They Can Never See Each Other Again Because The Path Between Universes Has Closed Forever thing got downgraded to She Can't Appear Again Until The Fourth Season Due To Other Work Commitments). Others may cite the ending of Old Yeller or the departure of the ring-bearers to the Undying Lands after the scouring of the Shire. But for me, the most tragic document ever written is Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata: four and a half thousand words of desperate flailing that could have been summarised as "Don't even bother trying, it just doesn't work".

Turns out that the Qualified Dublin Core solution to the problem of citing journal articles was to add -- get ready for this -- a bibliographicCitation element. Oh, joy!
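The dumb-down complaint from the previous section can be made concrete. What follows is only a minimal sketch, not anything taken from the Dublin Core documents: `CORE`, `REFINES` and `dumb_down` are hypothetical names, and apart from the three date refinements named earlier, the entries in the lookup table are an assumption based on how the DCMI terms are usually described. The point it illustrates is that a name like issued gives no textual clue that it refines date, so a client cannot "ignore the qualifier"; it needs a table.

```python
# The fifteen core element names listed at the top of the post.
CORE = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Hypothetical lookup table: qualified term -> core element it refines.
# Only the three date* entries are stated in the post; the rest are assumed.
REFINES = {
    "dateAccepted": "date",
    "dateAvailable": "date",
    "dateCopyrighted": "date",
    "issued": "date",
    "isPartOf": "relation",
    "bibliographicCitation": "identifier",
}


def dumb_down(record):
    """Reduce (name, value) pairs to core elements; drop unknown names."""
    result = []
    for name, value in record:
        if name in CORE:
            result.append((name, value))
        elif name in REFINES:
            result.append((REFINES[name], value))
        # Anything else (journalTitle, volume, ...) has no core home at all.
    return result


if __name__ == "__main__":
    sample = [
        ("title", "An unusual new neosauropod dinosaur"),
        ("issued", "2007"),
        ("journalTitle", "Palaeontology"),
    ]
    for name, value in dumb_down(sample):
        print(name, "=", value)
```

Note that the journal title simply falls on the floor: there is no row in the table that could rescue it.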
And so the introduction of the Guidelines document concludes with the observation:

Before the introduction of the Dublin Core term 'bibliographicCitation' it was not obvious how to describe fully a journal article using Dublin Core metadata. There was no suitable Dublin Core property to capture the journal title, as distinct from the article title, or the volume, issue and page details, other than as part of a general description.

Thank heavens that's changed! Now, instead of shoving the journal title, volume, issue and page details into an undifferentiated lump of text in the description field, we can shove the journal title, volume, issue and page details into an undifferentiated lump of text in the bibliographicCitation field!

This, let me remind you, in a specification that includes SEVENTY data elements -- the original fifteen core elements, plus 55 added in Qualified DC, of which 15 are duplicates of the originals. And in those 70 elements they couldn't make room for journal title? Seriously?

The official, sanctioned, allegedly interoperable encoding of my perfectly simple article citation into Dublin Core

Here it is, folks, based on the Guide. Read it and weep:

Michael P. Taylor
Darren Naish
2007
An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England.
urn:ISSN:0081-0239
Blackwell
Text
Palaeontology 50(6), 1547-1564. (2007)
info:doi:10.1111/j.1475-4983.2007.00728.x

It makes me want to cry. Note that:

* There is still no standard XML format for Dublin Core records, so I had to make up my own wrapper element (which of course can't be in either of the two DC namespaces).
* For the actual elements, I am supposed to use a mixture of elements from dc and dcterms namespaces.
* The element containing the publication date is not called publicationDate or datePublished, nor even issuedDate or dateIssued, but just issued -- unlike, for example, dateSubmitted or dateAccepted.
* The best I can do by way of trying to express the journal title is to use the dcterms:isPartOf element and give it the ISSN of the journal (wrapped up as a URI), in the hope that whoever uses this record will go and look that ISSN up to find out what journal it pertains to.
* The publisher is considered an important part of the citation (unlike, say, the journal title, volume, issue or page-range) despite the fact that journal-article citations never include the publisher.
* It's considered important to state that the type of the referenced item is Text.
* The type "Text" is drawn from a vocabulary whose URI is known (I got it from the Guide) but I couldn't figure out what XML attribute I am supposed to use to point to that URI.

And of course all of this is on top of the utterly baffling brain-damage that is the bibliographicCitation element.

And by the way, if the sample bibliographicCitation above doesn't seem too dreadful to you, then consider this sample Big Undifferentiated Blob Of Text, straight from the Guide:

Proceedings of the International Conference on Dublin Core and metadata for e-communities, 2002; DC-2002: Metadata for e-Communities: Supporting Diversity and Convergence, Florence, Italy, 13-17 October 2002, pp 71-80

bibliographicCitation format: the pain, the glory, the other pain

But at least the client software can reliably parse the journal title, volume, issue, start-page and end-page out of the bibliographicCitation, right? I mean, it must be in a standard format, right? Right?

Viewers of a sensitive disposition might wish to look away now. Here's what section 2.2 of the Guide says:

Plain text citations may be according to a recognised citation style. Several styles were reviewed by the DCMI Citation Working Group, and are listed on a Citation Styles page, but there is no particular recommendation for choice of style.
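For anyone who wants to see the shape of the record that the notes above describe, here is a minimal sketch that rebuilds it with Python's standard xml.etree.ElementTree. Several things in it are assumptions rather than anything the Guide dictates: the wrapper name `record` is made up (there is no standard one), and the choice of the dc namespace for creator, title, publisher, type and identifier is a guess. Only issued, isPartOf and bibliographicCitation are placed in dcterms with any confidence.

```python
import xml.etree.ElementTree as ET

# The two Dublin Core namespaces (element set 1.1 and DCMI terms).
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)

# (namespace, element name, value), in the order the values appear in the post.
# Which namespace each plain element belongs in is an assumption.
FIELDS = [
    (DC, "creator", "Michael P. Taylor"),
    (DC, "creator", "Darren Naish"),
    (DCTERMS, "issued", "2007"),
    (DC, "title", "An unusual new neosauropod dinosaur from the Lower "
                  "Cretaceous Hastings Beds Group of East Sussex, England."),
    (DCTERMS, "isPartOf", "urn:ISSN:0081-0239"),
    (DC, "publisher", "Blackwell"),
    (DC, "type", "Text"),
    (DCTERMS, "bibliographicCitation", "Palaeontology 50(6), 1547-1564. (2007)"),
    (DC, "identifier", "info:doi:10.1111/j.1475-4983.2007.00728.x"),
]


def build_record():
    """Build the record under a made-up, namespace-less wrapper element."""
    root = ET.Element("record")  # hypothetical wrapper name
    for namespace, name, text in FIELDS:
        ET.SubElement(root, f"{{{namespace}}}{name}").text = text
    return root


record = build_record()

if __name__ == "__main__":
    print(ET.tostring(record, encoding="unicode"))
```

Nine child elements, and the journal title, volume, issue and pages still live only inside the text of one of them.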
And indeed the two sample bibliographicCitation examples above are in noticeably different formats even allowing that one is for a journal article and the other for a paper in a proceedings volume -- for example, the date is parenthesised in the former but not in the latter.

Oh, and from section 2.1:

Other details of the resource, such as its title and creators, will be described using the usual Dublin Core properties. Optionally, but redundantly, these details may be included in the citation as well.

In other words, any old crap can be shoved in the bibliographicCitation field.

So let's review: the official way to represent journal title, volume number, issue number, start page and end page in the 70-element Qualified Dublin Core set is: jam them, and quite possibly some other data you happen to have lying around, together into a text blob in any format you happen to feel like.

Of course, for a computer reading the XML to make any use of this information, it will need to parse the bibliographicCitation to figure out what the journal title, etc., are. But since that field can contain any combination of elements in any format, any parser will need to try all sorts of heuristics to match the format and figure out which bits represent what data. Which of course is exactly what you'd have to do if all you had to work with was the plain-text citation that we started this article with, long, long ago.

To summarise: Qualified Dublin Core, with its 70 fields, is no more useful for expressing journal-article citations than plain text.

Oh, am I shouting? Sorry.

Appendix. Don't even get me started on the use of the OpenURL 1.0 (ANSI/NISO Z39.88) ContextObject KEV format as an alternative for the content of the field

Having written that heading, I feel no need to expand further on it.

OK, I'm out of here. I need to take a shower. Tune in next time for yet more pain.
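To give a feel for the heuristics a client ends up writing, here is a rough sketch that handles exactly the two sample citations quoted above and nothing else. The two regular expressions and the names `JOURNAL_STYLE`, `PROCEEDINGS_STYLE` and `parse_citation` are hypothetical, written by hand for those two strings; they are not a format that the Guide defines, which is rather the point.

```python
import re

# One hand-written pattern per sample citation style quoted in the post.
JOURNAL_STYLE = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\d+)\s*\((?P<issue>[^)]+)\),\s*"
    r"(?P<spage>\d+)-(?P<epage>\d+)\.\s*\((?P<year>\d{4})\)$"
)
PROCEEDINGS_STYLE = re.compile(
    r"^(?P<container>.+),\s*pp\.?\s*(?P<spage>\d+)-(?P<epage>\d+)$"
)

STYLES = (("journal", JOURNAL_STYLE), ("proceedings", PROCEEDINGS_STYLE))


def parse_citation(text):
    """Try each known style in turn; return a dict of fields, or None."""
    for style, pattern in STYLES:
        match = pattern.match(text.strip())
        if match:
            return {"style": style, **match.groupdict()}
    return None  # a style nobody has written a pattern for yet


if __name__ == "__main__":
    print(parse_citation("Palaeontology 50(6), 1547-1564. (2007)"))
    # A perfectly reasonable third style defeats both patterns:
    print(parse_citation("Palaeontology 50: 1547-1564"))
```

Every new citation style a data provider happens to prefer means another pattern, and nothing in the record says which one to try.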
22 responses to "Bibliographic data, part 2: Dublin Core's dirty little secret"

1. Pingback: Bibliographic data, part 1: MARC and its vile progeny | The Reinvigorated Programmer

2. Michel S. | September 5, 2010 at 10:52 am

One would think they could learn a thing or two from BibTeX ...

3. Pingback: Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification? | The Reinvigorated Programmer

4. Chris Purcell | September 6, 2010 at 7:05 am

You're supposed to use RDF (1.0) to encode DC. That way, OWL gives you the subtype and equality information.

5. Khairul | September 6, 2010 at 8:43 am

Could this be because your standard brick-and-mortar library catalogs physical items? I mean, it makes sense that a library can lend out its print copy of a journal issue, but how does a library lend out a journal article?

6. Mike Taylor | September 6, 2010 at 9:23 am

Thanks, all, for some interesting comments.

Chris, I don't think bringing on RDF and OWL helps very much here, because the actual fieldnames just don't exist. By saying that a journal article isPartOf a journal issue, you can then use the dc:title element to specify the journal name, but (A) that doesn't help with the volume number, issue number, and page-range; and (B) it's not really right anyway, since that specific issue doesn't have a title, it's an instance of a journal that has a title. If you have to say that the article isPartOf an issue and that the issue isInstanceOf a journal just so you can give the journal title, then things have gone very, very wrong -- it's information architecture astronautics gone wild.
Khairul, you are almost certainly right that the poor support for journal articles in library standards goes back to brick-and-mortar days. But (A) that excuse would have been less unacceptable in 2000 than it is now, and (B) that doesn't explain the lack of volume and issue fields, which you'd surely need for cataloguing hardcopy issues.

7. Khairul | September 6, 2010 at 10:00 am

Unless there exist journals that can have multiple issues in a single day, a date and title uniquely identify a journal issue. For retrieval purposes, I suppose that's all that is needed.

8. Mike Taylor | September 6, 2010 at 10:15 am

The date that is included in citations is almost always a year alone -- not 12 March 1968, but just 1968. And the great majority of journals publish more than one issue a year, so "date" as given is certainly not an adequate substitute for volume and issue.

9. Khairul | September 6, 2010 at 1:41 pm

To go from citation to journal issue, one could pull up all the issues for the year and count, but that does sound rather inefficient.

10. Chris Purcell | September 7, 2010 at 12:33 am

Mike: Sure, I appreciate RDF/OWL don't solve your main problem, just wanted to address some of your sub-problems!

11. Pingback: Datos bibliograficos, parte 1: MARC y su vil progenie | BaDoc

12. Douglas Campbell | November 14, 2010 at 10:55 pm

Looking at the background discussions, the DC Citation Guidelines were developed in response to the question: How do I quickly add the journal details into an article's DC record? It seems no one has ever actually said to DC that they need to encode a full journal article citation - maybe you should? Since those guidelines came out, a couple of schemas have come out that play nicely with DC - Bibliographic Ontology http://bibliontology.com/ and PRISM http://www.prismstandard.org/ - they might do the trick?

13. Pingback: PSNC Digital Libraries Team >> Dublin Core's dirty little secret

14.
Andy Powell | December 15, 2010 at 6:23 pm

Sorry... I'm rather late to this. It's very funny... but I think you are confused. (Being confused about DC is fine by the way... most people are (including me)).

DC has evolved to be used as an RDF vocabulary. It didn't start out that way of course... because the original 15 'elements' pre-dated RDF. It started out as a set of 'attributes' to be used in the HTML meta tag, flirted briefly with XML (not least in the form of the OAI-PMH) and finally emerged into the butterfly of an RDF vocabulary. I use the word 'butterfly' loosely.

To complain that DC doesn't work as an XML language is like complaining that concrete makes bad cakes. You're right... but so what!?

Part of the reason you are confused is because DCMI itself is confused. The long history of DC has left a lot of people with differing views about where DC sits in that HTML/XML/RDF spectrum. Indeed, many people 'inside' the DC camp consider that DC should function across the whole piste. The trouble is, in trying to do so DC becomes jack of all trades, master of none.

My view (a view that is shared by a few others but that is also violently disagreed with by many others) is that DC now has to be viewed pretty much solely as an RDF vocabulary. The properties are now declared using RDFS for example.

If viewed in that way... many of your complaints above disappear. Sure, DC doesn't have all the properties necessary to capture a full citation. So what? It was never intended to. The whole point of RDF is that others can come up with such a set of properties and use them inter-mixed with DC (or on their own) as necessary. Small pieces loosely joined and all that.

All IMHO of course. (BTW, I hate bibliographicCitation as well).

15. Mike Taylor | December 16, 2010 at 1:00 pm

I don't know, Andy. This "it's just for RDF" thing sounds like post-hoc pleading to me.
There is literally nothing about RDF on the Dublin Core home page http://dublincore.org/ -- if that really was its whole raison d'etre, wouldn't you expect to see it at least mentioned?

And let's not forget that even if you leap into the RDF swamp and model stuff like journal title as the title of a separate "The Journal" object that is linked to the article using isPartOf, that still doesn't get you anywhere near everything you need. Even a fully RDF-bedrangled bibliographic reference would be missing such core information as the volume, issue and page-range.

All of this comes from the horrible tendency of library scientists such as your good self and, well, me, to want to model everything. This always -- always -- results in confusion, yet we never seem to learn: OpenURL 1.0, RDA, FRBR, Dublin Core/RDF scholarly references ... All that infrastructure, all that learning curve, and still we can't do the trivial thing that the RIS format has been happily doing since the dawn of time:

TY - JOUR
AU - Taylor, Michael P.
AU - Naish, Darren
PY - 2007
TI - An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England
JO - Palaeontology
VL - 50
IS - 6
SP - 1547
EP - 1564
ID - doi:10.1111/j.1475-4983.2007.00728.x
ER -

There -- was that really so hard? I'm not saying this is a good format, but it does at least allow you to Say What You Mean, Simply And Directly.

16. Pingback: Semantic mapping is hard | The Reinvigorated Programmer

17. Pingback: links for 2010-10-29 << sonofbluerobot

18. Dan Brickley (@danbri) | May 7, 2013 at 1:16 pm

How about this? https://gist.github.com/danbri/5532479

19. Mike Taylor | May 7, 2013 at 11:44 pm

Har!

20. Pingback: spitting my tea and thinking of Grant Campbell. | librarian @large

21. Mike Curtis | November 22, 2015 at 5:36 pm

Best written, most depressing article I've read about libraries and metadata, sigh!

22.
Mike Taylor | November 22, 2015 at 5:38 pm

Thanks for those, I guess, kind words :-)