2025-04-30 - Z39.50 Library Science For Dummies
===============================================

I enjoyed reading these blog posts from Wolfram Schneider titled Z39.50
For Dummies, originally posted in 2009 and 2010. They introduce Unix
tools for working with Z39.50 and MARC records, and describe how to
download the Library of Congress card catalog for a local, offline,
searchable database.

Contents
========

* Z39.50 for Dummies
* Z39.50 for Dummies - Part 1
* Z39.50 for Dummies - Part 2
* Z39.50 for Dummies - Part 3
* Z39.50 for Dummies - Part 4
* Z39.50 for Dummies - Part 5

Z39.50 for Dummies
==================

by Wolfram Schneider on 2009-08-27

One of the things Index Data is known for is the YAZ toolkit--an open
source programmers' toolkit supporting the development of
Z39.50/SRW/SRU clients and servers. The first release was in 1995 and
I've been using it for my own metasearch engine ZACK Gateway since
1998, long before I joined Index Data.

YAZ toolkit
ZACK Gateway (defunct)

Z39.50 is a client-server protocol for searching and retrieving
information from remote computer databases. It is a mature low-level
protocol like HTTP and FTP. You don't implement Z39.50 yourself; you
use the YAZ utilities and the libraries and frameworks available for
other languages (C++, PHP, Perl, etc.).

Many people think that Z39.50 is a dead standard and hard to
understand. That is not true. Z39.50 is still growing in use, stable
and very fast. It is the only widely available protocol for
metasearch.

Using Z39.50 is not harder than using FTP. I think that the overhead
for learning Z39.50 is less than half a day for an experienced
programmer. Any problem you run into later is related not to the
Z39.50 protocol itself but to the underlying system behind the Z39.50
server.

Keep in mind that Z39.50 is an API to access (bibliographic)
databases. It does not define how the data is structured and indexed
in the database.

Z39.50 for Dummies Series - Part 1
==================================

I will now start a Z39.50 for Dummies series and show some examples of
how to access a remote database. In the following demos I'm using the
zoomsh program from the YAZ toolkit.

zoomsh

Let's start with a simple question: does the Library of Congress have
the book "library mashups"? (I strongly recommend you buy this
book--I wrote chapter 19):

    $ zoomsh "connect z3950.loc.gov:7090/voyager" \
        'search "library mashups"' quit
    z3950.loc.gov:7090/voyager: 2 hits

That's all! Only one line on the command line. An SRU or SOAP request
would not be shorter.

Now, retrieve the record:

    $ zoomsh "connect z3950.loc.gov:7090/voyager" \
        'search "library mashups"' "show 0 1" "quit"
    z3950.loc.gov:7090/voyager: 2 hits
    0 database=VOYAGER syntax=USmarc schema=unknown
    02438cam 22003018a 4500
    001 15804854
    005 20090710141909.0
    008 090706s2009 nju b 001 0 eng
    906 $a 7 $b cbc $c orignew $d 1 $e ecip $f 20 $g y-gencatlg
    925 0 $a acquire $b 2 shelf copies $x policy default
    955 $b rg11 2009-07-06 $i rg11 2009-07-06 $a rg11 2009-07-08 to \
        Policy (CLED/SHED) $a td04 2009-07-09 to Dewey $w rd14 2009-07-10
    010 $a 2009025999
    020 $a 9781573873727
    040 $a DLC $c DLC
    050 00 $a Z674.75.W67 $b L52 2009
    082 00 $a 020.285/4678 $2 22
    245 00 $a Library mashups : $b exploring new ways to deliver library data / $c edited by Nicole C. Engard.
    260 $a Medford, N.J. : $b Information Today, Inc., $c c2009.
    263 $a 0908
    300 $a p. cm.
    504 $a Includes bibliographical references and index.
    505 0 $a What is a mashup? / Darlene Fichter -- Behind the scenes \
        : some technical details on mashups / Bonaria Biancu -- Making \
        your data available to be mashed up / Ross Singer -- Mashing up \
        with librarian knowledge / Thomas Brevik -- Information in \
        context / Brian Herzog -- Mashing up the library website / \
        Lichen Rancourt -- Piping out library data / Nicole C. Engard -- \
        Mashups @ Libraries interact / Corey Wallis -- Library catalog \
        mashup : using Blacklight to expose collections / Bess Sadler, \
        Joseph Gilbert, and Matt Mitchell -- Breaking into the OPAC / \
        Tim Spalding -- Mashing up open data with biblios.net Web \
        services / Joshua Ferraro -- SOPAC 2.0 : the thrashable, \
        mashable catalog / John Blyberg -- Mashups with the WorldCat \
        Affiliate Services / Karen A. Coombs -- Flickr and digital image \
        collections / Mark Dahl and Jeremy McWilliams -- Blip.tv and \
        digital video collections in the library / Jason A. Clark -- \
        Where's the nearest computer lab? : mapping up campus / Derik A. \
        Badman -- The repository mashup map / Stuart Lewis -- \
        The LibraryThing API and libraries / Robin Hastings -- ZACK \
        bookmaps / Wolfram Schneider -- Federated database search mashup \
        / Stephen Hedges, Laura Solomon, and Karl Jendretzky -- \
        Electronic dissertation mashups using SRU / Michael C. Witt.
    650 0 $a Mashups (World Wide Web) $x Library applications.
    650 0 $a Libraries and the Internet.
    650 0 $a Library Web sites $x Design.
    650 0 $a Web site development.
    700 1 $a Engard, Nicole C., $d 1979-
    963 $a Amy Reeve; phone: 609-654-6266; email: areeve @ \
        infotoday.com; bc: nellor @ infotoday.com

The default exchange format for bibliographic records in Z39.50 is
MARC21. That is probably not something you want to parse yourself.

OK, now let's download the record in XML format:

    $ zoomsh "connect z3950.loc.gov:7090/voyager" \
        'search "library mashups"' "show 0 1 xml" "quit"
    z3950.loc.gov:7090/voyager: 2 hits
    0 database=VOYAGER syntax=USmarc schema=unknown
    02438cam a22003018a 4500 15804854 20090710141909.0 090706s2009 nju b 001 0 eng \
    7 cbc orignew 1 ecip 20 y-gencatlg
    [large XML output...]

You can parse the XML output with your favorite tools, usually an XSLT
style sheet.

Next time I will show you how to run a metasearch in one line.
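If XSLT is more than you need, the line format shown earlier can also be picked apart with standard Unix tools. The following is a minimal sketch, not part of YAZ: marc_subfield and sample_record are hypothetical helper names, the sample lines are trimmed from the record above, and the code assumes that every subfield starts with "$" plus a one-letter code and that values do not themselves contain " $".

```shell
#!/bin/sh
# marc_subfield TAG CODE: print the value of subfield CODE for every
# field TAG found in line-format MARC read from standard input.
# Hypothetical helper, not part of the YAZ toolkit.
marc_subfield() {
    awk -v tag="$1" -v code="$2" '
        ($1 "") == (tag "") {              # compare tags as strings, keeping leading zeros
            marker = "$" code " "
            start = index($0, marker)
            if (start == 0) next           # field has no such subfield
            value = substr($0, start + length(marker))
            stop = index(value, " $")      # the next subfield ends the value
            if (stop > 0) value = substr(value, 1, stop - 1)
            print value
        }'
}

# A few fields from the record above, standing in for real zoomsh output.
sample_record() {
    cat <<'EOF'
010 $a 2009025999
020 $a 9781573873727
245 00 $a Library mashups : $b exploring new ways to deliver library data / $c edited by Nicole C. Engard.
EOF
}

sample_record | marc_subfield 020 a
sample_record | marc_subfield 245 b
```

With a live target, the same function could presumably be fed from the "show 0 1" command used above, as long as each field arrives on a single line.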
-Wolfram

UPDATE: The latest release of YAZ, inspired by this blog post,
supports client-side mapping of MARC to MARCXML, so you can dump XML
records even from targets that do not support XML.

Z39.50 for Dummies - Part 2
===========================

In the last blog post, Z39.50 for Dummies, I gave an introduction to
using the zoomsh program to access the Z39.50 server of the Library of
Congress. Today I will show you how to run a simple metasearch on the
command line.

You want to know which library has the book with the ISBN
0-13-949876-1 (UNIX network programming / W. Richard Stevens)? You can
run zoomsh in a shell loop. Put the list of databases (zURLs), one per
line, in the text file zurl.txt:

    z3950.loc.gov:7090/voyager
    melvyl.cdlib.org:210/CDL90
    library.ox.ac.uk:210/ADVANCE
    z3950.library.wisc.edu:210/madison

and run a little loop in a shell script:

    $ for zurl in `cat zurl.txt`
      do
        zoomsh "connect $zurl" "search @attr 1=7 0-13-949876-1" "quit"
      done
    z3950.loc.gov:7090/voyager: 0 hits
    melvyl.cdlib.org:210/CDL90: 1 hits
    library.ox.ac.uk:210/ADVANCE: 1 hits
    z3950.library.wisc.edu:210/madison: 0 hits

Of course it takes time to run one search request after another. How
about a parallel search? Modern xargs(1) implementations on BSD-based
operating systems (macOS, FreeBSD) and GNU xargs can run several
processes at a time. This example runs up to 2 search requests at a
time and is about twice as fast as the shell loop above:

    $ xargs -n1 -P2 perl -e 'exec "zoomsh", "connect $ARGV[0]",
        "search \@attr 1=7 0-13-949876-1", "quit"' < zurl.txt
    melvyl.cdlib.org:210/CDL90: 1 hits
    library.ox.ac.uk:210/ADVANCE: 1 hits
    z3950.loc.gov:7090/voyager: 0 hits
    z3950.library.wisc.edu:210/madison: 0 hits

Note that the order of the responses is different: the fastest
database wins and is displayed first.

I think it is safe to run up to 20 searches in parallel on modern
hardware. Note that there is a lot of process overhead here: each
request executes 2 processes.
If a connection hangs, you must wait until it times out.

This was an example of how easy it is to run your own metasearch on
the command line. If you want to set up a real metasearch for your
organization, I recommend trying our metasearch middleware pazpar2,
featuring merging, relevance ranking, record sorting, and faceted
results.

In a nutshell, pazpar2 is a web-oriented Z39.50 client. It will search
a lot of targets in parallel and provide on-the-fly integration of the
results. The interface is entirely webservice-based, and you can use
it from any development environment.

pazpar2 home page

Z39.50 for Dummies Series - Part 3
==================================

This is part 3 of the Z39.50 for Dummies series. In the first part I
explained what Z39.50 is and how to run a simple search. In the second
part I showed how to run a simple metasearch on the command line.

I searched for the book UNIX network programming / W. Richard Stevens,
ISBN 0-13-949876-1, in four large libraries:

    $ for zurl in `cat zurl.txt`
      do
        zoomsh "connect $zurl" "search @attr 1=7 0-13-949876-1" "quit"
      done
    z3950.loc.gov:7090/voyager: 0 hits
    melvyl.cdlib.org:210/CDL90: 1 hits
    library.ox.ac.uk:210/ADVANCE: 1 hits
    z3950.library.wisc.edu:210/madison: 0 hits

Only 2 out of 4 libraries own this must-have book. Can this be true?
Well, let's modify the ISBN and search without dashes ('-'):

    $ for zurl in `cat zurl.txt`
      do
        zoomsh "connect $zurl" "search @attr 1=7 0139498761" "quit"
      done
    z3950.loc.gov:7090/voyager: 1 hits
    melvyl.cdlib.org:210/CDL90: 1 hits
    library.ox.ac.uk:210/ADVANCE: 1 hits
    z3950.library.wisc.edu:210/madison: 1 hits

Bingo--every library has a copy of UNIX network programming by
W. Richard Stevens!

Z39.50 defines the syntax to search in a database. It does not define
the semantics of a search, such as how an ISBN is structured. If you
build a search engine on top of Z39.50, you need an additional layer
to handle the semantics of a search for each database.
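Such a per-database layer can start very small. The following is a sketch under stated assumptions: normalize_isbn and isbn_for_target are hypothetical helper names, and the list of targets that need dash-free ISBNs is taken only from the two result listings above; real targets may need other rules.

```shell
#!/bin/sh
# normalize_isbn ISBN: strip dashes and spaces from an ISBN.
normalize_isbn() {
    printf '%s\n' "$1" | tr -d ' -'
}

# isbn_for_target ZURL ISBN: print the ISBN in the form the target expects.
# The case patterns are assumptions drawn from the hit counts above.
isbn_for_target() {
    case "$1" in
        z3950.loc.gov:*|z3950.library.wisc.edu:*)
            normalize_isbn "$2" ;;      # these returned 0 hits with dashes
        *)
            printf '%s\n' "$2" ;;       # pass the query through unchanged
    esac
}

isbn_for_target z3950.loc.gov:7090/voyager 0-13-949876-1
isbn_for_target melvyl.cdlib.org:210/CDL90 0-13-949876-1
```

In the shell loop, the search argument would then become "search @attr 1=7 $(isbn_for_target "$zurl" 0-13-949876-1)". Since all four targets found the book without dashes, stripping them everywhere would also work here; the per-target function only shows where such rules could live.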
You also need this layer to add workarounds for broken
implementations. In the example above, we must remove the dashes in an
ISBN search for the Library of Congress and the University of
Wisconsin-Madison Libraries.

Another thing you must be aware of: for historical reasons, libraries
use different character sets: UTF-8, ISO8859-1, ISO5426, and MARC-8.
You must convert your search query to the right character set for each
library, both for searching and for retrieving the records.

In this article I described the challenges of running a metasearch on
top of Z39.50. All these problems are due to the underlying databases
and not Z39.50--you will have the same problems if you use a web-based
XML service such as SRU or a proprietary, vendor-based API.

The truth is that running a metasearch is not a trivial task.

Z39.50 for Dummies - Part 4
===========================

Libraries store and exchange bibliographic data in MARC records. MARC
stands for MAchine-Readable Cataloging. The format was developed at
the Library of Congress (LoC) beginning in the 1960s.

MAchine-Readable Cataloging record
Library of Congress

A dump of the LoC catalog (and of other libraries) is available at the
Internet Archive in the collection marcrecords. The LoC catalog dump
is split into 29 files, part01.dat to part29.dat. Each file is roughly
200MB.

LoC catalog dump

The great news is that the data from LoC is public domain (already
paid for by the US taxpayers, thank you!) and you can use the data for
your own system.

MARC Open-Access (2016)
MDSConnect datasets (2020)

Before you can import data, you must validate, convert, or fix the
bibliographic data. I will now show how you can do this with the Index
Data YAZ toolkit.

The YAZ toolkit contains the program yaz-marcdump to dump MARC
records.
yaz-marcdump

Called without an option, yaz-marcdump prints the records in line
format:

    $ yaz-marcdump part01.dat | more
    00720cam 22002051 4500
    001 00000002
    003 DLC
    005 20040505165105.0
    008 800108s1899 ilu 000 0 eng
    010 $a 00000002
    035 $a (OCoLC)5853149
    040 $a DLC $c DSI $d DLC
    050 00 $a RX671 $b .A92
    100 1 $a Aurand, Samuel Herbert, $d 1854-
    245 10 $a Botanical materia medica and pharmacology; $b drugs \
        considered from a botanical, pharmaceutical, physiological, \
        therapeutical and toxicological standpoint. $c By S. H. Aurand.
    260 $a Chicago, $b P. H. Mallen Company, $c 1899.
    300 $a 406 p. $c 24 cm.
    500 $a Homeopathic formulae.
    650 0 $a Botany, Medical.
    650 0 $a Homeopathy $x Materia medica and therapeutics.
    [...]

First, convert the MARC21 records from MARC-8 encoding to MARC21 in
UTF-8 encoding:

    $ yaz-marcdump -f marc-8 -t utf-8 -o marc part01.dat > part.mrc

For MARC21, leader offset 9 tells whether the record is really MARC-8
(almost always the case) or UTF-8. A MARC21 record in UTF-8 must have
'a' (value 97) at position 9. For this reason, the option -l for
yaz-marcdump may come in handy:

    $ yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 part01.dat \
        > part.mrc

If you prefer MARCXML instead of MARC21 records, you can convert the
records:

    $ yaz-marcdump -o marcxml -f MARC-8 -t UTF-8 part01.dat \
        > part.marcxml
    00720cam a22002051 4500 00000002 DLC 20040505165105.0 800108s1899 ilu 000 0 eng 00000002 (OCoLC)5853149
    [...]

The Library of Congress has over 7 million records. That is a lot of
data: 5.6GB of raw data in total. Compressed, it is only 1.7GB. To
convert compressed data, run yaz-marcdump in a UNIX pipe:

    $ zcat part01.dat.gz | yaz-marcdump -f MARC-8 -t UTF-8 \
        -o marcxml /dev/stdin > part01.marcxml

You can search a MARC dump with the UNIX grep tool:

    $ yaz-marcdump -f marc-8 -t utf-8 part01.dat | grep Sausalito
    260 $a Sausalito, Calif. : $b University Science Books, $c 2000.
    260 $a Sausalito, Calif. : $b Math Solutions Publications, \
        $c c2000.
    260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
    260 $a Sausalito, Calif. : $b University Science Books, \
        $c c2002.
    260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
    260 $a Sausalito, CA : $b Toland Communications, $c c2000.
    260 $a Sausalito, CA : $b In Between Books, $c 2001.
    [...]

The yaz-marcdump tool supports the character sets UTF-8, MARC-8,
ISO8859-1, ISO5426 and some other encodings. For more information, see
the yaz-iconv manual pages.

yaz-iconv

In this article I showed how to validate, convert, or fix
bibliographic data dumped in MARC format. Next time I will show some
advanced examples of how to analyze MARC records on modern standard PC
hardware.

Z39.50 for Dummies - Part 5
===========================

In this article I will show you how to analyze MARC data on modern PC
hardware.

PCs are very fast now and incredibly cheap. You can rent a quad-core
Intel machine with 8GB RAM and unlimited traffic for 40 Euro/month
(+VAT) in a data center.

If the computer is fast enough, you don't have to spend too much time
on complex algorithms. You can use the raw power of your computer and
take a brute-force approach.

In the following example I will use the 7 million records from a dump
of the Library of Congress (LoC) catalog. For details, please read the
previous article, Z39.50 for Dummies - Part 4.

    $ for i in *.dat; do
        yaz-marcdump -f marc-8 -t utf-8 -o line "$i"
      done > loc.txt

    $ du -hs loc.txt
    4.9G

The line dump of the LoC catalog is 4.9GB and fits into main
memory--great!

    # count for the last name "Calaminus"
    $ egrep -c Calaminus loc.txt
    4

4 hits; the search took 4 seconds of real time.

    # count records with ISBN number
    $ egrep -c '^020' loc.txt
    3999863

There are nearly 4 million ISBNs (out of 7 million records). The
search took 11 seconds.

    # count URLs
    $ egrep -c http:// loc.txt
    265540

There are 265,540 URLs in the LoC records.
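Each egrep call above reads the whole file again. As a sketch of the same brute-force idea, a single awk pass can tally every tag at once. tag_counts and sample_dump are hypothetical helper names, the sample lines are shortened from the records shown in this series and stand in for loc.txt, and the code assumes line format with the three-digit tag as the first word of each field line.

```shell
#!/bin/sh
# tag_counts: read line-format MARC on standard input and print
# "TAG COUNT" for every three-digit tag, sorted by tag.
tag_counts() {
    awk '$1 ~ /^[0-9][0-9][0-9]$/ { count[$1]++ }
         END { for (tag in count) print tag, count[tag] }' | sort
}

# Stand-in for loc.txt: two shortened records, each led by its leader line.
sample_dump() {
    cat <<'EOF'
00720cam 22002051 4500
001 00000002
650 0 $a Botany, Medical.
650 0 $a Homeopathy $x Materia medica and therapeutics.
02438cam 22003018a 4500
001 15804854
020 $a 9781573873727
650 0 $a Libraries and the Internet.
EOF
}

sample_dump | tag_counts
```

Run as tag_counts < loc.txt, the 020 row should correspond to the egrep count for '^020', and the other rows come out of the same pass.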
    # check for subject headings for the city of
    # Sausalito, California using a regular expression
    $ egrep -c '^[67][0-9][0-9].*Sausalito' \
        loc.txt
    19

There are 19 subject headings for Sausalito.

    # search with a typo in the name (a => o)
    $ egrep Sausolito loc.txt

No hits, due to the typo in the name. Try it with agrep, a grep
program with approximate matching capabilities:

    $ agrep -c -1 Sausolito loc.txt
    282

282 hits; the search took 8 seconds.

agrep

The examples above are for software developers and experienced
librarians. They are helpful for a quick check of your bibliographic
records, for data mining and analysis, or to double-check that your
indexer works correctly.

If you want to set up a public system for end users, you of course
need a real full-text engine such as our zebra software.

zebra

From:
tags: article,technical,unix