[HN Gopher] 1.3B Worldcat scrape and data science mini-competition
       ___________________________________________________________________
        
       1.3B Worldcat scrape and data science mini-competition
        
       Author : crtasm
       Score  : 183 points
       Date   : 2023-10-04 12:22 UTC (10 hours ago)
        
 (HTM) web link (annas-blog.org)
 (TXT) w3m dump (annas-blog.org)
        
       | mannyv wrote:
       | It's unclear what exactly the competition is about. Just to poke
       | around the dataset?
        
         | mannyv wrote:
         | As a note, I wish I had enough space to mirror their library.
         | Looking at this brings out the collector in me...a tendency
         | that I've successfully suppressed. You can only keep so many
         | terabytes of archive around.
        
       | sterlind wrote:
       | It's infuriating seeing non-profits gatekeep datasets that were
       | compiled with grant money. At least Elsevier doesn't present
       | itself as a charity.
       | 
       | I was recently trying to get my hands on the Switchboard and
       | Fisher conversational speech datasets. Both were funded by DARPA
       | grants, and maintained by the non-profit LDC, which charges you
       | thousands of dollars for access (and no discounts for individual
       | researchers) - that is, if they'll even pay attention to you
       | without a .edu email address. And both are standard corpora in
       | the field of audio NLP, which makes replicating studies
       | impossible.
       | 
       | Sadly, I couldn't find any way to pirate the datasets - they're
       | too niche. So I applaud the authors for sticking it to Worldcat
       | and scraping their data.
        
         | 3abiton wrote:
          | I hope one day there will be a piratebay for datasets (The
          | Pile) and AI models (Llama).
        
       | hedora wrote:
       | I looked into using Dewey Decimal for a hobby project. OCLC has a
        | _de facto_ monopoly on it due to the Worldcat database. They're
       | a non-profit, but they're supported by having libraries pay a
       | subscription fee for Worldcat.
       | 
       | Back when OCLC was founded, the idea that people would want to
       | have a copy of a card catalog for personal use was laughable, so
       | I'm sympathetic to the people that set up their funding model.
       | It's far cheaper for a library to subscribe to Worldcat than to
       | hire a team to maintain such a database, so it created a win-win
       | situation.
       | 
       | However, keeping the world's books' metadata a secret (and
       | leaving control in the hands of a monopoly) is an anachronism.
       | 
       | It's well past the time when someone (such as an international
       | coalition of Libraries of Congress) should figure out how to
       | sustainably fund OCLC while also releasing their work into the
       | public domain.
        
         | mgr86 wrote:
          | I have been told that my organization developed a system,
          | HABS, that pre-dated OCLC [0], and that OCLC used this system
          | as an inspiration. However, I cannot confirm this. The closest
          | I can get is a footnote thanking Fred Kilgour, the founder of
          | OCLC [1]. I should reach out to Koh, a friend of a friend,
          | while she is still alive, to confirm the story. Nevertheless,
          | we have a collection of punch cards in a dusty attic room that
          | was once the HABS system. I think it is a pretty fascinating
          | legacy, and I wish it were better preserved.
         | 
         | [0]
         | https://journals.sagepub.com/doi/pdf/10.1177/106939716900400...
          | [1] https://journals.sagepub.com/doi/abs/10.1177/106939717300800
         | ...
        
           | gmcharlt wrote:
           | Neat! I hope you can learn more from Koh.
           | 
           | I know a bit about Henriette Avram and her work at LoC
           | developing MARC, but it of course makes sense that other
           | libraries were thinking along similar lines at the time.
        
         | actuallyalys wrote:
         | I suppose one way to do it would be to allow patrons of
         | subscriber libraries to access the database dumps and API.
         | 
          | The downside is that it would still make access harder than
          | necessary and leave some people out. The upside is that it's
          | not that much of a change from their existing model. I'm sure
          | there would also be concerns about database dumps being shared
          | publicly, although Anna's Archive has already released their
          | entire database, and I suspect most people who would pay for
          | formal access wouldn't use an unauthorized copy. Ultimately, I
          | suspect OCLC would still be resistant to this change, as it
          | would feel like a huge shift, even if I'm not sure it would
          | change much from their perspective.
        
         | empthought wrote:
         | OCLC is a nonprofit membership cooperative and would argue that
         | it itself is that international coalition of national libraries
         | and archives.
        
           | IKantRead wrote:
           | OCLC is a parasitic company masquerading as a "membership
           | cooperative".
           | 
            | Libraries (often publicly funded) produce all the work, OCLC
            | claims ownership of the results of that work, and libraries
            | pay to get it back (though they do get a discount if they
            | contribute).
           | 
           | The only reason OCLC continues to exist is because libraries
           | don't have the support or resources to fight them. It's very
           | similar to the Elsevier issue in academic publishing, but
           | OCLC does a better job with PR.
        
             | empthought wrote:
             | I mean you've clearly read Aaron Swartz's diatribes, but
             | you also clearly have no clue about OCLC's business model.
             | The catalog data is intellectually interesting, but the
             | value is in the holdings data and more importantly the
             | interlibrary loan service it enables.
             | 
             | OCLC is exactly what happens when the libraries want to
             | avoid another EBSCOhost or Proquest situation with ILL.
        
         | wayathr0w wrote:
         | [dead]
        
       | neilv wrote:
        | How does Anna's Archive keep all their lawyers from quitting?
       | 
       | > _Even though OCLC is a non-profit, their business model
       | requires protecting their database. Well, we're sorry to say,
       | friends at OCLC, we're giving it all away. :-) [...]_
       | 
       | > _This included a substantial overhaul of their backend systems,
       | introducing many security flaws. We immediately seized the
        | opportunity, and were able to scrape hundreds of millions (!) of
       | records in mere days. [...]_
       | 
       | > _PS: We do want to give a genuine shout-out to the Worldcat
       | team. Even though it was a small tragedy that your data was
       | locked up, you did an amazing job at getting 30,000 libraries on
       | board to share their metadata with you. As with many of our
       | releases, we could not have done it without the decades of hard
       | work you put into building the collections that we now liberate.
       | Truly: thank you._
        
         | Sebguer wrote:
         | You realize they're not a business, right?
        
         | RhodesianHunter wrote:
          | For real. Openly bragging about exploiting security flaws to
          | scrape data en masse, which undoubtedly put massive strain on
          | back-end systems, is a far cry from what is generally
          | considered legal (politely scraping public information).
        
           | chatmasta wrote:
           | I think this is the least of their concerns considering the
           | rest of their activities. I guess they've got a sort of
           | pirate's privilege in that they can openly brag about this
           | stuff since they're already starting from the point of openly
            | flouting the law.
           | 
           | Also, I wouldn't be surprised if there simply _are_ no
            | lawyers working at, for, or with Anna's Archive.
        
       | harveywi wrote:
       | >Over the past year, we've meticulously scraped all Worldcat
       | records. At first, we hit a lucky break. Worldcat was just
       | rolling out their complete website redesign (in Aug 2022). This
       | included a substantial overhaul of their backend systems,
       | introducing many security flaws. We immediately seized the
        | opportunity, and were able to scrape hundreds of millions (!) of
       | records in mere days.
       | 
       | >After that, security flaws were slowly fixed one by one, until
       | the final one we found was patched about a month ago. By that
       | time we had pretty much all records, and were only going for
       | slightly higher quality records.
       | 
       | OCLC carelessly fiddlefarted around with their moat and lost it.
       | Poof!
        
         | RhodesianHunter wrote:
          | I don't think anyone is (legally) going to prop up a business
          | or non-profit using data that was admittedly taken from them
          | by exploiting their security holes.
        
       | itissid wrote:
        | Noob question: Isn't this going to be a great source for
        | training language models? Is it safe to assume that
        | OpenAI/Google/Meta etc. already have these?
       | 
       | In any case great work!
        
         | m00dy wrote:
          | Yes, of course. AA has a special program for LLM developers.
        
         | crtasm wrote:
          | Worldcat is a database of book metadata, not the books
          | themselves. The summary and description text might be useful
          | though?
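          | 
          | If you do want to poke at the release for that kind of text,
          | here's a minimal sketch for skimming it. I'm assuming (not
          | verified) that the records are newline-delimited JSON
          | compressed with zstd; the "title" and "summary" field names
          | are hypothetical, so adjust to the actual schema in the
          | torrent.
          | 
          |     import io, json
          |     import zstandard  # pip install zstandard
          | 
          |     def iter_records(path):
          |         # yield one JSON object per line of a .jsonl.zst file
          |         dctx = zstandard.ZstdDecompressor()
          |         with open(path, "rb") as fh:
          |             reader = dctx.stream_reader(fh)
          |             text = io.TextIOWrapper(reader, encoding="utf-8")
          |             for line in text:
          |                 if line.strip():
          |                     yield json.loads(line)
          | 
          |     # hypothetical filename; use the real file from the torrent
          |     for rec in iter_records("worldcat.jsonl.zst"):
          |         print(rec.get("title"))
          |         print((rec.get("summary") or "")[:200])
          |         break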
        
         | mannyv wrote:
         | If you could somehow download the entire archive you could feed
         | it into your LLM for training. This is a huge corpus and is
         | sort of ill-gotten. That said, it would be pretty awesome.
         | 
          | Google has this sort of thing already, since they have that
          | whole "let's digitize the world's books" project. It's
          | interesting that Google never developed a ChatGPT first, given
          | that they literally have a large amount of the world's books
          | digitized.
        
           | crtasm wrote:
           | Google launched Bard earlier this year.
        
             | mannyv wrote:
             | Yes, but why weren't they first?
        
               | wordpad25 wrote:
               | It's like asking why wasn't X invented earlier.
               | 
               | Google and everyone else had no idea how successful LLMs
               | could be until OpenAI did it.
        
       | freewizard wrote:
        | ISBN is the default ID when it comes to book-related projects;
        | it is convenient, but not without its caveats. An often
        | overlooked fact is that the ISBN was introduced in the late
        | 1960s, so books published before then obviously do not have one.
        | Nor did all countries adopt the ISBN from day one: some, like
        | China, used their own catalog systems until the 1980s. And
        | because ISBNs are usually centrally managed by government or
        | commercial agencies, censorship for political or commercial
        | reasons is not uncommon; some books were never able to get
        | published, or only saw the world without an ISBN.
        | 
        | For obvious reasons, older, non-English, and suppressed books
        | are the ones that need the most care when it comes to
        | preservation.
        
         | wiml wrote:
          | A second issue is that ISBNs identify a specific SKU
          | (different formats will have different ISBNs, different
          | printings may even get different ISBNs, etc.), but
          | book-related projects typically want some way to identify "the
          | same book" across all these different formats, printings, and
          | sometimes even editions, translations, and collections. OCLC
          | IDs identify a different space than ISBNs do.
        
           | raybb wrote:
            | What you're describing is sometimes referred to as a "work"
            | vs. an "edition":
           | 
           | https://openlibrary.org/help/faq/editing#work-edition
        
             | ahi wrote:
              | It gets much, much more complicated than that. There are
              | never-ending discussions about FRBR: Functional
              | Requirements for Bibliographic Records.
             | https://www.ifla.org/references/best-practice-for-
             | national-b...
             | 
             | * Work is defined as the intellectual or artistic content
             | of a distinct creation. It refers to a very abstract idea
             | of a creation e.g. Shakespeare's Romeo and Juliet and not a
             | specific expression.
             | 
             | * Expression is the intellectual or artistic realization of
             | a work. The realization may take the form of text, sound,
             | image, object, movement, etc., or any combination of such
             | forms.
             | 
             | * Manifestation is the embodiment of an expression of a
             | work. For example a particular edition of a book or a
             | specific music recording.
             | 
              | * Item is a single exemplar of a manifestation.
              | Cataloguing is generally done based on an item directly
              | available to a cataloguer.
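              | 
              | A rough sketch of how those four levels nest, if it helps
              | make it concrete (plain Python dataclasses; the field
              | names are illustrative, not any standard schema):
              | 
              |     from dataclasses import dataclass
              |     from typing import Optional
              | 
              |     @dataclass
              |     class Work:           # the abstract creation
              |         title: str        # e.g. "Romeo and Juliet"
              |         creator: str
              | 
              |     @dataclass
              |     class Expression:     # a realization of a work
              |         work: Work
              |         form: str         # "text", "sound", ...
              |         language: str
              | 
              |     @dataclass
              |     class Manifestation:  # a specific edition/recording
              |         expression: Expression
              |         publisher: str
              |         year: int
              |         isbn: Optional[str] = None  # IDs attach here
              | 
              |     @dataclass
              |     class Item:           # one copy on one shelf
              |         manifestation: Manifestation
              |         barcode: str
              |         holding_library: str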
        
       | DoctorOetker wrote:
       | > We scraped ISBNdb, and downloaded the Open Library dataset, but
       | the results were unsatisfactory. The main problem was that there
       | was not a ton of overlap of ISBNs.
       | 
        | What prevents ISBN collisions between authors? Is there a
        | central authority assigning them, or is there, say, a national
        | prefix, with each government assigning ISBNs for local
        | publications (perhaps delegating this to another body in that
        | nation)?
        | 
        | Surely such bodies would have the most complete view of all this
        | data.
        | 
        | It's also bizarre that this simple metadata is not available
        | from whatever authority assigns ISBNs.
        
         | yorwba wrote:
         | Companies can get a range of ISBNs before deciding what to
         | publish under each ISBN, or whether to publish something at
         | all. So the authorities assigning ISBNs don't necessarily know
         | what they're being used for.
        
         | sevenseventen wrote:
          | There are regional ISBN agencies. The US agency, Bowker,
          | assigns ISBN prefixes to publishers, and publishers assign
          | within their prefix as they please. They're supposed to use
          | one ISBN per edition and format, but many publishers use ISBNs
          | as a kind of SKU, so you can't 100% count on that.
         | 
         | If that sounds sloppy...I went to publishing conferences fairly
         | regularly from the late 90's into the teens, and I never saw a
         | program that didn't have at least one session or panel titled
         | something like "Publishers must improve their metadata
         | practices."
        
           | jszymborski wrote:
           | Sounds a bit like how DOIs are assigned.
        
           | bambax wrote:
           | It's exactly like this. Publishers get a range, and do
           | whatever they want with it.
           | 
            | Also, some agencies sell number ranges to publishers (I
            | believe the US is one of them) and others give them away at
            | no cost (France). As a result, some small publishers get
            | ISBNs from agencies more liberal than their own country's:
            | one can never be 100% sure a French ISBN matches a French
            | publisher, for example.
           | 
           | It's also possible some publishers re-use old numbers or
           | affix the same number to different releases/editions of a
           | book.
           | 
           | It's a mess.
           | 
           | But a global centralized system would probably be way worse,
           | so we have to live with that mess.
        
             | amelius wrote:
             | Why not use UUIDs?
             | 
                | People never enter ISBNs manually anyway, so it might as
                | well be a longer string. Or a QR code.
        
               | dwringer wrote:
               | As someone who's worked in the field of used books, I can
               | say from personal experience that's not quite true. It's
               | pretty common to have to type in an ISBN for a variety of
               | reasons. Many times the barcodes have been covered up or
               | defaced, and many publications don't have barcodes in the
               | first place.
        
               | amelius wrote:
               | It was true, for sure. But the question is, is it still
               | true?
        
               | bambax wrote:
               | Schools often ask for a specific edition of a classic
               | book and the only way to be reasonably sure you're buying
               | the correct one is to search by ISBN.
        
               | amelius wrote:
               | Well, just copy and paste from the email you got from
               | school.
               | 
               | I mean, if IT can do anything, it is to solve this
               | problem.
        
         | skuxxlife wrote:
          | For the US/UK/NZ/Aus/SA, ISBNs are granted through Bowker,
          | which maintains its "Books In Print" data set that, in theory,
          | contains metadata for all of the ISBNs it has granted. In
          | practice, though, it's a mess. It's expensive to access and
          | relies on publishers to enter accurate and consistent
          | metadata, which is...variable in quality, to say the least.
          | Often publishers buy blocks of ISBNs to use later, so no
          | metadata is entered up front and it has to be pushed back to
          | Bowker at a later date. To be somewhat fair to Bowker, the
          | history of ISBNs far predates modern data standards, and I can
          | imagine wrangling publishers to get accurate data is a
          | difficult task. But on the other hand, you'd think they'd have
          | a lot to gain from doing it right. As someone who runs a book
          | website, it is endlessly frustrating.
        
         | gmcharlt wrote:
         | ISBNs are messy.
         | 
         | The International ISBN Agency coordinates assigning ISBN ranges
         | to national agencies, who in turn will assign subranges to
         | publishers. The publishers in turn assign specific numbers to
         | their own works. However, the international agency does not
         | itself maintain a universal database of assigned ISBNs - the
         | most it operates is a global database of publishers and their
         | assigned ranges. And since it's the publishers who are
         | assigning numbers from their allocations, various errors can
         | crop up, including reusing ISBNs for different works and
         | failing to issue distinct ISBNs for different formats. (For
          | example, if you publish hardcover, paperback, and ebook
         | versions of a book, you should assign three ISBNs. That rule is
         | not always observed.)
         | 
         | Also, libraries hold many books that long predate ISBNs; it
         | wasn't until 1965 that the immediate predecessor of the ISBN,
         | the SBN, was a twinkle in a bookseller's eye.
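          | 
          | One small saving grace when wrangling them: the final digit of
          | an ISBN-13 is a checksum (alternating weights of 1 and 3 over
          | the first twelve digits), so garbled or mistyped numbers can
          | at least be caught mechanically. A quick sketch in plain
          | Python, purely illustrative:
          | 
          |     def isbn13_check_digit(first12: str) -> int:
          |         # weights alternate 1,3; the check digit pads the
          |         # weighted sum out to a multiple of 10
          |         total = sum((3 if i % 2 else 1) * int(d)
          |                     for i, d in enumerate(first12))
          |         return (10 - total % 10) % 10
          | 
          |     def is_valid_isbn13(isbn: str) -> bool:
          |         digits = "".join(c for c in isbn if c.isdigit())
          |         if len(digits) != 13:
          |             return False
          |         return isbn13_check_digit(digits[:12]) == int(digits[-1])
          | 
          |     print(is_valid_isbn13("978-0-306-40615-7"))  # True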
        
           | bambax wrote:
           | Yes.
           | 
           | And while in most countries you can't properly publish a book
           | without an ISBN (ie, have it sold in bookshops), you can
           | publish a Kindle book without it (if you opt to only offer
           | the ebook).
           | 
           | That leaves a huge part of publications completely out of the
           | system. Kindle-only books are on Amazon servers and nowhere
           | else.
        
             | wayathr0w wrote:
             | [dead]
        
         | crtasm wrote:
          | There are lots of agencies; they hand out blocks of numbers
          | from their allocation. It seems there's no central database
          | for the metadata:
         | 
         | https://www.isbn-international.org/content/isbn-users-manual...
         | ISBN_FAQs_to_7ed_Manual_Absolutely_final.docx
         | 
         | > Will people in other countries be able to search for my books
         | in search engines in those countries?
         | 
         | > This does not happen automatically ... In order for your book
         | to be listed in other countries you should contact the
         | respective ISBN Agency and ask them for details of how to be
         | entered into their national catalogue for books in circulation
         | (books in print). Sometimes you will have to obtain a
         | distributor from that country or have an address in that
         | country before this is possible. In some circumstances in order
         | to be listed, the book must be in the language of that country.
         | As well as catalogues of books in circulation, you may also
         | want to ensure that you are listed by internet retailers... .
         | Again, you will need to contact each of these organisations
         | directly (including each separate international branch) with
         | details of your book.
        
       | pagekicker wrote:
       | This is theft.
        
         | arboles wrote:
         | Why wouldn't you download a car?
        
       | qingcharles wrote:
       | I can't get the .torrent file to work in my client. Can anyone
       | give me a magnet link for it?
       | 
       | I need the magazine ISSNs for my magazine encyclopedia.
       | 
       | edit: got the .torrent working in qTorrent
        
       | raybb wrote:
       | From the end:
       | 
       | > We do want to give a genuine shout-out to the Worldcat team.
       | Even though it was a small tragedy that your data was locked up,
       | you did an amazing job at getting 30,000 libraries on board to
       | share their metadata with you.
       | 
       | I wonder what the story is behind Worldcat getting so many
       | libraries across the world on board? I don't know much about the
       | software but it must be pretty compelling.
        
         | gmcharlt wrote:
         | It's not the software per se, which is generally fit for
         | purpose but not amazing, but the traditions and economics
         | underpinning how libraries maintain their bibliographic
         | metadata.
         | 
         | Libraries sharing metadata for their catalogs has a long
         | history, dating back to at least 1902 when the Library of
         | Congress started selling catalog cards for use by other
         | libraries. In the 1960s, the Library of Congress embarked on
         | various projects to computerize their catalog, leading to the
         | creation of the MARC format as a common metadata format for
         | exchanging bibliographic records. (And there is a straight line
         | between how card catalogs were put together and much of how
         | library metadata is conceptualized, although that's been
         | (slowly) changing.)
         | 
         | One problem is that bibliographic metadata from the Library of
         | Congress is mostly generated in-house, and LoC does not catalog
         | everything; not even close. In the late 1960s, OCLC, the
         | organization behind Worldcat, was started to operate a union
          | catalog. The idea was that libraries could download the
          | bibliographic records needed for their own catalogs ("copy
         | cataloging") and contribute new records for the unique stuff
         | they cataloged ("original cataloging"). Under the aegis of OCLC
         | as a non-profit organization, it was a pretty good deal for
         | libraries, and over time led to additional services such as
         | brokering interlibrary loan requests. After all, since Worldcat
         | had a good idea of the holdings of libraries in North America
         | (and over time, a good chunk of Europe and other areas), it was
         | straightforward to set up an exchange for ILL requests.
         | 
          | Tie this in with a general trend over the past couple of
          | decades of libraries decreasing the funding and staffing for
          | maintaining their local catalogs, and sharing the creation and
          | maintenance of library metadata has only become more
          | important.
         | 
         | However, OCLC has had a long history of trying to control
         | access and use of the metadata in WorldCat, to the point of
         | earning a general perception in many library quarters of trying
         | to monopolize it. To give a taste, Aaron Swartz tangled with
         | them back in the day. [1] One irony, among many, is that the
         | majority of metadata in Worldcat has its origins in the efforts
         | by publicly-funded libraries and as such shouldn't have been
          | enclosed in the first place. OCLC also has a focus on growing
          | itself, to the point where it does far more than run Worldcat.
          | Its various ventures have earned it a reputation for charging
          | high prices to libraries, such that it can be too expensive
          | for smaller libraries to participate in Worldcat.
         | (Fortunately for them, there are various alternative ways of
         | getting MARC records for free or very cheap, but nobody has a
         | database more comprehensive than Worldcat.)
         | 
         | That said, OCLC does do quite a bit itself to improve the
         | overall quality of Worldcat and to try to push libraries past
         | the 1960s-era MARC format. But one of the ironies of the
         | scraping is that it's not going to be immediately helpful to
         | the libraries who are unable to afford to participate in
         | Worldcat. This is because the scrape didn't (and quite possibly
         | never could have) capture the data in MARC format, which is
         | what most library catalog software uses. While MARC records
         | could be cross-walked from the JSON, they will undoubtedly omit
         | some data elements found in the original MARC.
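          | 
          | To make the crosswalk point concrete, a field-level mapping
          | from the scraped JSON into MARC tags might look like the
          | sketch below (the JSON keys are hypothetical; the tags are the
          | standard ones: 020 ISBN, 100 main author, 245 title, 264
          | publication). It also shows how much simply gets dropped
          | relative to a full MARC record.
          | 
          |     def json_to_marc_fields(rec: dict) -> dict:
          |         # keep only what the scrape can supply
          |         marc = {}
          |         if rec.get("isbn"):
          |             marc["020$a"] = rec["isbn"]
          |         if rec.get("author"):
          |             marc["100$a"] = rec["author"]
          |         if rec.get("title"):
          |             marc["245$a"] = rec["title"]
          |         if rec.get("publisher"):
          |             marc["264$b"] = rec["publisher"]
          |         if rec.get("year"):
          |             marc["264$c"] = str(rec["year"])
          |         return marc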
         | 
         | [1] http://www.aaronsw.com/weblog/oclcreply
        
           | wayathr0w wrote:
            | If you liked the comment-length analysis of OCLC and want
            | more, there's a whole essay on the subject. [1]
           | 
           | >But one of the ironies of the scraping is that it's not
           | going to be immediately helpful to the libraries who are
           | unable to afford to participate in Worldcat. This is because
           | the scrape didn't (and quite possibly never could have)
           | capture the data in MARC format, which is what most library
           | catalog software uses. While MARC records could be cross-
           | walked from the JSON, they will undoubtedly omit some data
           | elements found in the original MARC.
           | 
           | While it would have been ideal to get all the data in MARC &
           | as many other formats as possible, I wonder how true this is
           | worldwide - many libraries don't use MARC or have a digital
           | catalog at all. Maybe there are some ways the data could be
           | processed that make it easier to integrate into such places,
           | but of course local needs/desires will vary widely.
           | 
           | [1] https://core.ac.uk/download/pdf/11883899.pdf - it was
           | also published in this book:
           | https://archive.org/details/radicalcatalogin0000unse
        
             | gmcharlt wrote:
             | > While it would have been ideal to get all the data in
             | MARC & as many other formats as possible, I wonder how true
             | this is worldwide - many libraries don't use MARC or have a
             | digital catalog at all. Maybe there are some ways the data
             | could be processed that make it easier to integrate into
             | such places, but of course local needs/desires will vary
             | widely.
             | 
             | Indeed, MARC is not universal (and for that matter, it
             | wouldn't surprise me if at this point the majority of
             | records in Worldcat were _not_ derived from MARC sources),
             | and there are certainly non-MARC library catalog platforms
              | out there. That said, as the growth of Koha shows, for
              | better or worse MARC has become close to a global baseline
              | for a lot of libraries.
        
               | ahi wrote:
               | Worse, definitely worse.
        
         | [deleted]
        
         | kinos wrote:
          | They probably have a good interface for personal library
          | tracking.
        
           | gmcharlt wrote:
           | It's not a LibraryThing or GoodReads; it's meant for
           | libraries that are institutions. That said, I don't think
            | there is anything stopping an individual from signing up and
            | entering their collection, but there would be no point in
           | paying the fees unless you had (say) a unique scholarly
           | collection and wanted to lend books to other libraries - and
           | if so, in the long run you'd likely be better off seeing if a
           | library wanted to acquire your collection.
        
         | justaguitarist wrote:
         | Disclaimer, I was a Linux admin at OCLC for a few years. The
         | WorldCat database has been around since the early 70s, so I
         | think that helps the numbers a bit. I don't have any insight
         | into their marketing/sales/end user experience though.
        
       ___________________________________________________________________
       (page generated 2023-10-04 23:00 UTC)