[HN Gopher] 1.3B Worldcat scrape and data science mini-competition
___________________________________________________________________
1.3B Worldcat scrape and data science mini-competition
Author : crtasm
Score : 183 points
Date : 2023-10-04 12:22 UTC (10 hours ago)
(HTM) web link (annas-blog.org)
(TXT) w3m dump (annas-blog.org)
| mannyv wrote:
| It's unclear what exactly the competition is about. Just to poke
| around the dataset?
| mannyv wrote:
| As a note, I wish I had enough space to mirror their library.
| Looking at this brings out the collector in me...a tendency
| that I've successfully suppressed. You can only keep so many
| terabytes of archive around.
| sterlind wrote:
| It's infuriating seeing non-profits gatekeep datasets that were
| compiled with grant money. At least Elsevier doesn't present
| itself as a charity.
|
| I was recently trying to get my hands on the Switchboard and
| Fisher conversational speech datasets. Both were funded by DARPA
| grants, and maintained by the non-profit LDC, which charges you
| thousands of dollars for access (and no discounts for individual
| researchers) - that is, if they'll even pay attention to you
| without a .edu email address. And both are standard corpora in
| the field of audio NLP, which makes replicating studies
| impossible.
|
| Sadly, I couldn't find any way to pirate the datasets - they're
| too niche. So I applaud the authors for sticking it to Worldcat
| and scraping their data.
| 3abiton wrote:
| I hope one day there will be a piratebay for datasets (the pile)
| and ai models (llama)
| hedora wrote:
| I looked into using Dewey Decimal for a hobby project. OCLC has a
| _de facto_ monopoly on it due to the Worldcat database. They're
| a non-profit, but they're supported by having libraries pay a
| subscription fee for Worldcat.
|
| Back when OCLC was founded, the idea that people would want to
| have a copy of a card catalog for personal use was laughable, so
| I'm sympathetic to the people that set up their funding model.
| It's far cheaper for a library to subscribe to Worldcat than to
| hire a team to maintain such a database, so it created a win-win
| situation.
|
| However, keeping the world's books' metadata a secret (and
| leaving control in the hands of a monopoly) is an anachronism.
|
| It's well past the time when someone (such as an international
| coalition of Libraries of Congress) should figure out how to
| sustainably fund OCLC while also releasing their work into the
| public domain.
| mgr86 wrote:
| I have been told that my organization developed a system, HABS,
| that pre-dated OCLC [0]. That OCLC used this system as an
| inspiration. However, I cannot confirm this. Closest I can do
| is to find a footnote that thanks Fred Kilgour, the founder of
| OCLC [1]. I should reach out to Koh, a friend of a friend,
| while she is still alive to confirm the story. Nevertheless, we
| have a collection of punch cards in a dusty room in an attic
| that was once the HABS system. I think it is a pretty
| fascinating legacy, and I wish it was better preserved.
|
| [0] https://journals.sagepub.com/doi/pdf/10.1177/106939716900400...
| [1] https://journals.sagepub.com/doi/abs/10.1177/106939717300800...
| gmcharlt wrote:
| Neat! I hope you can learn more from Koh.
|
| I know a bit about Henriette Avram and her work at LoC
| developing MARC, but it of course makes sense that other
| libraries were thinking along similar lines at the time.
| actuallyalys wrote:
| I suppose one way to do it would be to allow patrons of
| subscriber libraries to access the database dumps and API.
|
| The downside is that would still make it harder than necessary
| to access and leave some people out. The upside is that it's
| not that much of a change from their existing model. I'm sure
| there would also be concerns about database dumps being shared
| publicly, although Anna's Archive has already released their
| entire database, and I suspect most people who would pay for
| formal access wouldn't use an unauthorized copy. Ultimately, I
| suspect OCLC would still be resistant to this change, as it
| would feel like a huge shift, even if I'm not sure it would
| change much from their perspective.
| empthought wrote:
| OCLC is a nonprofit membership cooperative and would argue that
| it itself is that international coalition of national libraries
| and archives.
| IKantRead wrote:
| OCLC is a parasitic company masquerading as a "membership
| cooperative".
|
| Libraries (often publicly funded) produce all the work, OCLC
| claims ownership of the results of that work, Libraries pay
| to get it back (but they do get a discount if they
| contribute).
|
| The only reason OCLC continues to exist is because libraries
| don't have the support or resources to fight them. It's very
| similar to the Elsevier issue in academic publishing, but
| OCLC does a better job with PR.
| empthought wrote:
| I mean you've clearly read Aaron Swartz's diatribes, but
| you also clearly have no clue about OCLC's business model.
| The catalog data is intellectually interesting, but the
| value is in the holdings data and more importantly the
| interlibrary loan service it enables.
|
| OCLC is exactly what happens when the libraries want to
| avoid another EBSCOhost or Proquest situation with ILL.
| neilv wrote:
| How does Anna's Archive keep all their lawyers from
| quitting?
|
| > _Even though OCLC is a non-profit, their business model
| requires protecting their database. Well, we're sorry to say,
| friends at OCLC, we're giving it all away. :-) [...]_
|
| > _This included a substantial overhaul of their backend systems,
| introducing many security flaws. We immediately seized the
| opportunity, and were able scrape hundreds of millions (!) of
| records in mere days. [...]_
|
| > _PS: We do want to give a genuine shout-out to the Worldcat
| team. Even though it was a small tragedy that your data was
| locked up, you did an amazing job at getting 30,000 libraries on
| board to share their metadata with you. As with many of our
| releases, we could not have done it without the decades of hard
| work you put into building the collections that we now liberate.
| Truly: thank you._
| Sebguer wrote:
| You realize they're not a business, right?
| RhodesianHunter wrote:
| For real. Openly bragging about exploiting security flaws to
| scrape out data en-masse, which undoubtedly put massive strain
| on back-end systems, is a far cry from what is considered legal
| (politely scraping public information).
| chatmasta wrote:
| I think this is the least of their concerns considering the
| rest of their activities. I guess they've got a sort of
| pirate's privilege in that they can openly brag about this
| stuff since they're already starting from the point of openly
| flouting the law.
|
| Also, I wouldn't be surprised if there simply _are_ no
| lawyers working at, for or with Anna's Archive.
| harveywi wrote:
| >Over the past year, we've meticulously scraped all Worldcat
| records. At first, we hit a lucky break. Worldcat was just
| rolling out their complete website redesign (in Aug 2022). This
| included a substantial overhaul of their backend systems,
| introducing many security flaws. We immediately seized the
| opportunity, and were able scrape hundreds of millions (!) of
| records in mere days.
|
| >After that, security flaws were slowly fixed one by one, until
| the final one we found was patched about a month ago. By that
| time we had pretty much all records, and were only going for
| slightly higher quality records.
|
| OCLC carelessly fiddlefarted around with their moat and lost it.
| Poof!
| RhodesianHunter wrote:
| I don't think anyone is (legally) going to prop up a business
| or non-profit using data that was admittedly taken from them
| using their security holes.
| itissid wrote:
| Noob Question: Isn't this going to be a great source for
| training Language models? Is it safe to assume that
| OpenAI/Google/Meta etc already have these?
|
| In any case great work!
| m00dy wrote:
| yes, of course. AA has a special program for llm developers.
| crtasm wrote:
| Worldcat is a database of books, not the books themselves. The
| summary and description text might be useful though?
| mannyv wrote:
| If you could somehow download the entire archive you could feed
| it into your LLM for training. This is a huge corpus and is
| sort of ill-gotten. That said, it would be pretty awesome.
|
| Google has this sort of thing already, since they have that
| whole "let's digitize the world's books" project. Interesting
| as to why Google never developed a ChatGPT, given that they
| literally have a large amount of the world's books digitized.
| crtasm wrote:
| Google launched Bard earlier this year.
| mannyv wrote:
| Yes, but why weren't they first?
| wordpad25 wrote:
| It's like asking why wasn't X invented earlier.
|
| Google and everyone else had no idea how successful LLMs
| could be until OpenAI did it.
| freewizard wrote:
| ISBN is the default ID when it comes to book-related projects.
| It is convenient, but not without its caveats. An often
| overlooked fact is that ISBN was introduced in the late 1960s, so
| books published before that obviously do not have one. Not all
| countries adopted ISBN from day one; some, like China, used
| their own catalog systems until the 1980s. And because ISBNs are
| usually managed centrally by government or commercial agencies,
| censorship for political or commercial reasons is not uncommon:
| some books could not get published at all, or may only see the
| world without an ISBN.
|
| For obvious reasons, older / non-English / suppressed books may
| be the ones that need more care when it comes to preserving.
| wiml wrote:
| A second issue is that ISBNs identify a specific SKU (different
| formats will have different ISBNs, different printings may even
| get different ISBNs, etc), but book-related projects typically
| want some way to identify "the same book" across all these
| different formats, printings, sometimes even editions and
| translations and collections. OCLC IDs are identifying a
| different space than ISBNs are.
| raybb wrote:
| What you're referring to is sometimes referred to as a "work"
| vs "edition"
|
| https://openlibrary.org/help/faq/editing#work-edition
| ahi wrote:
| It gets much much more complicated than that. There are
| never ending discussions about FRBR: Functional
| Requirements for Bibliographic Records.
| https://www.ifla.org/references/best-practice-for-national-b...
|
| * Work is defined as the intellectual or artistic content
| of a distinct creation. It refers to a very abstract idea
| of a creation e.g. Shakespeare's Romeo and Juliet and not a
| specific expression.
|
| * Expression is the intellectual or artistic realization of
| a work. The realization may take the form of text, sound,
| image, object, movement, etc., or any combination of such
| forms.
|
| * Manifestation is the embodiment of an expression of a
| work. For example a particular edition of a book or a
| specific music recording.
|
| * Item is a single exemplar of a manifestation. Cataloguing
| is generally done based on an item directly available to a
| cataloguer.
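The four FRBR levels quoted above nest inside one another, which can be easier to see as a data structure. What follows is only a minimal sketch: the class and field names are made up for illustration, the identifiers are placeholders rather than real ISBNs, and real FRBR has many more entities and relationships than this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar, e.g. one copy on one library's shelf."""
    holder: str
    barcode: str


@dataclass
class Manifestation:
    """A particular edition or release; roughly what an ISBN is meant to name."""
    isbn: str
    fmt: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One realization of a work, e.g. a specific translation."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract creation, independent of any text or edition."""
    title: str
    creator: str
    expressions: List[Expression] = field(default_factory=list)

    def isbns(self) -> List[str]:
        # Collect every manifestation identifier that belongs to this work.
        return [m.isbn for e in self.expressions for m in e.manifestations]


# Placeholder identifiers, not real ISBNs or barcodes.
romeo = Work("Romeo and Juliet", "William Shakespeare", [
    Expression("en", [
        Manifestation("ISBN-A", "hardcover", [Item("Library X", "0001")]),
        Manifestation("ISBN-B", "ebook"),
    ]),
    Expression("de", [Manifestation("ISBN-C", "paperback")]),
])
print(romeo.isbns())
```

Seen this way, "the same book" in the earlier comments is the Work at the top, while an ISBN only names one Manifestation near the bottom.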
| DoctorOetker wrote:
| > We scraped ISBNdb, and downloaded the Open Library dataset, but
| the results were unsatisfactory. The main problem was that there
| was not a ton of overlap of ISBNs.
|
| What prevents ISBN number collisions between authors? Is there a
| central authority assigning them, or is there say a national
| prefix, with each government assigning ISBNs for local
| publications (perhaps delegating this to another body in that
| nation).
|
| Surely such bodies would have the most complete view on all this
| data.
|
| It's also bizarre that this simple metadata is not available from
| whatever authority assigns ISBN numbers.
| yorwba wrote:
| Companies can get a range of ISBNs before deciding what to
| publish under each ISBN, or whether to publish something at
| all. So the authorities assigning ISBNs don't necessarily know
| what they're being used for.
| sevenseventen wrote:
| There are regional ISBN agencies. The US agency, Bowker,
| assigns ISBN prefixes by publisher, and publishers assign
| within their prefix as they please. They're supposed to use one
| ISBN per edition and format, but many publishers use ISBN as a
| kind of SKU so you can't 100% count on that.
|
| If that sounds sloppy...I went to publishing conferences fairly
| regularly from the late 90's into the teens, and I never saw a
| program that didn't have at least one session or panel titled
| something like "Publishers must improve their metadata
| practices."
| jszymborski wrote:
| Sounds a bit like how DOIs are assigned.
| bambax wrote:
| It's exactly like this. Publishers get a range, and do
| whatever they want with it.
|
| Also, some agencies sell the numbers range to publishers (I
| believe the US is in that case) and others give them away at
| no cost (France). As a result, some small publishers get
| ISBNs from more liberal agencies than their own country: one
| can never be 100% sure a French ISBN matches a French
| publisher for example.
|
| It's also possible some publishers re-use old numbers or
| affix the same number to different releases/editions of a
| book.
|
| It's a mess.
|
| But a global centralized system would probably be way worse,
| so we have to live with that mess.
| amelius wrote:
| Why not use UUIDs?
|
| People never enter ISBNs manually, anyways. So it might as
| well be longer string. Or a QR code.
| dwringer wrote:
| As someone who's worked in the field of used books, I can
| say from personal experience that's not quite true. It's
| pretty common to have to type in an ISBN for a variety of
| reasons. Many times the barcodes have been covered up or
| defaced, and many publications don't have barcodes in the
| first place.
| amelius wrote:
| It was true, for sure. But the question is, is it still
| true?
| bambax wrote:
| Schools often ask for a specific edition of a classic
| book and the only way to be reasonably sure you're buying
| the correct one is to search by ISBN.
| amelius wrote:
| Well, just copy and paste from the email you got from
| school.
|
| I mean, if IT can do anything, it is to solve this
| problem.
| skuxxlife wrote:
| For US/UK/NZ/Aus/SA, ISBNs are granted through Bowker who does
| maintain their "Books In Print" data set that, in theory,
| contains metadata for all of the ISBNs they've granted. In
| practice though it's a mess. It's expensive to access and
| relies on publishers to enter in accurate and consistent
| metadata, which is...variable in quality to say the least.
| Often publishers buy blocks of ISBNs to use later so no
| metadata is entered up front and has to be pushed back to
| Bowker at a later date. To be somewhat fair to Bowker, the
| history of ISBNs far predates modern data standards and I can
| imagine wrangling publishers to get accurate data is a
| difficult task. But on the other hand, you'd think they'd have
| a lot to gain for doing it right. As someone who runs a book
| website, it is endlessly frustrating.
| gmcharlt wrote:
| ISBNs are messy.
|
| The International ISBN Agency coordinates assigning ISBN ranges
| to national agencies, who in turn will assign subranges to
| publishers. The publishers in turn assign specific numbers to
| their own works. However, the international agency does not
| itself maintain a universal database of assigned ISBNs - the
| most it operates is a global database of publishers and their
| assigned ranges. And since it's the publishers who are
| assigning numbers from their allocations, various errors can
| crop up, including reusing ISBNs for different works and
| failing to issue distinct ISBNs for different formats. (For
| example, if you publish hardcover, paperback, and ebook
| versions of a book, you should assign three ISBNs. That rule is
| not always observed.)
|
| Also, libraries hold many books that long predate ISBNs; it
| wasn't until 1965 that the immediate predecessor of the ISBN,
| the SBN, was a twinkle in a bookseller's eye.
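Since publishers assign the numbers themselves, about the only thing one can verify about an ISBN without a metadata source is its checksum. Below is a minimal sketch of that check, plus conversion of ISBN-10 to ISBN-13 so that two datasets can be compared on one key. The checksum rules are the standard ones; the function names are invented here, and passing the checksum says nothing about whether the number was ever assigned or reused.

```python
from typing import Optional


def isbn10_check_digit(first9: str) -> str:
    """Check character for an ISBN-10: weights 10..2, modulo 11, 'X' means 10."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn13_check_digit(first12: str) -> str:
    """Check digit for an ISBN-13: alternating weights 1 and 3, modulo 10."""
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def _is_digits(s: str) -> bool:
    # Restrict to ASCII digits so int() cannot fail on exotic characters.
    return s.isascii() and s.isdigit()


def normalize_isbn(raw: str) -> Optional[str]:
    """Return a checksum-valid ISBN-13 string, or None if the input is invalid."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 10 and _is_digits(s[:9]) and isbn10_check_digit(s[:9]) == s[9]:
        body = "978" + s[:9]  # ISBN-10s map into the 978 prefix
        return body + isbn13_check_digit(body)
    if len(s) == 13 and _is_digits(s) and isbn13_check_digit(s[:12]) == s[12]:
        return s
    return None


print(normalize_isbn("0-306-40615-2"))
print(normalize_isbn("978-0-306-40615-8"))
```

With both datasets normalized this way, their overlap is a plain set intersection of the non-None results.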
| bambax wrote:
| Yes.
|
| And while in most countries you can't properly publish a book
| without an ISBN (ie, have it sold in bookshops), you can
| publish a Kindle book without it (if you opt to only offer
| the ebook).
|
| That leaves a huge part of publications completely out of the
| system. Kindle-only books are on Amazon servers and nowhere
| else.
| crtasm wrote:
| There's lots of agencies, they hand out blocks of numbers from
| their allocation. Seems there's no central database for the
| metadata:
|
| https://www.isbn-international.org/content/isbn-users-manual...
| ISBN_FAQs_to_7ed_Manual_Absolutely_final.docx
|
| > Will people in other countries be able to search for my books
| in search engines in those countries?
|
| > This does not happen automatically ... In order for your book
| to be listed in other countries you should contact the
| respective ISBN Agency and ask them for details of how to be
| entered into their national catalogue for books in circulation
| (books in print). Sometimes you will have to obtain a
| distributor from that country or have an address in that
| country before this is possible. In some circumstances in order
| to be listed, the book must be in the language of that country.
| As well as catalogues of books in circulation, you may also
| want to ensure that you are listed by internet retailers... .
| Again, you will need to contact each of these organisations
| directly (including each separate international branch) with
| details of your book.
| pagekicker wrote:
| This is theft.
| arboles wrote:
| Why wouldn't you download a car?
| qingcharles wrote:
| I can't get the .torrent file to work in my client. Can anyone
| give me a magnet link for it?
|
| I need the magazine ISSNs for my magazine encyclopedia.
|
| edit: got the .torrent working in qTorrent
| raybb wrote:
| From the end:
|
| > We do want to give a genuine shout-out to the Worldcat team.
| Even though it was a small tragedy that your data was locked up,
| you did an amazing job at getting 30,000 libraries on board to
| share their metadata with you.
|
| I wonder what the story is behind Worldcat getting so many
| libraries across the world on board? I don't know much about the
| software but it must be pretty compelling.
| gmcharlt wrote:
| It's not the software per se, which is generally fit for
| purpose but not amazing, but the traditions and economics
| underpinning how libraries maintain their bibliographic
| metadata.
|
| Libraries sharing metadata for their catalogs has a long
| history, dating back to at least 1902 when the Library of
| Congress started selling catalog cards for use by other
| libraries. In the 1960s, the Library of Congress embarked on
| various projects to computerize their catalog, leading to the
| creation of the MARC format as a common metadata format for
| exchanging bibliographic records. (And there is a straight line
| between how card catalogs were put together and much of how
| library metadata is conceptualized, although that's been
| (slowly) changing.)
|
| One problem is that bibliographic metadata from the Library of
| Congress is mostly generated in-house, and LoC does not catalog
| everything; not even close. In the late 1960s, OCLC, the
| organization behind Worldcat, was started to operate a union
| catalog. The idea is that libraries could download
| bibliographic records needed for their own catalogs ("copy
| cataloging") and contribute new records for the unique stuff
| they cataloged ("original cataloging"). Under the aegis of OCLC
| as a non-profit organization, it was a pretty good deal for
| libraries, and over time led to additional services such as
| brokering interlibrary loan requests. After all, since Worldcat
| had a good idea of the holdings of libraries in North America
| (and over time, a good chunk of Europe and other areas), it was
| straightforward to set up an exchange for ILL requests.
|
| Tie this with a general trend over the past couple decades of
| libraries decreasing the funding and staffing for maintaining
| their local catalogs, and the need for sharing in the creation and
| maintenance of library metadata has gotten only more important.
|
| However, OCLC has had a long history of trying to control
| access and use of the metadata in WorldCat, to the point of
| earning a general perception in many library quarters of trying
| to monopolize it. To give a taste, Aaron Swartz tangled with
| them back in the day. [1] One irony, among many, is that the
| majority of metadata in Worldcat has its origins in the efforts
| by publicly-funded libraries and as such shouldn't have been
| enclosed in the first place. OCLC also has a focus on growing
| itself, to the point where it does far more than run Worldcat.
| Its various ventures have earned itself a reputation for
| charging high prices to libraries, to the point where it can be
| too expensive for smaller libraries to participate in Worldcat.
| (Fortunately for them, there are various alternative ways of
| getting MARC records for free or very cheap, but nobody has a
| database more comprehensive than Worldcat.)
|
| That said, OCLC does do quite a bit itself to improve the
| overall quality of Worldcat and to try to push libraries past
| the 1960s-era MARC format. But one of the ironies of the
| scraping is that it's not going to be immediately helpful to
| the libraries who are unable to afford to participate in
| Worldcat. This is because the scrape didn't (and quite possibly
| never could have) capture the data in MARC format, which is
| what most library catalog software uses. While MARC records
| could be cross-walked from the JSON, they will undoubtedly omit
| some data elements found in the original MARC.
|
| [1] http://www.aaronsw.com/weblog/oclcreply
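To make the cross-walking idea concrete, here is a minimal sketch of mapping a flat JSON-style record onto a handful of MARC fields. The JSON keys (`isbns`, `author`, `title`, `publisher`, `year`) are assumptions for illustration and are not the actual schema of the scrape; the tags used (020 $a, 100 $a, 245 $a, 260 $b/$c) are standard MARC 21 bibliographic fields. Indicators, control fields and anything else absent from the JSON are simply not reconstructed, which is the kind of omission the comment describes.

```python
from typing import Dict, List, Tuple

MarcField = Tuple[str, Dict[str, str]]  # (tag, {subfield code: value})


def crosswalk(record: dict) -> List[MarcField]:
    """Map a flat record with hypothetical keys to a few MARC fields."""
    fields: List[MarcField] = []
    for isbn in record.get("isbns", []):
        fields.append(("020", {"a": isbn}))  # ISBN
    if record.get("author"):
        fields.append(("100", {"a": record["author"]}))  # main entry, personal name
    if record.get("title"):
        fields.append(("245", {"a": record["title"]}))  # title statement
    publication: Dict[str, str] = {}
    if record.get("publisher"):
        publication["b"] = record["publisher"]
    if record.get("year"):
        publication["c"] = str(record["year"])
    if publication:
        fields.append(("260", publication))  # publication information
    return fields


# Made-up sample record, not taken from the dataset.
sample = {
    "isbns": ["9780306406157"],
    "author": "Doe, Jane",
    "title": "An example title",
    "publisher": "Example Press",
    "year": 1999,
}
for tag, subfields in crosswalk(sample):
    print(tag, subfields)
```

Turning this into records a catalog system can load would still need a real MARC library and sensible defaults for everything the JSON does not carry.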
| wayathr0w wrote:
| If you liked the comment-length analysis of OCLC & want more,
| there's a whole essay on the subject. [1]
|
| >But one of the ironies of the scraping is that it's not
| going to be immediately helpful to the libraries who are
| unable to afford to participate in Worldcat. This is because
| the scrape didn't (and quite possibly never could have)
| capture the data in MARC format, which is what most library
| catalog software uses. While MARC records could be cross-
| walked from the JSON, they will undoubtedly omit some data
| elements found in the original MARC.
|
| While it would have been ideal to get all the data in MARC &
| as many other formats as possible, I wonder how true this is
| worldwide - many libraries don't use MARC or have a digital
| catalog at all. Maybe there are some ways the data could be
| processed that make it easier to integrate into such places,
| but of course local needs/desires will vary widely.
|
| [1] https://core.ac.uk/download/pdf/11883899.pdf - it was
| also published in this book:
| https://archive.org/details/radicalcatalogin0000unse
| gmcharlt wrote:
| > While it would have been ideal to get all the data in
| MARC & as many other formats as possible, I wonder how true
| this is worldwide - many libraries don't use MARC or have a
| digital catalog at all. Maybe there are some ways the data
| could be processed that make it easier to integrate into
| such places, but of course local needs/desires will vary
| widely.
|
| Indeed, MARC is not universal (and for that matter, it
| wouldn't surprise me if at this point the majority of
| records in Worldcat were _not_ derived from MARC sources),
| and there are certainly non-MARC library catalog platforms
| out there. That said, as the growth of Koha shows, for
| better or worse MARC has become close to a global
| baseline for a lot of libraries.
| ahi wrote:
| Worse, definitely worse.
| kinos wrote:
| they probably have a good interface for personal library
| tracking
| gmcharlt wrote:
| It's not a LibraryThing or GoodReads; it's meant for
| libraries that are institutions. That said, I don't think
| there is anything stopping an individual person signing up
| and entering their collection, but there would be no point in
| paying the fees unless you had (say) a unique scholarly
| collection and wanted to lend books to other libraries - and
| if so, in the long run you'd likely be better off seeing if a
| library wanted to acquire your collection.
| justaguitarist wrote:
| Disclaimer, I was a Linux admin at OCLC for a few years. The
| WorldCat database has been around since the early 70s, so I
| think that helps the numbers a bit. I don't have any insight
| into their marketing/sales/end user experience though.
___________________________________________________________________
(page generated 2023-10-04 23:00 UTC)