[HN Gopher] Where is all the book data?
       ___________________________________________________________________
        
       Where is all the book data?
        
       Author : hkhn
       Score  : 52 points
       Date   : 2022-11-10 08:02 UTC (1 days ago)
        
 (HTM) web link (www.publicbooks.org)
        
       | epaulson wrote:
       | The book data that I wish was more accessible was bibliographic
       | data. I wish there was a cheap ISBN API (cheap enough for an
       | individual to afford) that I could use to look up all of the data
       | for my book from just a barcode scan. I know there are some API
       | providers for that, but the plans are clearly meant for big users
       | and not for someone who just wants to use it a couple hundred
       | times.
       | 
       | This would be something the Library of Congress should run, or
       | maybe one of the university library consortia, like the formerly-
       | named Committee on Institutional Cooperation (which has renamed
       | itself the 'Big Ten Academic Alliance', because football:
       | https://btaa.org/library/Libraries)
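       | 
       | For the barcode half of this, a minimal sketch, assuming the
       | scanner hands back the digits printed under the barcode (for
       | books, normally the ISBN-13). The function names are made up for
       | illustration; the check-digit rule is the standard ISBN-13 one
       | (weights alternating 1 and 3, total divisible by 10). It only
       | validates and normalizes the number before any lookup happens.

```python
def isbn13_check_digit(first12: str) -> str:
    """Return the ISBN-13 check digit for the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn13(isbn: str) -> bool:
    """True if the string is a 13-digit ISBN with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    return isbn13_check_digit(digits[:12]) == digits[12]


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an older ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]  # the old ISBN-10 check digit is dropped
    return core + isbn13_check_digit(core)


if __name__ == "__main__":
    print(is_valid_isbn13("978-0-306-40615-7"))
    print(isbn10_to_isbn13("0306406152"))
```

       | Rejecting misreads locally like this keeps a couple hundred
       | scans from turning into wasted API calls.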
        
         | dredmorbius wrote:
         | The dominant force in that department has been the OCLC.
         | 
         | I submitted an article yesterday about its own power-grab
         | regarding bibliographic metadata, "Let the Metadata Wars
         | Begin":
         | 
         | <https://scholarlykitchen.sspnet.org/2022/06/22/oclc-sues-
         | cla...>
         | 
         | <https://news.ycombinator.com/item?id=33556442>
         | 
         | That lists a number of resources:
         | 
         | National Libraries <https://www.dnb.de/EN/Ueber-
         | uns/Presse/ArchivPM2015/metadate...> (linked data:
         | <https://www.loc.gov/item/lcwaN0018834/>)
         | 
         | Harvard <https://hangingtogether.org/harvard-bibliographic-
         | data-relea...>
         | 
         | MetaDoor <https://meli.org.il/wp-content/uploads/2021/12/Chani-
         | Yehuda_...> (from Clarivate, subject of the lawsuit headlining
         | this article)
         | 
         | I'm also aware of the Internet Archive's Open Library
         | (initiated by Aaron Swartz, see below), and some Wikidata
         | efforts based largely on ISBN. And of course the almost wholly
         | useless HathiTrust.
         | 
         | On Open Library data: <https://openlibrary.org/help/faq/using>
         | 
         | Wikipedia Book Sources:
         | <https://en.wikipedia.org/wiki/Special:BookSources/>
         | 
         | There's also the lawsuit by OCLC against Clarivate:
         | 
         | <https://www.infodocket.com/2022/06/15/oclc-files-lawsuit-
         | aga...>
         | 
         | And ... searching "OCLC" in the HN archives turns up numerous
         | other references, including Aaron Swartz (miss you, guy),
         | "Stealing your Library: The OCLC Powergrab":
         | 
         | <https://web.archive.org/web/20081218092812/www.aaronsw.com/w...>
         | 
         | <https://news.ycombinator.com/item?id=362769> (2008)
         | 
         | Algolia search for "OCLC" on HN:
         | <https://hn.algolia.com/?q=oclc>
        
           | fatneckbeardz wrote:
           | yeah. why i left the "library industry" and now work 'for the
           | man'.
           | 
           | the library industry should have embraced open source in the
           | 90s but they just never "got it". they seem to think they
           | just need to be involved in some hyper expensive vendor
           | projects and somehow that will bring them validation.
           | 
           | i worked in this tiny library with a few tens of thousands of
           | books and they were paying for a system that used oracle as
           | the database. so basically they were paying for oracle. this
           | was 20 years ago.
           | 
           | then there is JSTOR and the whole Aaron Swartz thing. JSTOR
           | acted really inappropriately
           | 
           | like i feel really bad about what has happened to libraries
           | over the past 20 years, with funding cut to the bone. but
           | they kind of did it to themselves by thinking that "serving
           | the public" means shoveling the public's money to proprietary
           | vendors for no apparent reason. like OCLC has no right to
           | take information generated by public institutions that are
           | almost entirely funded by local taxpayers and somehow claim
           | ownership of that information, and act like a monopolistic
           | for-profit corporation.
           | 
           | there are a lot of very innovative library people doing stuff
           | like maker spaces and kids' education despite all the
           | hardships but.... this revolution has not made it to the
           | 'library industry'.
        
             | dredmorbius wrote:
             | NB: JSTOR, to its credit, Larry Lessig says "great credit"
             | (<https://lessig.tumblr.com/post/40347463044/prosecutor-as-
             | bul...>), didn't pursue prosecution of Swartz. M.I.T.,
             | however, certainly did (also noted by Lessig). From
             | Abelson's report, commissioned by MIT:
             | 
             |  _If the Review Panel is forced to highlight just one issue
             | for reflection, we would choose to look to the MIT
             | administration's maintenance of a "neutral" hands-off
             | attitude that regarded the prosecution as a legal dispute
             | to which it was not a party. This attitude was complemented
             | by the MIT community's apparent lack of attention to the
             | ruinous collision of hacker ethics, open-source ideals,
             | questionable laws, and aggressive prosecutions that was
             | playing out in its midst. As a case study, this is a
             | textbook example of the very controversies where the world
             | seeks MIT's insight and leadership. A friend of Aaron
             | Swartz stressed in one of our interviews that MIT will
             | continue to be at the cutting edge in information
             | technology and, in today's world, challenges like those
             | presented in Aaron Swartz's case will arise again and
             | again. With that realization, "Neutrality on these cases is
             | an incoherent stance. It's not the right choice for a tough
             | leader or a moral leader."_
             | 
             | <http://swartz-report.mit.edu/docs/report-to-the-
             | president.pd...>
             | 
             | pp. 100-101
             | 
             | And the US DoJ and courts have bloody hands.
             | 
             | The careers of Ortiz and Heymann (DoJ) have suffered
             | somewhat: Heymann left the DoJ, Ortiz's ambitions for
             | higher office (reputedly she'd had interest in the Mass.
             | governorship) were thwarted. The judge remains on the US
             | District Court of Massachusetts.
        
         | danofsteel32 wrote:
         | The Open Library project https://openlibrary.org/developers has
         | a free ISBN API. I used it along with tesseract OCR and a
         | webcam to build a database of my physical books (218 total). I
         | never had any rate limiting issues.
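         | 
         | A rough sketch of the lookup step, not the exact code I ran.
         | It assumes the per-ISBN JSON endpoint at /isbn/<isbn>.json
         | behaves as described on the developers page; check there
         | before relying on it. isbn_lookup_url and fetch_edition are
         | names invented for this example, and the fields in the
         | returned record vary from edition to edition.

```python
import json
import urllib.request

OPEN_LIBRARY_BASE = "https://openlibrary.org"


def isbn_lookup_url(isbn: str) -> str:
    """Build the per-ISBN JSON URL (endpoint shape is an assumption, see above)."""
    # Keep digits and a possible trailing X (ISBN-10 check character); drop hyphens and spaces.
    cleaned = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX").upper()
    if len(cleaned) not in (10, 13):
        raise ValueError(f"expected a 10- or 13-character ISBN, got {isbn!r}")
    return f"{OPEN_LIBRARY_BASE}/isbn/{cleaned}.json"


def fetch_edition(isbn: str, timeout: float = 10.0) -> dict:
    """Fetch the edition record for one ISBN. Performs a network request."""
    with urllib.request.urlopen(isbn_lookup_url(isbn), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL only; call fetch_edition() to actually hit the service.
    print(isbn_lookup_url("978-0-306-40615-7"))
```

         | For a shelf of a couple hundred books, calling
         | fetch_edition(isbn).get("title") in a loop with a short sleep
         | between requests is the polite way to do it.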
        
       | eslaught wrote:
       | Oh, hey. I was just looking at this article the other day.
       | 
       | I think this article makes some good points, but it's a little
       | too absolutist about this data. As far as I can tell, _if you are
       | an author or industry professional_ , you can get access to the
       | following data:
       | 
       | * If you want data on your own books: sign up for Amazon Author
       | Central and they'll give you the BookScan data on your books.
       | This is free. https://press.aboutamazon.com/2010/12/weekly-
       | nielsen-booksca...
       | 
       | * If you want data on comps (i.e., comparable books, or books you
       | are competing against): sign up for Publishers Marketplace and
       | pay for the monthly package ($25/month on top of the PM
       | subscription). This gives you the ability to track 5 ISBNs (and I
       | assume, you can pick new ISBNs every month).
       | https://www.publishersmarketplace.com/bookscan/about.cgi#dat...
       | 
       | * If you want public library checkout data: as linked in the
       | article, go to the Seattle Public Library. Free.
       | https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-y...
       | 
       | The situation only really gets ugly if you want access to broad
       | market data (i.e., across all ISBNs for a given time window, and
       | covering a majority of retail outlets). The best I'm able to find
       | is this comment by Kristen McLean from NPD:
       | https://countercraft.substack.com/p/no-most-books-dont-sell-...
       | 
       | But this data is (a) limited, and (b) I think it has some pretty
       | serious issues [1]. I sent an email to Kristen to try to address
       | this, but so far no response. (If anyone has any connections that
       | might help, please contact me!)
       | 
       | And if you want to get access to the data yourself, you're
       | talking about something to the tune of $2,500 USD. And the terms
       | are pretty restrictive.
       | https://www.publishersmarketplace.com/bookscan/about.cgi#pri...
       | https://www.publishersmarketplace.com/bookscan/terms.shtml
       | 
       | I am actively working on improving this situation, and I've got
       | some ideas for what we could do while still abiding by NPD's
       | terms. If that's something that interests you, please contact me
       | (see my profile).
       | 
       | [1]: My issue with Kristen's analysis is that it follows (in her
       | words) a "conveyor belt" pattern. That is, the time window is
       | fixed. Within that time window, some books have been on the
       | market 364 days. Some may have been on the market 1 day. So it's
       | not surprising that some books have very few sales: they may
       | simply have not been on the market long enough. And you can't
       | just say, "well, multiply the data by 2x to account for the
       | average case" because I'm pretty sure that doesn't work. But
       | without real data I can't fix this.
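       | 
       | To make the footnote concrete, here is a toy model, not real
       | data. Everything in it is an assumption for illustration: one
       | identical book released on each day of a 365-day window, 100
       | copies sold on its first day, and sales that are either flat or
       | shrink by a fixed percentage each day. It shows how much the
       | fixed window undercounts, and that the size of the correction
       | depends on the shape of the sales curve.

```python
def windowed_sales(release_day, window_days=365, first_day_sales=100.0, daily_decay=1.0):
    """Sales counted inside a fixed window for a book released on `release_day`.

    Day 0 is the first day of the window. `daily_decay` is the factor applied
    to sales each day after release (1.0 means flat sales). Toy model only.
    """
    days_on_market = window_days - release_day
    return sum(first_day_sales * daily_decay ** d for d in range(days_on_market))


def mean_capture_ratio(window_days=365, daily_decay=1.0):
    """Average share of a full window's sales that the window actually records,
    for identical books released one per day across the window."""
    full = windowed_sales(0, window_days, daily_decay=daily_decay)
    observed = [
        windowed_sales(day, window_days, daily_decay=daily_decay)
        for day in range(window_days)
    ]
    return sum(observed) / (len(observed) * full)


if __name__ == "__main__":
    # Same book, different release dates, very different recorded totals.
    print(windowed_sales(1), windowed_sales(364))
    # Share of full-window sales recorded on average, flat vs. front-loaded sales.
    print(mean_capture_ratio(daily_decay=1.0), mean_capture_ratio(daily_decay=0.97))
```

       | With flat sales the window records about half of a full year on
       | average, so doubling fixes the average but still not any single
       | book. With sales decaying 3% a day it records about 0.91, so
       | doubling would badly overshoot. The right factor depends on a
       | sales curve that only the real data can show.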
        
       | afandian wrote:
       | Not about sales data, per this article, but bibliographic
       | metadata. Check out POSI, the Principles for Open Scholarly
       | Infrastructure:
       | 
       | https://openscholarlyinfrastructure.org/
       | 
       | There are people dedicated to open metadata and open systems to
       | work with it.
       | 
       | https://openscholarlyinfrastructure.org/posse/
       | 
       | (I work at Crossref)
        
       | kmeisthax wrote:
       | One of the bigger complaints about AI art generation is that "oh,
       | it'll become a closed loop system, and then we'll all be sitting
       | at our chairs watching a neural network spew art all day while
       | human artists starve to death". This is kind of funny because, if
       | Public Books' article here is even remotely true, the existing
       | publishing system is _already_ a closed loop. Publishers only
       | commission or purchase works that match the particular taste
       | profiles that are already trained into the sales data. If you
       | want to make something new, the publishing companies have already
       | boycotted and cancelled you.
        
         | bombcar wrote:
         | There is an out which is self-publishing. That's now entirely
         | cheap (read: you can do it yourself for free).
         | 
         | You don't get the publisher money, but it's an option available
         | for you, and you have somewhat the same chance (read: zero) of
         | striking it rich and becoming popular.
        
       | dglass wrote:
       | While I agree with the general argument in the article that sales
       | data and similar metrics should be public, I think there's a lot
       | more that can be done to unlock all of the knowledge stored in
       | books. There are vast amounts of knowledge that humanity has
       | built up over centuries that are either hard to find or hard to
       | access unless you know where to look. How does someone like me
       | discover that knowledge for a topic I'm interested in?
       | 
       | I wrote a book that was recently published to help junior and
       | mid-level programmers build up their soft skills to advance their
       | careers[0]. The book was published by Holloway[1]. They have an
       | interesting platform to solve this problem, which is why I chose
       | to publish with them. They publish works primarily through their
       | online reader, which is indexable by search engines. So someone
       | searching for "How to get up to speed on a new codebase" in their
       | preferred search engine could stumble across the chapter titled
       | "How to read unfamiliar code"[2] and read a free preview of the
       | book. Over time, people can discover and access the knowledge
       | stored in any book that is published on Holloway's platform.
       | 
       | Another nice side effect of the platform is that it can be
       | updated over time, so outdated knowledge or content can be
       | revised, updated, and re-indexed by the search engines as
       | knowledge about topics evolves.
       | 
       | If you're considering writing a book, or have a manuscript and
       | are looking for a publisher, I'd recommend giving Holloway a look
       | to see if it would be a good fit.
       | 
       | [0]: https://www.holloway.com/b/junior-to-senior
       | 
       | [1]: https://www.holloway.com/
       | 
       | [2]: https://www.holloway.com/g/junior-to-senior/sections/how-
       | to-...
        
       ___________________________________________________________________
       (page generated 2022-11-11 23:01 UTC)