[HN Gopher] Where is all the book data?
___________________________________________________________________
Where is all the book data?
Author : hkhn
Score : 52 points
Date : 2022-11-10 08:02 UTC (1 days ago)
(HTM) web link (www.publicbooks.org)
(TXT) w3m dump (www.publicbooks.org)
| epaulson wrote:
| The book data that I wish was more accessible was bibliographic
| data. I wish there was a cheap ISBN API (cheap enough for an
| individual to afford) that I could use to look up all of the data
| for my book from just a barcode scan. I know there are some API
| providers for that, but the plans are clearly meant for big users
| and not for someone who just wants to use it a couple hundred
| times.
|
| This would be something the Library of Congress should run, or
| maybe one of the university library consortia, like the formerly-
| named Committee on Institutional Cooperation (which has renamed
| itself the 'Big Ten Academic Alliance', because football -
| https://btaa.org/library/Libraries )
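
A barcode scan on a book normally yields an ISBN-13 (older books may only print an ISBN-10), so a lookup tool would likely want to validate the scanned digits before calling any API. The following is a minimal sketch of that step using the standard ISBN check-digit rules; the function names are hypothetical and not part of any service mentioned in this thread.

```python
def isbn13_check_digit(first12: str) -> str:
    """Return the ISBN-13 check digit for the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... across the 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> str:
    """Normalize a scanned ISBN-10 or ISBN-13 to a validated ISBN-13.

    Raises ValueError on a wrong length or a bad check digit.
    (normalize_isbn is a hypothetical helper name.)
    """
    s = raw.replace("-", "").replace(" ", "").upper()

    if len(s) == 13 and s.isdigit():
        if isbn13_check_digit(s[:12]) != s[12]:
            raise ValueError(f"bad ISBN-13 check digit: {raw!r}")
        return s

    if len(s) == 10 and s[:9].isdigit() and (s[9].isdigit() or s[9] == "X"):
        # ISBN-10: weights 10 down to 1, 'X' stands for 10,
        # and the weighted sum must be divisible by 11.
        values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        if sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 != 0:
            raise ValueError(f"bad ISBN-10 check digit: {raw!r}")
        # Convert by prefixing 978 and recomputing the check digit.
        core = "978" + s[:9]
        return core + isbn13_check_digit(core)

    raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
```

For example, `normalize_isbn("0-306-40615-2")` returns `"9780306406157"`, and a mis-scanned digit raises `ValueError` instead of producing a wasted API call.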
| dredmorbius wrote:
| The dominant force in that department has been the OCLC.
|
| I submitted an article yesterday about its own power-grab
| regarding bibliographic metadata, "Let the Metadata Wars
| Begin":
|
| <https://scholarlykitchen.sspnet.org/2022/06/22/oclc-sues-
| cla...>
|
| <https://news.ycombinator.com/item?id=33556442>
|
| That lists a number of resources:
|
| National Libraries <https://www.dnb.de/EN/Ueber-
| uns/Presse/ArchivPM2015/metadate...> (linked data:
| <https://www.loc.gov/item/lcwaN0018834/>)
|
| Harvard <https://hangingtogether.org/harvard-bibliographic-
| data-relea...>
|
| MetaDoor <https://meli.org.il/wp-content/uploads/2021/12/Chani-
| Yehuda_...> (from Clarivate, subject of the lawsuit headlining
| this article)
|
| I'm also aware of the Internet Archive's Open Library
| (initiated by Aaron Swartz, see below), and some Wikidata
| efforts based largely on ISBN. And of course the almost wholly
| useless HathiTrust.
|
| On Open Library data: <https://openlibrary.org/help/faq/using>
|
| Wikipedia Book Sources:
| <https://en.wikipedia.org/wiki/Special:BookSources/>
|
| There's also the lawsuit by OCLC against Clarivate:
|
| <https://www.infodocket.com/2022/06/15/oclc-files-lawsuit-
| aga...>
|
| And ... searching "OCLC" in the HN archives turns up numerous
| other references, including Aaron Swartz (miss you, guy),
| "Stealing your Library: The OCLC Powergrab":
|
| <https://web.archive.org/web/20081218092812/www.aaronsw.com/w..
| .>
|
| <https://news.ycombinator.com/item?id=362769> (2008)
|
| Algolia search for "OCLC" on HN:
| <https://hn.algolia.com/?q=oclc>
| fatneckbeardz wrote:
| yeah. why i left the "library industry" and now work 'for the
| man'.
|
| the library industry should have embraced open source in the
| 90s but they just never "got it". they seem to think they
| just need to be involved in some hyper expensive vendor
| projects and somehow that will bring them validation.
|
| i worked in this tiny library with a few tens of thousands of
| books and they were paying for a system that used oracle as the
| database. so basically they were paying for oracle. this was
| 20 years ago.
|
| then there is JSTOR and the whole Aaron Swartz thing. JSTOR
| acted really inappropriately
|
| like i feel really bad about what has happened to libraries
| over the past 20 years, with funding cut to the bone. but
| they kind of did it to themselves by thinking that "serving
| the public" means shoveling the public's money to proprietary
| vendors for no apparent reason. like OCLC has no right to
| take information generated by public institutions that are
| almost entirely funded by local taxpayers and somehow claim
| ownership of that information, and act like a monopolistic
| for-profit corporation.
|
| there are a lot of very innovative library people doing stuff
| like maker spaces and kids education despite all the
| hardships but.... this revolution has not made it to the
| 'library industry'.
| dredmorbius wrote:
| NB: JSTOR, to its credit (Larry Lessig says "great credit",
| <https://lessig.tumblr.com/post/40347463044/prosecutor-as-
| bul...>), didn't pursue prosecution of Swartz. M.I.T.,
| however, certainly did (also noted by Lessig). From
| Abelson's report, commissioned by MIT:
|
| _If the Review Panel is forced to highlight just one issue
| for reflection, we would choose to look to the MIT
| administration's maintenance of a "neutral" hands-off
| attitude that regarded the prosecution as a legal dispute
| to which it was not a party. This attitude was complemented
| by the MIT community's apparent lack of attention to the
| ruinous collision of hacker ethics, open-source ideals,
| questionable laws, and aggressive prosecutions that was
| playing out in its midst. As a case study, this is a
| textbook example of the very controversies where the world
| seeks MIT's insight and leadership. A friend of Aaron
| Swartz stressed in one of our interviews that MIT will
| continue to be at the cutting edge in information
| technology and, in today's world, challenges like those
| presented in Aaron Swartz's case will arise again and
| again. With that realization, "Neutrality on these cases is
| an incoherent stance. It's not the right choice for a tough
| leader or a moral leader."_
|
| <http://swartz-report.mit.edu/docs/report-to-the-
| president.pd...>
|
| pp. 100-101
|
| And the US DoJ and courts have bloody hands.
|
| The careers of Ortiz and Heymann (DoJ) have suffered
| somewhat: Heymann left the DoJ, Ortiz's ambitions for
| higher office (reputedly she'd had interest in the Mass.
| governorship) were thwarted. The judge remains on the US
| District Court of Massachusetts.
| danofsteel32 wrote:
| The Open Library project https://openlibrary.org/developers has
| a free ISBN API. I used it along with tesseract OCR and a
| webcam to build a database of my physical books (218 total). I
| never had any rate limiting issues.
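
The lookup described above can be sketched roughly as follows. This assumes Open Library's `/isbn/<isbn>.json` edition endpoint and the field names `title`, `publishers`, `publish_date`, and `number_of_pages`; check both against the developer docs linked above before relying on them. The helper names are hypothetical, and the OCR/webcam step is not shown.

```python
import json
import urllib.request

OPEN_LIBRARY_BASE = "https://openlibrary.org"


def isbn_url(isbn: str) -> str:
    """Build the edition-lookup URL for an ISBN (endpoint shape is an assumption)."""
    cleaned = isbn.replace("-", "").replace(" ", "")
    return f"{OPEN_LIBRARY_BASE}/isbn/{cleaned}.json"


def summarize_edition(record: dict) -> dict:
    """Pick out a few fields from an edition record.

    The field names are assumptions about the response; .get() is used so a
    missing field yields None (or an empty list) rather than a KeyError.
    """
    return {
        "title": record.get("title"),
        "publishers": record.get("publishers", []),
        "publish_date": record.get("publish_date"),
        "pages": record.get("number_of_pages"),
    }


def lookup(isbn: str, timeout: float = 10.0) -> dict:
    """Fetch one edition over the network and summarize it."""
    with urllib.request.urlopen(isbn_url(isbn), timeout=timeout) as resp:
        return summarize_edition(json.load(resp))
```

Calling `lookup(...)` once per scanned barcode and appending the result to a local file or SQLite table would be enough for a personal collection of a few hundred books; pausing briefly between requests is a reasonable courtesy to a free service.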
| eslaught wrote:
| Oh, hey. I was just looking at this article the other day.
|
| I think this article makes some good points, but it's a little
| too absolutist about this data. As far as I can tell, _if you are
| an author or industry professional_ , you can get access to the
| following data:
|
| * If you want data on your own books: sign up for Amazon Author
| Central and they'll give you the BookScan data on your books.
| This is free. https://press.aboutamazon.com/2010/12/weekly-
| nielsen-booksca...
|
| * If you want data on comps (i.e., comparable books, or books you
| are competing against): sign up for Publishers Marketplace and
| pay for the monthly package ($25/month on top of the PM
| subscription). This gives you the ability to track 5 ISBNs (and,
| I assume, you can pick new ISBNs every month).
| https://www.publishersmarketplace.com/bookscan/about.cgi#dat...
|
| * If you want public library checkout data: as linked in the
| article, go to the Seattle Public Library. Free.
| https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-y...
|
| The situation only really gets ugly if you want access to broad
| market data (i.e., across all ISBNs for a given time window, and
| covering a majority of retail outlets). The best I'm able to find
| is this comment by Kristen McLean from NPD:
| https://countercraft.substack.com/p/no-most-books-dont-sell-...
|
| But this data is (a) limited, and (b) I think it has some pretty
| serious issues [1]. I sent an email to Kristen to try to address
| this, but so far no response. (If anyone has any connections that
| might help, please contact me!)
|
| And if you want to get access to the data yourself, you're
| talking about something to the tune of $2,500 USD. And the terms
| are pretty restrictive.
| https://www.publishersmarketplace.com/bookscan/about.cgi#pri...
| https://www.publishersmarketplace.com/bookscan/terms.shtml
|
| I am actively working on improving this situation, and I've got
| some ideas for what we could do while still abiding by NPD's
| terms. If that's something that interests you, please contact me
| (see my profile).
|
| [1]: My issue with Kristen's analysis is that it follows (in her
| words) a "conveyor belt" pattern. That is, the time window is
| fixed. Within that time window, some books have been on the
| market 364 days. Some may have been on the market 1 day. So it's
| not surprising that some books have very few sales: they may
| simply have not been on the market long enough. And you can't
| just say, "well, multiply the data by 2x to account for the
| average case" because I'm pretty sure that doesn't work. But
| without real data I can't fix this.
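
The fixed-window problem in footnote [1] can be illustrated with made-up numbers. In this sketch three hypothetical books all sell at the same rate of 2 units per day but entered the market at different points in a 365-day window. The data, the `Book` class, and the constant-sales-rate assumption are all invented for illustration; real sales are typically front-loaded, which is one reason a simple correction does not recover the true picture.

```python
from dataclasses import dataclass

WINDOW_DAYS = 365  # length of the fixed reporting window


@dataclass
class Book:
    title: str
    days_on_market: int  # days the book was on sale inside the window
    units_sold: int      # units sold inside the window


def flat_2x(book: Book) -> int:
    """The naive correction: double every book's windowed sales."""
    return book.units_sold * 2


def annualized(book: Book) -> float:
    """Scale by exposure: units per day on market, times the window length.

    Only meaningful under the (unrealistic) constant-rate assumption.
    """
    return book.units_sold / book.days_on_market * WINDOW_DAYS


# Hypothetical books, each selling exactly 2 units per day.
books = [
    Book("on sale all year", 365, 730),
    Book("released mid-year", 182, 364),
    Book("released yesterday", 1, 2),
]

for b in books:
    print(f"{b.title:20} raw={b.units_sold:4} 2x={flat_2x(b):5} "
          f"annualized={annualized(b):.0f}")
```

The raw counts (730, 364, 2) make identical books look wildly different, and the flat 2x multiplier (1460, 728, 4) preserves that distortion because it ignores how long each book was exposed. Only the per-book exposure scaling puts all three at 730, and it needs each book's on-sale date, which is exactly the kind of data the comment says is unavailable.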
| afandian wrote:
| Not about sales data, per this article, but bibliographic
| metadata. Check out POSI, the Principles for Open Scholarly
| Infrastructure:
|
| https://openscholarlyinfrastructure.org/
|
| There are people dedicated to open metadata and open systems to
| work with it.
|
| https://openscholarlyinfrastructure.org/posse/
|
| (I work at Crossref)
| kmeisthax wrote:
| One of the bigger complaints about AI art generation is that "oh,
| it'll become a closed loop system, and then we'll all be sitting
| at our chairs watching a neural network spew art all day while
| human artists starve to death". This is kind of funny because, if
| Public Books' article here is even remotely true, the existing
| publishing system is _already_ a closed loop. Publishers only
| commission or purchase works that match the particular taste
| profiles that are already trained into the sales data. If you
| want to make something new, the publishing companies have already
| boycotted and cancelled you.
| bombcar wrote:
| There is an out which is self-publishing. That's now entirely
| cheap (read: you can do it yourself for free).
|
| You don't get the publisher money, but it's an option available
| for you, and you have somewhat the same chance (read: zero) of
| striking it rich and becoming popular.
| dglass wrote:
| While I agree with the general argument in the article that sales
| data and similar metrics should be public, I think there's a lot
| more that can be done to unlock all of the knowledge stored in
| books. There are vast amounts of knowledge that humanity has
| built up over centuries that are either hard to find or hard to
| access unless you know where to look. How does someone like me
| discover that knowledge for a topic I'm interested in?
|
| I wrote a book that was recently published to help junior and
| mid-level programmers build up their soft skills to advance their
| careers[0]. The book was published by Holloway[1]. They have an
| interesting platform to solve this problem, which is why I chose
| to publish with them. They publish works primarily through their
| online reader, which is indexable by search engines. So someone
| searching for "How to get up to speed on a new codebase" in their
| preferred search engine could stumble across the chapter titled
| "How to read unfamiliar code"[2] and read a free preview of the
| book. Over time, people can discover and access the knowledge
| stored in any book that is published on Holloway's platform.
|
| Another nice side effect of the platform is that it can be
| updated over time, so outdated knowledge or content can be
| revised, updated, and re-indexed by the search engines as
| knowledge about topics evolves.
|
| If you're considering writing a book, or have a manuscript and
| are looking for a publisher, I'd recommend giving Holloway a look
| to see if it would be a good fit.
|
| [0]: https://www.holloway.com/b/junior-to-senior
|
| [1]: https://www.holloway.com/
|
| [2]: https://www.holloway.com/g/junior-to-senior/sections/how-
| to-...
___________________________________________________________________
(page generated 2022-11-11 23:01 UTC)