[HN Gopher] Show HN: I scraped 3B Goodreads reviews to train a b...
       ___________________________________________________________________
        
       Show HN: I scraped 3B Goodreads reviews to train a better
       recommendation model
        
       Hi everyone,

       For the past couple months I've been working on a website with two
       main features:

       - https://book.sv - put in a list of books and get recommendations
         on what to read next from a model trained on over a billion
         reviews
       - https://book.sv/intersect - put in a list of books and find the
         users on Goodreads who have read them all (if you don't want to be
         included in these results, you can opt-out here:
         https://book.sv/remove-my-data)

       Technical info available here: https://book.sv/how-it-works

       Note 1: If you only provide one or two books, the model doesn't have
       a lot to work with and may include a handful of somewhat unrelated
       popular books in the results. If you want recommendations based on
       just one book, click the "Similar" button next to the book after
       adding it to the input book list on the recommendations page.

       Note 2: This is uncommon, but if you get an unexpected non-English
       titled book in the results, it is probably not a mistake and it very
       likely has an English edition. The "canonical" edition of a book I
       use for display is whatever one is the most popular, which is
       usually the English version, but this is not the case for all books,
       especially those by famous French or Russian authors.
        
       Author : costco
       Score  : 151 points
       Date   : 2025-11-05 17:50 UTC (1 days ago)
        
 (HTM) web link (book.sv)
 (TXT) w3m dump (book.sv)
        
       | thinkcontext wrote:
       | I'm impressed! It didn't take many books for it to start
       | suggesting other books that I liked and it showed me several
       | solid choices I'm adding to my queue.
        
       | aj_hackman wrote:
       | Thank you! Because of this, "The Making of Prince of Persia:
       | Journals 1985-1993" by Jordan Mechner is on its way to my house.
        
         | qingcharles wrote:
         | You definitely will not regret that purchase. It's a very
         | enjoyable read.
        
       | jamesponddotco wrote:
       | The recommendations are pretty good; even though I only input six
       | books, it was enough for it to recommend books I have on my wish
       | list. Definitely going to play around some more. Plus, the
       | website is super fast, very impressive.
       | 
       | Any chance we could get an API going at some point? Are you
       | planning to open source the work?
       | 
       | I'm interested in the scraping of Goodreads too. I'm building a
       | book metadata aggregation API and plan on building a scraper for
       | Goodreads, but I imagine using a data center IP address will be a
       | problem very fast. Were you scraping from your home network?
        
         | costco wrote:
         | Thank you for the compliments :) I used 50-100 datacenter
         | proxies. I just logged requests made by the iOS app with
         | Charles and then recreated the headers to the best of my
         | ability though the server did not seem to be very strict at
         | all. Worth noting though that static residential proxies are
         | not too expensive these days anyways.
         | 
         | Re the API: The model does actually run fairly well on CPU so
         | it probably wouldn't be too expensive to serve. I guess if
         | there is demand for it I could do it. I think most social book
         | sites would probably like to own their recommendation system
         | though.
        
           | goatsi wrote:
           | Speaking of sustained scraping for AI services, I found a
           | strange file on your site: https://book.sv/robots.txt. Would
           | you be able to explain the intent behind it?
        
             | costco wrote:
             | I didn't want an agent to get stuck on an infinite loop
             | invoking endpoints that cost GPU resources. Those fears are
             | probably unfounded, so if people really cared I could
             | remove those. /similar is blocked by default because I
             | don't want 500000 "similar books for" pages to pollute the
             | search results for my website but I do not mind if people
             | scrape those pages.
        
           | dbl000 wrote:
           | I would love an API or the dataset if you could share it
           | somehow! Just to play around with my own book lists.
        
       | esafak wrote:
       | It is interesting that you chose a contextual recommender when
       | you would think book affinity is not very susceptible to context.
       | Did you try other models too?
        
       | skerit wrote:
       | Please make this for tv series too!
        
       | vessenes wrote:
       | OK, I just added books until you told me I had too many. Fun
       | idea! I have a couple of suggestions:
       | 
       | * UI - once someone clicks "Add" you really should remove that
       | item from the suggested list - it's very confusing to still see
       | it.
       | 
       | * Beam search / diversification -- Your system threw like 100
       | books at me of which I'd read 95 and heard of 2 of the other 3,
       | so it worked for me as a predictor of what I'd read, but not so
       | well for discovery.
       | 
       | I'd be interested in recommendations that pushed me into a new
       | area, or gave me a surprising read. This is easier to do if you
       | have a fairly complete list of what someone's read, I know. But
       | off the top of my head, I'm imagining finding my eigenfriends,
       | then finding books that are either controversial (very wide
       | rating differences amongst my fellow readers) or possibly
       | ghettoized, that is, some portion of similar readers also read
       | this X or Y subject, but not all.
       | 
       | Anyway, thanks, this is fun! Hook up a VLM and let people take
       | pictures of their bookshelf next.
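
The "controversial among similar readers" idea above can be read as ranking books by how widely ratings disagree. A minimal sketch, assuming the ratings from similar readers have already been grouped per book; `controversial_books`, `min_ratings`, and the sample data are hypothetical and not part of book.sv:

```python
from statistics import pstdev


def controversial_books(ratings_by_book, min_ratings=3):
    """Rank books by the spread of ratings among similar readers.

    ratings_by_book maps a title to the list of star ratings given by
    readers judged similar to the user (how "similar" is computed is left
    out of this sketch). Books with too few ratings are skipped because
    their spread is not meaningful.
    """
    scored = [
        (book, pstdev(ratings))
        for book, ratings in ratings_by_book.items()
        if len(ratings) >= min_ratings
    ]
    # Largest standard deviation first: the most divisive books lead.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings from a user's nearest neighbours.
sample = {
    "Divisive Novel": [5, 1, 5, 1],
    "Safe Pick": [4, 4, 4, 4],
    "Too Few Ratings": [5, 1],
}
ranking = controversial_books(sample)
```

Standard deviation is only one possible disagreement measure; it is used here because it is easy to explain, not because the thread names it.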
        
       | comrade1234 wrote:
       | I gave up on goodreads reviews. I've been burned too many times
       | by highly rated books that weren't that good. If you're into
       | (horny) ya romance fantasy then goodreads is great, but it's not
       | for me. I haven't really found a substitute.
        
         | jamesponddotco wrote:
         | I'm not into the social aspect, so Goodreads was never an
         | option, but Hardcover[1] seems like a pretty good alternative.
         | 
         | [1]: https://hardcover.app
        
         | owenversteeg wrote:
         | Any broadly used ratings system is total garbage. Goodreads
         | ratings, Google Maps ratings, Amazon reviews, Vivino for wine,
         | et cetera. Even assuming the reviews are real and genuine, most
         | people just aren't good at writing reviews, and the handful
         | that are often have wildly different criteria than you. Someone
         | already commented with one enthusiast site - and sure,
         | enthusiast sites are often better than the mainstream option
         | (see also: CellarTracker for wine) but honestly my advice is to
         | get good at determining the quality of the thing yourself. For
         | books there are a ton of hints about what you'll be getting.
         | "NYT Bestseller", "xyz book club", certain publishers, who's
         | quoted on the back, when was it published, who wrote it? All of
         | those things can help you rapidly identify books. I personally
         | dislike most modern books and prefer the "classics", so a lot
         | of this is only useful as a negative signal, but even then
         | there are positive signals, for example a reference to a much
         | older book.
        
         | HeinzStuckeIt wrote:
         | GR is also great if you are into academic nonfiction, Classics,
         | poetry, etc. The site does, after all, let you track and review
         | any publication with an ISBN. What my peers and I use it for is
         | worlds apart from the romance novel or LGBT young-adult book
         | reviewing community that often puts GR in the news, and far
         | away from all the drama that rages around genre fiction.
        
       | noir_lord wrote:
       | It has a tendency to recommend books in the same series as are
       | input (putting aside that if I like a book in a series I've
       | likely already read the series).
       | 
       | It did suggest Murderbot Diaries (not on the input but a series I
       | have read and did like) and an Adrian Tchaikovsky I hadn't read
       | :).
        
         | bananaflag wrote:
         | Yeah the hardest problem for recommendation systems is to find
         | non-Star Wars books which are like some specific Star Wars
         | books and unlike some other Star Wars books. I would say it's
         | AGI-complete ;)
        
           | noir_lord wrote:
           | Ironically that is one of the few uses where I've found an
           | LLM to actually be useful.
           | 
           | ChatGPT does a fairly good job at letting you negate/refine
           | whatever it was you were looking for.
        
         | costco wrote:
         | It's explicitly trained to predict the next book read in a
         | sequence, which is why you get that behavior. There's probably
         | a better way for me to handle it rather than having 5 books
         | from the same series tend towards the top though.
        
           | noir_lord wrote:
           | If you have the data to know the other books in a series
           | maybe split the results so you have "books in series" in one
           | column and "books not in a series mentioned" in the other but
           | other than that it did a better job than Kindle
           | recommendations which are often hilariously off the mark.
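
The two-column split suggested above could be sketched as follows. This assumes each book carries a series label, which the thread does not confirm is available; the `Book` class, its `series` field, and `split_by_series` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Book:
    title: str
    series: Optional[str] = None  # hypothetical field; None means standalone


def split_by_series(
    inputs: List[Book], recommendations: List[Book]
) -> Tuple[List[Book], List[Book]]:
    """Partition recommendations into (same series as an input, everything else)."""
    input_series = {book.series for book in inputs if book.series is not None}
    same_series: List[Book] = []
    other: List[Book] = []
    for book in recommendations:
        # Standalone books have series=None, which is never in input_series.
        if book.series in input_series:
            same_series.append(book)
        else:
            other.append(book)
    return same_series, other
```

Both lists keep the model's original ordering, so the "other" column is simply the recommendation list with the known series lifted out.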
        
       | walthamstow wrote:
       | Works pretty well with cookbooks. Very cool work.
       | 
       | One suggestion would be to make the search less strict on
       | diacritics. Searching for popular cook J. Kenji Lopez Alt was
       | only successful if I entered the correct O.
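
A common way to make search tolerant of diacritics is to fold both the query and the indexed text before comparing. A minimal sketch using only the Python standard library; `fold_diacritics` and `matches` are hypothetical helpers, and nothing here reflects how the site's search is actually configured:

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Casefold and strip combining marks so 'López' compares equal to 'Lopez'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, candidate: str) -> bool:
    """True when every folded query token occurs in the folded candidate."""
    folded_candidate = fold_diacritics(candidate)
    return all(token in folded_candidate for token in fold_diacritics(query).split())


print(fold_diacritics("J. Kenji López-Alt"))  # prints "j. kenji lopez-alt"
```

Folding at index time as well as query time keeps the comparison symmetric, so a query typed with or without the accent finds the same author.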
        
       | NitpickLawyer wrote:
       | Interesting. I tested it with sci-fi, and it definitely
       | recommends good books, but not sure how accurate it is at
       | surfacing the sub genres / themes. For example for [aurora -ksr,
       | seveneves, project hail mary, ender's game] it gave me dune.
       | Which is a great book, but not in the "first-ish contact" style I
       | hoped it would be.
       | 
       | Another thing I noticed is that it tends to recommend 2nd and 3rd
       | books in a series, which is a bit so-so. If I add the first book
       | in a series, I probably already read the whole series...
        
         | 28304283409234 wrote:
         | Came here to say this (recommending book 2 and 3 in a trilogy).
         | Great app otherwise!
        
       | qingcharles wrote:
       | I put in a bunch of books and hit recommendations and... I'd
       | already read 95% of them, so at least we know it works well!
       | (checking out the other 5% now)
       | 
       | p.s. one idea: when you click [Add] on the recommended books
       | list, it should remove it from that list
       | 
       | p.p.s. if there is a way to filter out the spam "Summary of ____"
       | books, that would be good too
        
         | jacquesm wrote:
         | I have a hard time remembering titles of books I've read if
         | they are not directly related to the subject matter. No problem
         | remembering the content though. With movies I remember both.
        
       | yoz-y wrote:
       | It works pretty well in the sense that after inputting only a few
       | quite diverse books it gave me recommendations for a lot of books
       | that I've already also read and enjoyed.
       | 
       | I would also really like a possibility to add negative signal. It
       | did also recommend books that seemed interesting to me but I
       | ultimately didn't like.
       | 
       | Overall quite impressive.
        
       | momocowcow wrote:
       | Whatever I put in, it wants me to read Sapiens :_(
        
         | oever wrote:
         | Can confirm. Stallman, Torvalds, Orwell, Harari
         | 
         | https://book.sv/#2300585,644416
        
       | skayvr wrote:
       | I've worked in recommender systems for a while, and it's great to
       | see them publicized.
       | 
       | SASRec was released in 2018, just after the Transformer paper, and
       | uses the same attention mechanism but different losses than LLMs.
       | Any plans to upgrade to other item/user prediction models?
        
         | costco wrote:
         | I'm not an expert by any means but as far as sequential
         | recommendations go, aren't SASRec and its derivatives pretty
         | much the name of the game? I probably should have looked into
         | HSTUs more. Also this / sparse transformers in general:
         | https://arxiv.org/pdf/2212.04120
        
           | bigskydog wrote:
           | Recommend OneRec which is an improvement of HSTU and it
           | recently became open source
        
           | skayvr wrote:
           | There's a few alternatives, but SASRec is a good baseline for
           | next-item recommendation. I'd look at BERT4Rec too. HSTU is
           | definitely a strong step forward, but stays in the domain of
           | ID models. HSTU also seems to rely heavily on some extra item
           | information that SASRec does not (timestamps).
           | 
           | Other models include Google's TIGER model which uses a VAE to
           | encode more information about items. Similar to how modern
           | text-to-voice operates.
        
             | costco wrote:
             | Thank you for the recommendations. I didn't try BERT4Rec
              | because I assumed it would perform the same as or worse
              | than what I already had after having read
             | https://dl.acm.org/doi/pdf/10.1145/3699521. The TIGER paper
             | seems interesting - I definitely want to explore semantic
             | IDs in general and also because I think it could allow
             | including more long-tail items.
        
       | varenc wrote:
       | I love this site, and the approach! Great seeing someone making
       | good use of Goodreads data.
       | 
       | Sadly my experience with the book recommender isn't too great
       | because of the 64-book limit. If I import either the most recent
       | or least recent 64 books, 95% of the books it recommends to me are
       | books I've read. Though it was helpful for spotting a few books
       | I've read that I didn't log on Goodreads. Guess I'm pretty
       | consistent.
        
         | costco wrote:
         | I think I will expand the input books limit (sadly requires
         | retraining) and/or the output books limit of 30.
        
       | nsypteras wrote:
       | I'm impressed it recommended so many books I've already read and
       | liked! I have a big reading backlog but once it's whittled down I
       | will likely come back to this. One feature request would be to
       | also show a "why this is recommended" for each recommendation so
       | I can further narrow down the list for what I'm looking for
        
       | mcbrit wrote:
       | I don't know. I entered, trying to be popular but at least
       | slightly? opinionated:
       | 
       | Tigana, Hyperion, A Fire Upon the Deep, Blindsight, Moby Dick
       | 
       | and I got a list. Sure, read all that or wasn't interested for
       | reasons, I added (only Neuromancer on initial recommendations):
       | 
       | Neuromancer, VALIS, Quantum Thief, Towing Jehovah.
       | 
       | List did not get more interesting.
       | 
       | Book recommendations are still kind of difficult.
        
         | mcbrit wrote:
         | If I provide that list, a (real) person doesn't ask me if I've
         | read the Hobbit.
        
         | teaearlgraycold wrote:
         | I don't think past liked books are nearly enough information to
         | provide a good book for you today. You need a lot more
         | information about the state of someone's mind.
        
           | mcbrit wrote:
           | You're talking to a dude. (in my case.) I mentioned 8 books.
           | 
           | I won't tell you exactly what to do, but one way to do it is
           | to measure your surprise with me choosing each of those 8
           | books when you provide a recommendation back to me of what I
           | should read next. I think I get kind of that experience
           | talking to someone about books.
           | 
           | The algorithm didn't do that.
        
             | teaearlgraycold wrote:
             | Talking to someone about books gives you so much more
             | information than a book list. Their expressions, their
             | accent, their energy level, their clothes, and many other
             | things help to provide supplemental information.
        
       | submeta wrote:
       | Like the idea! Wondering: Weren't the early LLMs trained on data
       | in Goodreads as well? I can upload and ask ChatGPT as well, and
       | it will give me similar recommendations, no?
        
       | djoldman wrote:
       | Can you share the details about the Meilisearch instance? How big
       | is the box and database size?
        
         | costco wrote:
         | Everything (namely Meilisearch, Postgres and the web server in
         | Go) besides the model inference is running on a Hetzner server
         | with a large SSD and an "AMD Ryzen 7 3700X 8-Core Processor."
         | The data.ms directory is about 40GB. Once the HN traffic dies
         | down I will probably move the model back to the Hetzner server
         | so I don't have to pay $0.15/hour for an A4000.
        
       | __alexander wrote:
       | Care to share the scraped data? I would love to play around with
       | it.
        
         | demaga wrote:
         | I am not sure about the legal side of things here, but a Kaggle
         | dataset would be really cool
        
         | costco wrote:
         | Not sure if I can. At the very least book descriptions most
         | likely could not be distributed. There is an academic dataset
         | with around 200M reviews though:
         | https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html
        
         | guelo wrote:
         | I'm surprised he got that much data. Goodreads uses several
         | tricks to try to stop scrapers, for example pagination only
         | works up to a few pages.
        
           | jacquesm wrote:
           | They might send him a bill for use of resources.
        
       | MattGrommes wrote:
       | This is cool but I'd love the option to filter out the author of
       | the book you entered. I put in Shroud by Adrian Tchaikovsky and
       | almost all the books are others by him, which is fine but doesn't
       | really mix up the stuff I'm reading.
        
       | nwhnwh wrote:
       | I entered "Alone Together: Why We Expect More from Technology and
       | Less from Each Other" and I received books about Steve Jobs,
       | Harry Potter and "The Subtle Art of Not Giving a F*ck". Like
       | how???
        
         | costco wrote:
         | If you want recommendations solely based on one book, please
         | try the similar page: https://book.sv/similar?id=13566692
         | 
         | These seem to fit the description you are going for better. The
         | model is trained to predict the next book in the sequence.
         | Those other books you listed happen to be very popular, so in
         | the absence of information about you (only having 1 book), the
         | model will tend to recommend those.
        
         | BeetleB wrote:
         | > Provide 3+ books for best results.
        
       | jauntywundrkind wrote:
       | Where do nice scrapes like this end up? Are there BitTorrents out
       | there for scrapes like this?
       | 
       | Honestly this would finally be the web2.0 we all wanted & hoped
       | for. It's against majesty that it's all captured owned user
       | content that is legally captured by essentially public message
       | boards/sites.
        
       | jimmoores wrote:
       | I unexpectedly liked this. I thought the recommendations were
       | actually useful.
        
         | parkersweb wrote:
         | I sadly didn't share that experience - I fed it my goodreads
         | most recent - but it largely picked up on 2 or 3 series I've
         | been slowly working my way through so that most of the
         | recommendation list was ALL the other books in the series (and
         | the spin-off series) so I didn't really get anything useful...
        
       | dbl000 wrote:
       | Echoing what everyone else has said here - awesome site, love how
       | fast it was.
       | 
       | I did notice that when I put in a single book in a series (in my
       | case Going Postal, Discworld #33) that tended to dominate the
       | rest of the selection. That does make sense, but I don't want
       | recommendations for a series I'm already well into.
       | 
       | Also noticed that a few books (Spycraft by Nadine Akkerman and
       | Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are
       | in goodreads and reviewed didn't show up in the search. I tried
       | both the author's name and the title of the book. Maybe they
       | aren't in the dataset.
       | 
       | It did stumble with some more niche books (The Complete Yes
       | Minister). Trying the "Similar" button gave me more books that
       | were _technically_ similar because they were novelizations of
       | British comedy shows, but not what I was looking for.
       | 
       | For more common books though it lined up very well with books
       | already on my wishlist!
        
         | costco wrote:
         | Yes I would say the handling of series is probably the biggest
         | problem. Once my test metrics got to a point I was happy with
         | and my quality spot checks passed (can I follow the model's
         | recommendations from one generic history book to Steven
         | Runciman, making sure popular books don't always dominate the
         | results), I was ready to release because I had been working on
         | this project for so long. The solution is probably using the
         | transformer model to generate 100-200 candidates and then
         | having a reranker on top.
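
The candidates-then-reranker idea could take many forms. One simple possibility is a greedy pass that discounts a candidate each time its series is already represented, either in the input list or among the picks so far. This is a sketch of that one option, not the author's design; `rerank`, `series_penalty`, and the tuple layout are hypothetical, and the default `top_k=30` only mirrors the output limit mentioned elsewhere in the thread:

```python
from collections import Counter


def rerank(candidates, input_series, top_k=30, series_penalty=0.5):
    """Greedily pick candidates, discounting series that are already represented.

    candidates: list of (title, model_score, series_or_None) tuples, e.g. the
    100-200 items produced by the sequence model.
    input_series: series names of the books the user entered.
    """
    seen = Counter(input_series)  # how often each series has appeared so far

    def adjusted(candidate):
        _, score, series = candidate
        if series is None:
            return score  # standalone books are never penalised
        return score - series_penalty * seen[series]

    remaining = list(candidates)
    ranked = []
    while remaining and len(ranked) < top_k:
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        ranked.append(best[0])
        if best[2] is not None:
            seen[best[2]] += 1  # later books from this series now cost more
    return ranked


# Hypothetical model output for a user who entered one "Saga" book.
candidates = [
    ("Saga #2", 0.90, "Saga"),
    ("Saga #3", 0.85, "Saga"),
    ("Standalone", 0.60, None),
    ("Saga #4", 0.80, "Saga"),
]
print(rerank(candidates, ["Saga"], top_k=3))
# prints ['Standalone', 'Saga #2', 'Saga #3']
```

Setting `series_penalty=0` reproduces the raw model order, which makes it easy to compare the two behaviours side by side.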
        
       | xkbarkar wrote:
       | Have nothing to add that hasn't already been commented. Like the
       | entries in the add list stay. Other than that, my recommendation
       | list keeps coming up with books I have already read and loved and
       | I am hitting the limit :(.
       | 
       | So filtering would be great.
       | 
       | I have seen a few versions of the same books listed more than
       | once.
       | 
       | Loved this. Hope you get to tune it a little.
       | 
       | Also, thank you for not ruining the site with a single popup,
       | email subscription list offer, chatbot, wheelspin from hell
       | anywhere.
       | 
       | Blessings from the popup hating part of the interwebs.
        
       | _virtu wrote:
       | Hey OP I'm building a bookclub app. Do you happen to have an api
       | I could plug into? I'd love to add this to our member suggestions
       | section.
        
       | androng wrote:
       | I tried to import my book list with "Import goodreads" button and
       | inputting https://www.goodreads.com/user/show/68515148-andrew but
       | it said "import failed, see console"
        
         | costco wrote:
         | Worked for me, could be due to server being overwhelmed
         | 
         | Here is the URL with your books:
         | https://book.sv/#52752877,46049530,18437030,52480873,3260654...
        
       | blehn wrote:
       | You should filter out authors from the input books in the output.
       | If I liked a book by an author, surely I'd read more of their work
       | if I wanted to -- recommending them isn't helpful. Along the same
       | lines, I think interesting recommendations tend to be the ones
       | that (1) I like and (2) I didn't expect. The more similar the
       | recommendations are to the input, the more likely I already know
       | them, and the more likely to create a recommendation echo
       | chamber.
        
       | sodality2 wrote:
       | This is fantastic!!! I've added many results to my want-to-read
       | list, they're very on-point from very few inputs. It would be
       | really cool to import from a user ID, where you can choose some
       | subset of your read list to inspire new suggestions, while
       | excluding all books in your want-to-read and already-read lists.
       | But that's an ongoing scrape to maintain, it's a cat and mouse
       | game you probably don't want to start. I wonder what the legal
       | status of scraped training data is... if you don't reproduce any
       | of the review data I presume you're fine?
        
         | costco wrote:
         | You can import the first or last 64 books of your read,
         | to-read, or currently-reading shelves if you press the "Import
         | Goodreads" button and provide your Goodreads ID.
        
           | sodality2 wrote:
           | D'oh, didn't even notice that button :P Wow, that greatly
           | improved the recommendations, it even found a book I wouldn't
           | say is particularly related to the others but _I_ found it
           | interesting-sounding. Thanks for such a cool site!!
        
       | stevage wrote:
       | This is great. would be really nice to be able to reject
       | suggestions though.
        
       | nickthesick wrote:
       | I have a web app https://bookhive.buzz which is a GoodReads
       | alternative based on BlueSky's protocol. I scrape all of the book
       | data from Goodreads too.
       | 
       | I would love to be able to add a recommendation system based on
       | this.
        
       | tristor wrote:
       | Two bugs to know about. First, you are using a deprecated API
       | call that fails in Firefox. Second, you are using an HTTP
       | endpoint that fails to upgrade to HTTPS to call the GoodReads
       | API, which also fails with HTTPS-Only enabled in both Chrome and
       | Firefox.
       | 
       | The idea seems good, but since I can't import my GoodReads
       | successfully, it's hard for me to try
        
       | mscbuck wrote:
       | Awesome site and speed!
       | 
       | My advice from someone who has built recommendation systems: Now
       | comes the hard part! It seems like a lot of the feedback here is
       | that it's operating pretty heavily like a content-based system,
       | which is fine. But this is where you can probably start
       | evaluating on other metrics like serendipity, novelty, etc. One
       | of the best things I did for recommender systems in production is
       | having different ones for different purposes, then aggregating
       | them together into a final. Have a heavy content-based one to
       | keep people in the rabbit hole. Have a heavy graph based to try
       | and traverse and find new stuff. Have one that is heavily tuned
       | on a specific metric for a specific purpose. Hell, throw in a
       | pure TF-IDF/BM25/Splade based one.
       | 
       | The real trick of rec systems is that people want to be
       | recommended things differently. Having multiple systems that you
       | can weigh differently per user is one way to be able to achieve
       | that, usually one algorithm can't quite do that effectively.
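
The comment above does not say how the separate recommenders were aggregated. One widely used option is weighted reciprocal rank fusion, sketched here under that assumption; `fuse_rankings`, the recommender names, and the weights are hypothetical, and `k=60` is the constant commonly used with this technique:

```python
from collections import defaultdict


def fuse_rankings(rankings, weights, k=60):
    """Merge several ranked lists with weighted reciprocal rank fusion.

    rankings: {recommender_name: [item, ...]} with the best item first.
    weights:  {recommender_name: float}; missing names default to 1.0, and
              the weights could differ per user.
    """
    scores = defaultdict(float)
    for name, ranked_items in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, item in enumerate(ranked_items, start=1):
            # Items near the top of any list contribute the most.
            scores[item] += weight / (k + rank)
    # Sort by fused score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs from a content-based and a graph-based recommender.
rankings = {
    "content": ["a", "b", "c"],
    "graph": ["c", "d", "a"],
}
print(fuse_rankings(rankings, {"content": 1.0, "graph": 3.0}))
# prints ['c', 'a', 'd', 'b']
```

Because fusion works on ranks rather than raw scores, the individual systems do not need comparable score scales, which helps when mixing very different models.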
        
       | fennec-posix wrote:
       | Very neat. Even found a couple Cold War-setting books to read and
       | an entire series of 6 books on the same topic, all from searching
       | up Team Yankee.
       | 
       | Thanks for the new reading list :D
        
       ___________________________________________________________________
       (page generated 2025-11-06 23:00 UTC)