[HN Gopher] Building an Open Source Decentralized E-Book Search ...
       ___________________________________________________________________
        
       Building an Open Source Decentralized E-Book Search Engine
        
       Author : j2qk3b
       Score  : 216 points
       Date   : 2024-03-11 11:56 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | boredumb wrote:
       | Many moons ago I wanted to do something similar for AI data sets
       | and models over IPFS. I don't know the future for IPFS but I do
       | hope the essence of a p2p data sharing infrastructure becomes
       | more accessible to help individuals tackle some of the issues
       | with large datasets with less hardware on hand.
       | 
       | https://github.com/JakeKalstad/IPFSPytorchDataset
       | https://github.com/JakeKalstad/load_ipfs_pytorch_model
        
       | Mortiffer wrote:
       | Could you detail how you populate the search index and what you
       | expect the memory limits to be?
        
       | MrThoughtful wrote:
       | What on earth is this about?
       | 
       | "I was recommended ... Liber3 ..., which uses ENS domain names
       | ... running on ENS and IPFS ... they appear to be using Glitter
       | ... a ... service built with Tendermint."
       | 
       | This sounds like a signal from outer space to me. In a language
       | used in a different galaxy.
       | 
       | I tried that Liber3 thing, but whatever I do, I get "Oops!
       | Something went wrong. Please refresh or try again later".
       | 
       | What is this all about?
        
         | WolfeReader wrote:
         | The title is the de-jargonized version. It's a set of
         | instructions to build an open-source ebook search engine.
         | (Admittedly there is still some jargon in that description, but
         | not to the level of naming specific libraries.)
         | 
         | The bulk of the article is implementation details, helpfully
         | hyperlinked.
        
       | droopyEyelids wrote:
       | The title got me really excited that they were doing full text
       | search. Boy that would be an awesome project. Zlib and Google
       | Books do it, but it would be great to have an open source version
       | that everyone could contribute to, and that provided access to
       | full texts.
        
         | raybb wrote:
         | OpenLibrary does provide search access to full texts. For
         | example:
         | https://openlibrary.org/search/inside?q=%22institutional+thi...
         | 
         | It is open source and they're always looking for contributors.
         | I think they'd especially welcome help improving search!
         | 
         | https://github.com/internetarchive/openlibrary/
        
       | devops000 wrote:
       | Cool! Could it be used for torrent searching? Like running web
       | torrent with video streaming and a decentralized search engine.
        
         | j2qk3b wrote:
         | Yes! Try this one: https://anybt.eth.limo/
         | 
         | I will build an open sourced version too!
        
           | hanniabu wrote:
           | Nice to find eth.limo being used in the wild
        
       | throwawayyyyyy2 wrote:
       | And then realize it has existed for almost 15 years and it's
       | called libgen.rs
        
         | spondylosaurus wrote:
         | Anna's Archive is even better!
        
           | brevitea wrote:
           | IMO, the more the merrier. That's the joy of decentralization
           | and P2P.
        
           | tamimio wrote:
           | It seems they are using Flask in their code, just to show you
           | don't need to go crazy with your stack to build useful software.
        
       | ValleZ wrote:
       | Is this an actual search engine or just a front end which builds
       | "select from" queries?
        
       | carlosjobim wrote:
       | There's 13 search engines in a dozen if you only want book title
       | or author. What's lacking is a search index of the content of
       | e-books. Something that will soon be incredibly important in the
       | face of generative AI. Somebody here on HN told me it only takes
       | a laptop to index the content of millions of books, while other
       | people say the scope is almost impossible.
       | 
       | Is there any project working on this?
        
         | bt1a wrote:
         | Perhaps the initial creation of the index is indeed something
         | that an average laptop could accomplish, but I'd imagine that
         | frequently updating the index and serving requests against it
         | would be compute-intensive. I have nothing to back this up but
         | speculation. Would love to learn more!
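
The index being discussed here is usually an inverted index: a map from each term to the books that contain it. Below is a minimal sketch, assuming a toy in-memory structure; the class name `InvertedIndex`, the book IDs, and the sample sentences are all made up for illustration and are not taken from the linked project. It also shows why incremental updates can be cheap in principle: adding one book only touches that book's own terms.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy full-text index: maps each term to the set of book IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase, then split on runs of ASCII letters/digits.
        # Real engines typically also stem words and drop stop words.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, book_id, text):
        # Incremental update: only the new book's terms are touched.
        for term in set(self.tokenize(text)):
            self.postings[term].add(book_id)

    def search(self, query):
        # AND query: return the books that contain every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
# Hypothetical sample texts, standing in for extracted e-book contents.
index.add("book-1", "The pilgrim came unstuck in time and drifted.")
index.add("book-2", "The whale surfaced a long time ago.")
print(index.search("unstuck time"))
```

Serving at scale is a different problem from this sketch: a real deployment would keep postings on disk, compress them, and rank results, which is where most of the compute cost would likely go.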
        
         | CWuestefeld wrote:
         | I believe that Calibre, the popular and free ebook management
         | tool, now supports indexing the content of all books in your
         | library.
        
         | myco_logic wrote:
         | Depends on how beefy that laptop is...
         | 
         | I've been doing some local LLM stuff at work recently, and even
         | with the amazing advances in quantization lately, doing that
         | kind of stuff on a ThinkPad _is feasible_, but still strongly
         | inferior to just renting out a VPS with a couple 4090/H100s for
         | several hours.
         | 
         | The biggest thing with summarizing stuff is that most local LLM
         | models often don't have very big context-windows, so they have
         | trouble with larger texts like even a short Vonnegut novel (I
         | was just testing 'em with summarizing GitHub issues, and even
         | with a 16k token context window they still sometimes struggle
         | if there are a lot of comments).
         | 
         | There are probably smarter people than I who could get this
         | working on a Raspberry Pi though... ;)
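
A common workaround for the small context windows mentioned above is to split a long text into overlapping chunks and process each chunk separately. The sketch below is a rough illustration only: it assumes whitespace-separated words as a stand-in for tokens (real tokenizers count differently), and the function name `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Synthetic 450-word input, standing in for a long issue thread or novel.
sample = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(sample, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in one of them. Each chunk could then be summarized on its own and the partial summaries combined in a final pass.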
        
         | dmotz wrote:
         | I have a side project that aims to organize your ebook
         | highlight collections with on-device semantic search. [1] Right
         | now it only indexes your own content but I'd like to add a mode
         | that allows you to share your collection and let others find
         | relevant ideas via semantic search -- a discovery platform for
         | ideas found in books. It's open source if you want a sense of
         | how it works now. [2]
         | 
         | [1] https://emdash.ai/
         | 
         | [2] https://github.com/dmotz/emdash
        
       | neilv wrote:
       | This seems to be intended for IP piracy. Clarifying that in the
       | title would help.
       | 
       | I'm trying to encourage publishers and authors to offer
       | legitimate sales of DRM-free ebooks, so would prefer we try not
       | to have the term "ebook" associated with piracy.
        
         | RamblingCTO wrote:
         | Since when are ebooks piracy? I think that might only be you.
        
           | neilv wrote:
           | Title is "Building an Open Source Decentralized E-Book Search
           | Engine", and screenshot seems to suggest piracy.
        
             | t-3 wrote:
             | Nothing about the title suggests piracy, and the screenshot
             | doesn't show download links - hell, there aren't any actual
             | Harry Potter books in a search for "Harry Potter". Even if
             | it _were_ searching for files, free and legal ebooks are
             | ubiquitous, no copyright infringement necessary to make it
             | a worthwhile endeavor.
        
             | WolfeReader wrote:
             | Please be specific about how the screenshot advocates
             | piracy.
             | 
             | (Also, a personal preference: never use the phrase "seems
             | to suggest" again; if you're going to make an accusation,
             | be honest enough to actually make it.)
        
         | sureglymop wrote:
         | It's a search engine... What about it makes it specific to IP
         | piracy?
         | 
         | I actually understand your point well but I think it's even
         | more important not to group in any legitimate use of technology
         | with illegitimate use of it. Especially considering recent
         | events (lawsuits over Yuzu and Dolphin emulators).
        
         | citruscomputing wrote:
         | It does seem to be! Isn't that cool?
        
       ___________________________________________________________________
       (page generated 2024-03-11 23:00 UTC)