hngopher.com

       [HN Gopher] Xapian: Open source search engine library
       ___________________________________________________________________
        
       Author : Bluestein
       Score  : 147 points
       Date   : 2024-08-17 10:42 UTC (12 hours ago)
        
 (HTM) web link (xapian.org)
        
       | dvdkon wrote:
       | Xapian is nice. I've used it before to add interactive
       | autocomplete to a Python web app, since my previous favourite,
       | Whoosh, is unmaintained and somehow slower than grep on a folder
       | (I remember it being pretty fast years back, I'd love to know
       | what happened).
       | 
       | I'd say my favourite thing about Xapian is that it's just a
       | simple library you can embed in any app, no need for a separate
       | database and JVM tuning. For simple use cases and small-to-medium
       | datasets, it just works.
        
       | infocollector wrote:
       | This project has been around and maintained for more than a
       | decade! Small footprint, good speed. One downside for commercial
       | use might be the GPLv2 license.
        
         | Bluestein wrote:
         | In fact, AIUI its roots go back about 3 decades. The "about"
         | page has a nice historical overview.
        
         | the_mitsuhiko wrote:
         | The project at one point started tracking files for potential
         | rewrite to rid itself of GPL history. I used it many years ago
         | and quite enjoyed it (pre-Elasticsearch times), but
         | unfortunately the license situation didn't help the project
         | become popular.
        
         | frenchman99 wrote:
         | You can always build a small search webservice that you open
         | source and that your proprietary software calls out to,
         | removing the need to open source everything.
         | 
         | Linux is GPL too, and that didn't hinder companies from making
         | trillions on top of it.
        
           | synergy20 wrote:
           | You mean: don't compile it and link it into my application;
           | instead, wrap it as a separate program and call it via RPC,
           | remotely or locally?
        
             | frenchman99 wrote:
             | Yes, exactly that.
        
           | the_mitsuhiko wrote:
           | Linux is not a good example because of the syscall exception.
           | The licensing situation is not at all comparable, as Xapian's
           | original reason for existing was embedding.
        
         | synergy20 wrote:
         | That's true. I wonder if there is an alternative that is not GPL.
        
           | bearjaws wrote:
           | Sonic search https://github.com/valeriansaliou/sonic
           | 
           | Maybe not exactly the same: it's a server in which you can store
           | documents and then retrieve their IDs using a search string.
        
           | Bluestein wrote:
           | Elasticsearch and its Amazon fork OpenSearch, perhaps?
        
             | JackSlateur wrote:
             | Xapian is a library, while Elasticsearch has a client-server
             | model.
             | 
             | Xapian is more like SQLite, while Elasticsearch would be
             | MariaDB.
        
               | Bluestein wrote:
               | Thanks for the spot-on, very illuminating comparison.
        
               | inertiatic wrote:
               | Lucene, which is what ES builds upon, is a library with
               | bindings in languages other than Java, and it's Apache
               | licensed.
        
         | rbanffy wrote:
         | For what kind of use do you think GPLv2 would be a blocker?
        
         | donio wrote:
         | A lot more than a decade. I've been using it for 15 years and it
         | was a very mature project even then. Repo history goes back to
         | 1999 and according to the history page the project's roots go
         | back to the 80s. A bit like Postgres in this respect.
         | 
         | https://xapian.org/history
         | https://sigir.org/files/forum/S2000/MUSCAT_note.pdf
        
       | jfmc wrote:
       | Xapian is used in https://www.djcbsoftware.nl/code/mu/ for
       | indexing emails.
        
         | calvinmorrison wrote:
         | And by Fastmail
        
           | nathell wrote:
           | And by Notmuch https://notmuchmail.org/
        
             | Bluestein wrote:
             | In itsel, Notmuch, a very interesting tool.-
        
               | Bluestein wrote:
               | s/itsel/itself/
        
       | openrisk wrote:
       | Also used by Recoll, the desktop search app:
       | https://www.recoll.org/
        
         | nanna wrote:
         | I use recoll to index and search thousands of pdfs. Because I
         | always have the author name in the filename I can filter
         | queries like this:
         | 
         | Cybernetics OR steering filename:Heidegger ext:pdf
         | 
         | It's an absolute power tool.
        
           | nickpsecurity wrote:
           | I do it like this:
           | 
           | Title Year Author Name.pdf
           | 
           | Same benefits as you mentioned. You can also filter by time
           | that way.
        
             | nanna wrote:
             | My way is to do
             | 
             | year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf
             | 
             | This means that files are automatically organised by year
             | of publication, that I can search by tag name, and that I
             | don't have to escape chars in the terminal. One day I hope
             | to get round to building an Emacs mode to filter by the
             | different elements.
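
Editor's sketch: the naming scheme above can be split back into its elements with a few string operations. This is only an illustration of the pattern as described in the comment, not the commenter's own tooling. The names `BookName` and `parse_book_filename` are hypothetical, as is the example filename in the usage note.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class BookName:
    """Elements of a 'year__authors~~title~subtitle##tags.pdf' filename."""
    year: int
    authors: list[str]
    title: str
    subtitle: str | None = None
    tags: list[str] = field(default_factory=list)


def parse_book_filename(filename: str) -> BookName:
    """Split a filename that follows the scheme into its elements.

    Assumed delimiters (taken from the pattern in the comment):
    '__' after the year, '_' between authors, '~~' before the title,
    '~' before an optional subtitle, '##' before the tags, '#' between tags.
    """
    stem = PurePath(filename).stem                # drop the final extension
    body, _, tag_part = stem.partition("##")      # tags are optional
    head, sep, title_part = body.partition("~~")
    if not sep:
        raise ValueError(f"no '~~' title separator in {filename!r}")
    year, sep, author_part = head.partition("__")
    if not sep or not year.isdigit():
        raise ValueError(f"no 'year__' prefix in {filename!r}")
    title, _, subtitle = title_part.partition("~")  # subtitle is optional
    return BookName(
        year=int(year),
        authors=author_part.split("_"),
        title=title,
        subtitle=subtitle or None,
        tags=[tag for tag in tag_part.split("#") if tag],
    )
```

For instance, `parse_book_filename("1980__maturana_varela~~autopoiesis-and-cognition##biology.pdf")` would yield two authors, no subtitle, and a single tag, which is enough to filter a directory listing by any of the elements.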
        
           | donio wrote:
           | If your PDFs have Author properties then you might be able to
           | do "author:Heidegger" too. The Recoll PDF filter extracts
           | some of these fields and if I remember right it can be
           | configured to extract additional custom properties too.
        
         | Beijinger wrote:
         | Recoll is magic...
        
       | rbanffy wrote:
       | I remember having used, a very long time ago, a self-hosted
       | search engine on my library of PDFs, and it was unbelievably
       | useful.
       | 
       | I dream about a similar thing that can do OCR on scanned docs and
       | extract text from my also sprawling library of epub and mobi
       | files. If someone builds something like this, with maybe a
       | _LOCAL_ LLM to extract text descriptions from photos and movies
       | as well as indexing metadata for everything, subtitles from
       | movies and lyrics for songs, and add that to a NAS appliance,
       | it'd be a killer.
        
         | andyfilms1 wrote:
         | Evernote will do this: you can feed it a bunch of PDFs and
         | other documents, and it will OCR them and make them all
         | searchable. It's not perfect, but you can also add manual tags
         | for things you know are important.
        
           | rbanffy wrote:
           | At some point I'd love to further train an LLM on all my PDFs
           | and be able to ask it questions.
        
         | kordlessagain wrote:
         | I have most of the code for doing this; it just needs to be
         | rewritten for local storage (I was running it on Google Cloud).
         | I need to pick something that doesn't run Solr as a service for
         | local use. With Ollama, we have function calls running, so it
         | should be doable. I was also thinking about using Open WebUI
         | as the interface.
        
         | glompers wrote:
         | DEVONthink 3 [0] (Apple only) will do most of that although I
         | don't keep up at all with its interoperability with LLM
         | extensions.
         | 
         | [0] https://www.devontechnologies.com/apps/devonthink
        
         | Bluestein wrote:
         | > extract text descriptions from photos and movies as well as
         | indexing metadata for everything, subtitles from movies and
         | lyrics for songs
         | 
         | The big AI players are probably already scraping the bottom of
         | this "barrel" in their search for training data, I am sure ...
        
         | theolivenbaum wrote:
         | That's what our app does: curiosity.ai, local index, support
         | for many file types and apps out of the box, and integrated
         | local OCR, STT and even a local LLM.
        
           | Bluestein wrote:
           | That is the future :) Much success!
        
           | rmholt wrote:
           | I couldn't find any mention on your website about LOCAL LLMs
           | and according to your FAQ, it requires an account with your
           | website.
           | 
           | Is there a way to run curiosity.ai fully offline, without
           | an account on your servers?
        
         | pdw wrote:
         | > similar thing that can do OCR on scanned docs
         | 
         | It's only part of what you want, but ocrmypdf will add an OCRed
         | text layer to PDF files, making the text selectable and
         | indexable.
        
       | importsaas wrote:
       | Much better docs here at
       | https://getting-started-with-xapian.readthedocs.io/en/latest...
        
       | donio wrote:
       | Love Xapian, been using it for many years via notmuch (mail) and
       | recoll (document indexing, mainly PDFs in my case).
       | 
       | It's been trouble free and very performant, a real workhorse.
       | 
       | https://notmuchmail.org/
       | 
       | https://www.recoll.org/
        
       ___________________________________________________________________
       (page generated 2024-08-17 23:01 UTC)