[HN Gopher] Xapian: Open source search engine library
___________________________________________________________________
Xapian: Open source search engine library
Author : Bluestein
Score : 147 points
Date : 2024-08-17 10:42 UTC (12 hours ago)
(HTM) web link (xapian.org)
(TXT) w3m dump (xapian.org)
| dvdkon wrote:
| Xapian is nice. I've used it before to add interactive
| autocomplete to a Python web app, since my previous favourite,
| Whoosh, is unmaintained and somehow slower than grep on a folder
| (I remember it being pretty fast years back, I'd love to know
| what happened).
|
| I'd say my favourite thing about Xapian is that it's just a
| simple library you can embed in any app, no need for a separate
| database and JVM tuning. For simple use cases and small-to-medium
| datasets, it just works.
| infocollector wrote:
| This project has been around and maintained for more than a
| decade! Small footprint, good speed. One downside might be GPL v2
| for commercial use.
| Bluestein wrote:
| In fact, AIUI its roots go back about 3 decades. The "about"
| page has a nice historical overview.-
| the_mitsuhiko wrote:
| The project at one point started tracking files for potential
| rewrite to rid itself of GPL history. I used it many years ago
| and I quite enjoyed it (pre-Elasticsearch times) but
| unfortunately the license situation didn't help the project to
| become popular.
| frenchman99 wrote:
| You can always build a small search webservice that you open
| source and that your proprietary software calls out to,
| removing the need to open source everything.
|
| Linux is GPL too, and that didn't hinder companies from making
| trillions on top of it.
| synergy20 wrote:
| you mean, don't compile it and link it within my application,
| instead wrap it as a separate program then call it via rpc
| remotely or locally?
| frenchman99 wrote:
| Yes, exactly that.
| the_mitsuhiko wrote:
| Linux is not a good example because of the syscall exemption.
| The licensing situation is not at all comparable as xapian's
| original point of existence was embedding.
| synergy20 wrote:
| That's true. I wonder if there is an alternative that is not GPL.
| bearjaws wrote:
| Sonic search https://github.com/valeriansaliou/sonic
|
| Maybe not exactly the same: it's a server in which you can store
| documents and then retrieve their IDs using a search string.
| Bluestein wrote:
| Elasticsearch and its Amazon fork OpenSearch perhaps?
| JackSlateur wrote:
| Xapian is a library, while Elasticsearch has a client-server
| model.
|
| Xapian is more like SQLite, while Elasticsearch would be MariaDB.
| Bluestein wrote:
| Thanks for the spot-on, very illuminating comparison.-
| inertiatic wrote:
| Lucene, which is what ES builds upon, is a library with
| bindings in languages other than Java, and it's Apache
| licensed.
| rbanffy wrote:
| For what kind of use do you think GPLv2 would be a blocker?
| donio wrote:
| A lot more than a decade. I've been using it for 15 years and it
| was a very mature project even then. Repo history goes back to
| 1999 and according to the history page the project's roots go
| back to the 80s. A bit like Postgres in this respect.
|
| https://xapian.org/history
| https://sigir.org/files/forum/S2000/MUSCAT_note.pdf
| jfmc wrote:
| Xapian is used in https://www.djcbsoftware.nl/code/mu/ for
| indexing emails.
| calvinmorrison wrote:
| And by Fastmail
| nathell wrote:
| And by Notmuch https://notmuchmail.org/
| Bluestein wrote:
| In itself, Notmuch is a very interesting tool.-
| openrisk wrote:
| Also used by Recoll, the desktop search app:
| https://www.recoll.org/
| nanna wrote:
| I use Recoll to index and search thousands of PDFs. Because I
| always have the author name in the filename I can filter
| queries like this:
|
| Cybernetics OR steering filename:Heidegger ext:pdf
|
| It's an absolute power tool.
| nickpsecurity wrote:
| I do it like this:
|
| Title Year Author Name.pdf
|
| Same benefits as you mentioned. You can also filter by time
| that way.
| nanna wrote:
| My way is to do
|
| year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf
|
| This means that files are automatically organised by year
| of publication, that I can search by tag name, and that I
| don't have to escape chars in the terminal. One day I hope
| to get round to building an Emacs mode to filter by the
| different elements.
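
The naming scheme above could be split back into its elements with a few string partitions. This is only a minimal sketch in Python, not anything from the thread: the function name `parse_book_filename`, the `BookName` fields, and the example filename are all hypothetical, and it assumes `__`, `~~`, `~`, `##` and `#` never appear inside an element.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import List, Optional


@dataclass
class BookName:
    """Elements of a 'year__authors~~title~subtitle##tags.ext' filename."""
    year: str
    authors: List[str]
    title: str
    subtitle: Optional[str]
    tags: List[str]


def parse_book_filename(filename: str) -> BookName:
    """Split a filename following the scheme above into its elements.

    Hypothetical helper: assumes the separators are not used inside
    the year, author, title, subtitle, or tag text.
    """
    stem = PurePath(filename).stem            # drop the extension
    body, _, tag_part = stem.partition("##")  # tags follow '##'
    year, sep, rest = body.partition("__")    # year comes first
    if not sep:
        raise ValueError(f"missing '__' after year: {filename!r}")
    author_part, sep, title_part = rest.partition("~~")
    if not sep:
        raise ValueError(f"missing '~~' before title: {filename!r}")
    title, _, subtitle = title_part.partition("~")
    return BookName(
        year=year,
        authors=author_part.split("_"),
        title=title,
        subtitle=subtitle or None,
        tags=[t for t in tag_part.split("#") if t],
    )


if __name__ == "__main__":
    # Hypothetical filename, made up for illustration.
    print(parse_book_filename("2001__doe_roe~~some-title~a-subtitle##tag1#tag-2.pdf"))
```

Filtering by tag or author then becomes a list comprehension over the parsed results, which is roughly what an editor mode for this scheme would need underneath.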
| donio wrote:
| If your PDFs have Author properties then you might be able to
| do "author:Heidegger" too. The Recoll PDF filter extracts
| some of these fields and if I remember right it can be
| configured to extract additional custom properties too.
| Beijinger wrote:
| Recoll is magic...
| rbanffy wrote:
| I remember having used, a very long time ago, a self-hosted
| search engine on my library of PDFs, and it was unbelievably
| useful.
|
| I dream about a similar thing that can do OCR on scanned docs and
| extract text from my also sprawling library of epub and mobi
| files. If someone builds something like this, with maybe a
| _LOCAL_ LLM to extract text descriptions from photos and movies
| as well as indexing metadata for everything, subtitles from
| movies and lyrics for songs, and add that to a NAS appliance,
| it'd be a killer.
| andyfilms1 wrote:
| Evernote will do this: you can feed it a bunch of PDFs and
| other documents, and it will OCR them and make them all searchable.
| It's not perfect, but you can also add manual tags for things
| you know are important.
| rbanffy wrote:
| At some point I'd love to further train an LLM on all my PDFs
| and be able to ask it questions.
| kordlessagain wrote:
| I have most of the code for doing this. It just needs to be
| rewritten for local storage (I was running it on Google Cloud).
| I need to pick something that doesn't run Solr as a service for
| local use. With Ollama, we have function calls running, so it
| should be doable. I was also thinking about using Open WebUI
| for it.
| glompers wrote:
| DEVONthink 3 [0] (Apple only) will do most of that, although I
| don't keep up at all with its interoperability with LLM
| extensions.
|
| [0] https://www.devontechnologies.com/apps/devonthink
| Bluestein wrote:
| > extract text descriptions from photos and movies as well as
| indexing metadata for everything, subtitles from movies and
| lyrics for songs
|
| The big AI players are probably already scraping the bottom of
| this "barrel" in their search for training data, I am sure ...
| theolivenbaum wrote:
| That's what our app does: curiosity.ai has a local index, support
| for many file types and apps out of the box, and integrated
| local OCR, STT, and even a local LLM.
| Bluestein wrote:
| That is the future :) Much success!
| rmholt wrote:
| I couldn't find any mention on your website about LOCAL LLMs
| and according to your FAQ, it requires an account with your
| website.
|
| Is there a way to run curiosity.ai fully offline, without
| an account on your servers?
| pdw wrote:
| > similar thing that can do OCR on scanned docs
|
| It's only part of what you want, but ocrmypdf will add an OCRed
| text layer to PDF files, making the text selectable and
| indexable.
| importsaas wrote:
| Much better docs here at
| https://getting-started-with-xapian.readthedocs.io/en/latest...
| donio wrote:
| Love Xapian, been using it for many years via notmuch (mail) and
| recoll (document indexing, mainly PDFs in my case).
|
| It's been trouble free and very performant, a real workhorse.
|
| https://notmuchmail.org/
|
| https://www.recoll.org/
___________________________________________________________________
(page generated 2024-08-17 23:01 UTC)