[HN Gopher] Tantivy - full-text search engine library inspired b...
       ___________________________________________________________________
        
       Tantivy - full-text search engine library inspired by Apache Lucene
        
       Author : kaathewise
       Score  : 129 points
       Date   : 2024-05-27 17:30 UTC (5 hours ago)
        
        
       | kaathewise wrote:
       | I was searching for an alternative to Meilisearch (which sends
       | out telemetry by default) and found Tantivy. It's more of a
       | search engine builder, but the setup looks pretty simple [0].
       | 
       | [0]: https://github.com/quickwit-oss/tantivy-cli
        
         | OtomotO wrote:
         | Hm, I am interested, but I would love to use it as a Rust lib
         | and just have Rust types instead of some JSON config...
         | 
         | The Java SDK of Meilisearch was also nice in the same way: no
         | need for a CLI and manual configuration. I just pointed it to a
         | DB entity and indexed whole tables...
         | 
         | Would love that for Tantivy.
        
       | mmastrac wrote:
       | Major props to the authors of this library. I re-built
       | https://progscrape.com [1] on top of it last year, replacing an
       | ancient Python2 AppEngine codebase that I had neglected for a
       | while. It's a great library and insanely fast, as in indexing the
       | entire library of 1M stories on a Raspberry Pi in seconds.
       | 
       | I'm able to host a service on a Pi at home with full-text search
       | and a regular peak load of a few rps (not much, admittedly), with
       | a CPU that barely spikes above a few percent. I've load tested
       | searches on the Pi up to ~100rps and it held up. I keep thinking
       | I should write up my experiences with it. It was pretty much a
       | drop-in, super-useful library and the team was very responsive
       | with bug reports, of which there were very few.
       | 
       | If you want to see how responsive the search is on such a small
       | device, try clicking the labels on each story -- it's virtually
       | instantaneous to query, and this is hitting up to 10 years * 12
       | months of search shards!
       | https://progscrape.com/?search=javascript
       | 
       | I'd recommend looking at it over Lucene for modern projects. I am
       | a big fan, as you might be able to tell. Given how well it scales
       | on a tiny little ARM64, I'd wager your experiences on bigger iron
       | will be even more fantastic.
       | 
       | [1] https://github.com/progscrape/progscrape
        
         | OtomotO wrote:
         | Thanks for that! A couple of days ago I used meilisearch for a
         | quick proof of concept, but I'll check out tantivy again via
         | your repo.
         | 
         | I basically just need a fulltext search.
        
         | snorremd wrote:
         | It is a very nice library. I'm using it for a very much
         | work-in-progress incremental email backup CLI tool for email
         | providers using JMAP.
         | 
         | I wanted users to be able to search their backups. As I'm using
         | Rust, Tantivy looked like just the right thing for the job.
         | Indexing an email happens so fast that I did not bother to move
         | the work to a separate thread. And search across thousands of
         | emails seems to be no problem.
         | 
         | If anyone wants search for their Rust application they should
         | take a look at Tantivy.
        
       | adeptima wrote:
       | Recently found Tantivy inside ParadeDB (a Postgres extension
       | aiming to replace Elastic):
       | 
       | https://github.com/paradedb/paradedb/blob/dev/pg_search/Carg...
       | 
       | after listening to
       | 
       | Extending Postgres for High Performance Analytics (with Philippe
       | Noel) https://www.youtube.com/watch?v=NbOAEJrsbaM
       | 
       | And inside the main thing, Quickwit (logs, traces, and soon
       | metrics): https://github.com/quickwit-oss/quickwit
       | 
       | Had a surprisingly good experience with the combined power of
       | Quickwit and ClickHouse for a multilingual search pet project.
       | Finally something usable for Chinese, Japanese, and Korean:
       | 
       | https://quickwit.io/docs/guides/add-full-text-search-to-your...
       | 
       | to_tsvector in PG never worked well for my use cases
       | 
       | SELECT * FROM dump
       | WHERE to_tsvector('english'::regconfig, hh_fullname)
       |       @@ to_tsquery('english'::regconfig, 'query');
       | 
       | I wish them success. Will automatically upvote any post with
       | Tantivy as a keyword.
        
         | fulmicoton wrote:
         | Thank you so much for sharing!!!
        
       | karmakaze wrote:
       | Another resource is a trigram search index (in Go) used by
       | etsy/hound[0] based on an article (and code) from Russ Cox:
       | Regular Expression Matching with a Trigram Index[1].
       | 
       | [0] https://github.com/hound-search/hound
       | 
       | [1] http://swtch.com/~rsc/regexp/regexp4.html
       | 
       | Different use-cases for alternatives to Lucene depending on your
       | needs.
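
The trigram-index idea mentioned above can be sketched roughly as follows. This is a minimal illustration for literal substring queries only, not the code from hound or from the Russ Cox article; the file names, contents, and the helper names `trigrams` and `find_substring` are made up for the example.

```python
from collections import defaultdict

def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Hypothetical corpus: file name -> contents
files = {
    "a.go": 'func main() { fmt.Println("hello") }',
    "b.go": "package search // trigram index",
    "c.go": "func helper() {}",
}

# Inverted index: trigram -> set of files containing it
index = defaultdict(set)
for name, text in files.items():
    for tri in trigrams(text):
        index[tri].add(name)

def find_substring(needle):
    """Find files containing needle, using the index to prune candidates."""
    tris = trigrams(needle)
    if not tris:
        # Needle shorter than 3 chars: the index cannot help, scan everything
        candidates = set(files)
    else:
        # A matching file must contain every trigram of the needle
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    # The index only yields candidates; confirm each one with a real match
    return sorted(n for n in candidates if needle in files[n])

print(find_substring("func"))     # files whose text contains "func"
print(find_substring("trigram"))
```

The final verification step matters: sharing all trigrams with the query does not guarantee the file contains the query itself, so the index is used only to narrow down what has to be scanned.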
        
       | yencabulator wrote:
       | Beware, you _still_ cannot add/remove fields:
       | https://github.com/quickwit-oss/tantivy/issues/470
       | 
       | The only way to add fields is to reindex all data into a
       | different search index.
        
         | francoismassot wrote:
         | One workaround is to use the JSON field; see the doc:
         | https://github.com/quickwit-oss/tantivy/blob/main/doc/src/js...
        
       | jrh3 wrote:
       | Cheesy logo with a horse
       | 
       | - Their website :)
        
       | leyoDeLionKin wrote:
       | But why not just use a vector database like pgvector?
        
         | FridgeSeal wrote:
         | Because it's a full text search engine, and not a text
         | embedding? Different query types, requirements, indexing
         | methods, etc.
        
         | teraflop wrote:
         | You can think of a full-text index as being like a vector
         | database that's highly specialized and optimized for the use-
         | case where your documents and queries are both represented as
         | "bags of words", i.e. very high-dimensional and very sparse.
         | 
         | Which works great when you want to retrieve documents that
         | _actually_ contain the specific keywords in your search query,
         | as opposed to using embeddings to find something roughly in the
         | same semantic ballpark.
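
The "bags of words" view above can be made concrete with a toy inverted index. This is only a sketch under simplifying assumptions: the sample documents and the names `bag_of_words` and `search` are invented for illustration, tokenization is a plain whitespace split, and scoring is a raw dot product of term counts, whereas real engines typically use a weighting scheme such as BM25.

```python
from collections import Counter, defaultdict

def bag_of_words(text):
    """Sparse vector for a text: term -> count (only non-zero entries)."""
    return Counter(text.lower().split())

# Hypothetical documents
docs = {
    1: "rust full text search library",
    2: "vector database for embeddings",
    3: "full text search in postgres",
}

# Inverted index: term -> {doc_id: term count}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, count in bag_of_words(text).items():
        index[term][doc_id] = count

def search(query):
    """Score docs by the sparse dot product of query and doc term counts."""
    scores = Counter()
    for term, q_count in bag_of_words(query).items():
        # Only docs that actually contain the term are ever touched
        for doc_id, d_count in index.get(term, {}).items():
            scores[doc_id] += q_count * d_count
    # Best score first; ties broken by doc ID
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(search("rust search"))
```

Because the vectors are so sparse, the index never touches documents that share no term with the query, which is why a document with no matching keyword cannot appear in the results at all.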
        
         | kernelsanderz wrote:
         | In practice, a combination of full-text and vector databases
         | often performs better than either type alone. It's called
         | hybrid search. Here's an article that talks a bit about this:
         | https://opster.com/guides/opensearch/opensearch-machine-lear...
         | 
         | Often you take the results from both vector search and lexical
         | search and merge them through algorithms like Reciprocal Rank
         | Fusion.
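
The merge step described above can be sketched with Reciprocal Rank Fusion. This is a minimal illustration, not any particular engine's implementation: the function name `reciprocal_rank_fusion`, the sample document IDs, and the constant `k=60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (each ordered best-first).

    Each document receives 1 / (k + rank) from every list it appears in,
    and documents are ordered by the sum of those contributions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a lexical (full-text) and a vector search
lexical = ["doc_a", "doc_b", "doc_c"]
vector = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([lexical, vector])
print(fused)
```

Only ranks are used, never the raw scores, so the two result lists do not need to be on comparable scales. Documents that appear near the top of both lists end up ahead of documents that appear in only one.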
        
       | blopker wrote:
       | This would be cool to compile to wasm and ship to the browser.
       | Seems like it would give a static site super fast search powers.
        
       | kernelsanderz wrote:
       | Tantivy is also used in an interesting vector database product
       | called LanceDB to provide full-text search capabilities:
       | https://lancedb.github.io/lancedb/fts/
       | Last time I looked it was only available through the Python
       | bindings, though I know they're looking to implement the Rust
       | bindings natively to support other platforms.
        
       ___________________________________________________________________
       (page generated 2024-05-27 23:00 UTC)