[HN Gopher] Tantivy - full-text search engine library inspired b...
___________________________________________________________________
Tantivy - full-text search engine library inspired by Apache Lucene
Author : kaathewise
Score : 129 points
Date : 2024-05-27 17:30 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| kaathewise wrote:
| I was searching for a Meilisearch alternative (which sends out
| telemetry by default) and found Tantivy. It's more of a search
| engine builder, but the setup looks pretty simple [0].
|
| [0]: https://github.com/quickwit-oss/tantivy-cli
| OtomotO wrote:
| Hm, I am interested, but I would love to use it as a Rust lib
| and just have Rust types instead of some JSON config...
|
| The Java SDK of Meilisearch was also nice for the same reason:
| no need for a CLI and manual configuration. I just pointed it
| to a DB entity and indexed whole tables...
|
| Would love that for tantivy
| mmastrac wrote:
| Major props to the authors of this library. I re-built
| https://progscrape.com [1] on top of it last year, replacing an
| ancient Python2 AppEngine codebase that I had neglected for a
| while. It's a great library and insanely fast, as in indexing the
| entire library of 1M stories on a Raspberry Pi in seconds.
|
| I'm able to host a service on a Pi at home with full-text search
| and a regular peak load of a few rps (not much, admittedly), with
| a CPU that barely spikes above a few percent. I've load tested
| searches on the Pi up to ~100rps and it held up. I keep thinking
| I should write up my experiences with it. It was pretty much a
| drop-in, super-useful library and the team was very responsive
| with bug reports, of which there were very few.
|
| If you want to see how responsive the search is on such a small
| device, try clicking the labels on each story -- it's virtually
| instantaneous to query, and this is hitting up to 10 years * 12
| months of search shards!
| https://progscrape.com/?search=javascript
|
| I'd recommend looking at it over Lucene for modern projects. I am
| a big fan, as you might be able to tell. Given how well it scales
| on a tiny little ARM64, I'd wager your experiences on bigger iron
| will be even more fantastic.
|
| [1] https://github.com/progscrape/progscrape
| OtomotO wrote:
| Thanks for that! A couple of days ago I used Meilisearch for a
| quick proof of concept, but I'll check out tantivy again via
| your repo.
|
| I basically just need a fulltext search.
| snorremd wrote:
| It is a very nice library. I'm using it for a very much
| work-in-progress incremental email backup CLI tool for email
| providers using JMAP.
|
| I wanted users to be able to search their backups. As I'm using
| Rust, Tantivy looked like just the right thing for the job.
| Indexing an email is so fast that I did not bother to move
| the work to a separate thread. And search across thousands of
| emails seems to be no problem.
|
| If anyone wants search for their Rust application they should
| take a look at Tantivy.
| adeptima wrote:
| I recently found Tantivy inside ParadeDB (a Postgres extension
| aiming to replace Elastic)
|
| https://github.com/paradedb/paradedb/blob/dev/pg_search/Carg...
|
| after listening to
|
| Extending Postgres for High Performance Analytics (with Philippe
| Noel) https://www.youtube.com/watch?v=NbOAEJrsbaM
|
| And inside the main thing, Quickwit (logs, traces, and soon
| metrics) https://github.com/quickwit-oss/quickwit
|
| Had a surprisingly good experience with the combined power of
| Quickwit and ClickHouse for a multilingual search pet project.
| Finally something usable for Chinese, Japanese, and Korean.
|
| https://quickwit.io/docs/guides/add-full-text-search-to-your...
|
| to_tsvector in PG never worked well for my use cases
|
| SELECT * FROM dump WHERE to_tsvector('english'::regconfig,
| hh_fullname) @@ to_tsquery('english'::regconfig, 'query');
|
| I wish them success. I will automatically upvote any post with
| Tantivy as a keyword.
| fulmicoton wrote:
| Thank you so much for sharing!!!
| karmakaze wrote:
| Another resource is a trigram search index (in Go) used by
| etsy/hound[0] based on an article (and code) from Russ Cox:
| Regular Expression Matching with a Trigram Index[1].
|
| [0] https://github.com/hound-search/hound
|
| [1] http://swtch.com/~rsc/regexp/regexp4.html
|
| Different use-cases for alternatives to Lucene depending on your
| needs.
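
To make the trigram idea above concrete, here is a minimal Python sketch. It is not the code from hound or from Russ Cox's article, and it only handles literal substring queries (the article covers full regular expressions). The function names and the sample files are hypothetical. The index maps each three-character sequence to the documents containing it; a query intersects those sets to get candidates, then verifies each candidate with a real substring check.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def substring_search(index, docs, needle):
    """Find documents containing needle, using trigrams as a prefilter."""
    grams = trigrams(needle)
    if not grams:
        # Needle is shorter than 3 characters: no filter, scan everything.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all trigrams does not guarantee a match, so verify each one.
    return sorted(d for d in candidates if needle in docs[d])


# Hypothetical sample corpus, for illustration only.
docs = {
    "a.go": "func main() { fmt.Println(1) }",
    "b.go": "func helper() int { return 1 }",
    "c.py": "print('main')",
    "d.txt": "abc bcd",
}
index = build_trigram_index(docs)
main_hits = substring_search(index, docs, "main")
func_hits = substring_search(index, docs, "func ")
false_candidate = substring_search(index, docs, "abcd")
```

The last query shows why the verification step matters: `d.txt` contains both trigrams of `abcd` but not the string itself, so it is a candidate that gets filtered out.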
| yencabulator wrote:
| Beware, you _still_ cannot add/remove fields:
| https://github.com/quickwit-oss/tantivy/issues/470
|
| The only way to add fields is to reindex all data into a
| different search index.
| francoismassot wrote:
| One workaround is to use the JSON field, see doc
| https://github.com/quickwit-oss/tantivy/blob/main/doc/src/js...
| jrh3 wrote:
| Cheesy logo with a horse
|
| - Their website :)
| leyoDeLionKin wrote:
| But why not just use a vector database like pgvector?
| FridgeSeal wrote:
| Because it's a full text search engine, and not a text
| embedding? Different query types, requirements, indexing
| methods, etc.
| teraflop wrote:
| You can think of a full-text index as being like a vector
| database that's highly specialized and optimized for the use
| case where your documents and queries are both represented as
| "bags of words", i.e. very high-dimensional and very sparse.
|
| Which works great when you want to retrieve documents that
| _actually_ contain the specific keywords in your search query,
| as opposed to using embeddings to find something roughly in the
| same semantic ballpark.
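
As a rough illustration of the "bags of words" view described above, here is a minimal Python sketch of an inverted index. It is a toy, not how Tantivy or Lucene are implemented (no scoring, no compression, no segments), and the tokenizer, function names, and sample documents are all assumptions made for the example. Each term maps to the set of documents containing it, and a query returns only documents that contain every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Tantivy is a full-text search engine library",
    2: "pgvector stores embeddings for semantic search",
    3: "A search engine builder written in Rust",
}
index = build_index(docs)
hits = search_all_terms(index, "search engine")
no_hits = search_all_terms(index, "semantic engine")
```

Document 2 mentions "search" but not "engine", so it is excluded from `hits`: the match is on the literal keywords, not on semantic closeness.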
| kernelsanderz wrote:
| In practice, a combination of full-text and vector search
| often performs better than either type alone.
| It's called hybrid search. Here's an article that talks a bit
| about this:
| https://opster.com/guides/opensearch/opensearch-machine-lear...
|
| Often you take the results from both vector search and lexical
| search and merge them through algorithms like Reciprocal Rank
| Fusion.
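
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal Python sketch of the merge step mentioned above. Each document scores the sum of 1 / (k + rank) over every result list it appears in. The constant k = 60 is the value commonly used in the literature, not something stated in this thread, and the document ids and function name are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a lexical engine and a vector engine.
lexical = ["doc_a", "doc_b", "doc_c"]
vector = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([lexical, vector])
```

Because the formula uses only ranks, it needs no normalization between the lexical scores and the vector distances, which is a large part of its appeal.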
| blopker wrote:
| This would be cool to compile to wasm and ship to the browser.
| Seems like it would give a static site super fast search powers.
| kernelsanderz wrote:
| Tantivy is also used in an interesting vector database product
| called LanceDB - https://lancedb.github.io/lancedb/fts/ to
| provide full-text search capabilities. Last time I looked it was
| only through the Python bindings, though I know they're looking
| to implement the Rust bindings natively to support other
| platforms.
___________________________________________________________________
(page generated 2024-05-27 23:00 UTC)