[HN Gopher] Sonic: Fast, lightweight and schema-less search backend
___________________________________________________________________
Sonic: Fast, lightweight and schema-less search backend
Author : rcarmo
Score : 479 points
Date : 2022-10-24 11:17 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| pavelevst wrote:
| It would be nice if it could be a replacement for the logging stack;
| Elastic is super hungry for RAM.
| IceWreck wrote:
| Look at zincsearch. It's another lightweight elastic alternative
| and they advertise logging as a use case.
| francoismassot wrote:
| You can have a look at Quickwit (https://quickwit.io), it's a
| search engine made for logs :). It's still pretty young and...
| there are way less features than in ES.
|
| (disclaimer: I'm one of the cofounders)
| marginalia_nu wrote:
| * We imported ~1,000,000 messages of dynamic length (some very
| long, eg. emails);
|
| * Once imported, the search index weights 20MB (KV) + 1.4MB (FST)
| on disk;
|
| This is almost unbelievably succinct! If you encode the document
| features into 8 bits per document, and thus completely forego the
| need to store the document ID by indexing them implicitly, that
| alone is 1 MB.
|
| Getting meaningful search out of on average 21 bytes per document
| is seriously impressive.
|
| [For reference, this sentence is 42 bytes.]
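|
| The arithmetic behind those figures can be checked with a short
| sketch. The sizes are the ones quoted from the README above; treating
| 1 MB as 1,000,000 bytes is an assumption.

```python
# Figures quoted from the Sonic README in the comment above.
documents = 1_000_000
kv_bytes = 20_000_000   # 20 MB key-value store (assuming 1 MB = 10**6 bytes)
fst_bytes = 1_400_000   # 1.4 MB FST

# Average index footprint per imported message.
bytes_per_document = (kv_bytes + fst_bytes) / documents

# Cost of spending exactly 8 bits (one byte) on every document.
one_byte_per_document_mb = documents * 1 / 1_000_000

print(f"{bytes_per_document:.1f} bytes per document")
print(f"{one_byte_per_document_mb:.0f} MB at 8 bits per document")
```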
| [deleted]
| mattb314 wrote:
| Wonder if this has anything to do with the sliding window:
|
| > Sonic only keeps the N most recently pushed results for a
| given word, in a sliding window way (the sliding window width
| can be configured)
|
| Default window looks like 1k documents. I read this as saying
| that super common words are basically dropped from the index
| (only 1k out of many thousands of docs retained), but I don't
| know enough about the internals to be sure. Not sure if this
| actually hurts search results in practice, seems like an ok
| trade off for help docs at least.
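|
| A minimal sketch of how such a per-word sliding window could behave,
| assuming a plain in-memory inverted index. The class and method names
| are hypothetical and this is not Sonic's actual implementation; it
| only illustrates why very common words lose their older documents.

```python
from collections import defaultdict, deque


class SlidingWindowIndex:
    """Toy inverted index that keeps only the newest `window` doc IDs per word."""

    def __init__(self, window=1000):
        self.window = window
        # A deque with maxlen silently drops the oldest entry when full.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        # Index each distinct word of the document once.
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def query(self, word):
        # Newest documents first, since nothing but recency is stored.
        return list(reversed(self.postings.get(word.lower(), ())))


index = SlidingWindowIndex(window=3)
index.push(0, "the rare word")
for doc_id in range(1, 5):
    index.push(doc_id, "the common text")

print(index.query("the"))   # document 0 has been pushed out of the window
print(index.query("rare"))  # rare words keep all of their documents
```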
| nightpool wrote:
| I wonder how easy it would be to change "most recently
| pushed" to something like a redis sorted set where each
| document has a score and only the top N results are retained
| when sorted by their separate score value? That would allow
| you to sort by pageviews / popularity in a more useful way.
| But it fails entirely when looking for uncommon intersections
| of common words, which feels like it makes it useless for
| most actual full-text search use-cases :(
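|
| A rough sketch of the "top N by score" variant, assuming each pushed
| document carries a numeric score such as pageviews. It uses a bounded
| min-heap rather than a real Redis sorted set, and the names are
| hypothetical.

```python
import heapq


class TopNPostings:
    """Keeps the N highest-scored documents for one word."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (score, doc_id); lowest score at index 0

    def push(self, doc_id, score):
        entry = (score, doc_id)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Evict the current lowest-scored document.
            heapq.heapreplace(self._heap, entry)

    def results(self):
        # Highest score first.
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


postings = TopNPostings(n=2)
for doc_id, pageviews in [("a", 10), ("b", 500), ("c", 50), ("d", 5)]:
    postings.push(doc_id, pageviews)
print(postings.results())
```

| As with the recency window, anything outside the top N is gone, so
| intersections of common words can still come back empty.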
| 411111111111111 wrote:
| It's definitely a great trade-off to make for efficiency,
| but makes it inherently unusable for most of Elasticsearch's
| use cases.
|
| Looking at it from a practical example such as log search
| (almost everyone I know has used
| kibana/logstash/elasticsearch at some point): you'd be able
| to search for things like tracingId/requestId but adding more
| filters such as logLevel, requestType or serviceName would be
| impossible
|
| It has its niche, but calling it an elasticsearch
| alternative really is a stretch
| rabuse wrote:
| Also the ability to weight fields when fetching results to
| boost relevancy, which is needed for a lot of my use cases.
| syrusakbary wrote:
| Long ago I was searching for lightweight search engines that could
| run on the Edge, as ElasticSearch -while very popular- is also
| quite heavy and relies on the Lucene/JVM.
|
| Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]...
| all delightfully made in Rust. My favorite, and the closest one
| to ElasticSearch (for its features) is probably Tantivy.
|
| I'd recommend anyone to check out these three projects and choose
| what best fits your needs... it's awesome to see that more
| projects are becoming available by the day!
|
| [1]: https://github.com/quickwit-oss/tantivy
|
| [2]: https://github.com/meilisearch/meilisearch
| codedokode wrote:
| There is also sphinx search which was open source before 3.0
| version.
| snikolaev wrote:
| And its open source continuation - Manticore Search [1]
|
| [1] https://manticoresearch.com/
| croes wrote:
| Do they support document access control like ES does?
| sanxiyn wrote:
| Yes, Meilisearch supports ES-like document access control.
| alserio wrote:
| I've looked up tantivy and quickwit. Quickwit uses tantivy as
| the engine. It has decoupled storage (awesome, only recently
| elastic announced something comparable) but is oriented towards
| log processing and explicitly warns against its use to power a
| user-facing site search. Do you happen to know if there's
| anything like that with the same minimal footprint that can
| scale up and, importantly, down to serve the needs of highly
| variable traffic websites? Right now I'm looking at something
| with clustering capabilities and decoupled storage (e.g on s3)
| like quickwit
| francoismassot wrote:
| One of the reasons for not using Quickwit for user facing
| search is the latency: for example, you pay 70ms of latency
| when you make a request on AWS S3... and generally you expect
| latency below that figure. Decoupling compute and storage
| while keeping a very low latency may be then impossible
| unless ending up by caching all your data on disk :).
|
| You can have a look at lnx (https://lnx.rs/) that is based on
| tantivy and is performing quite well. It's not yet
| distributed but the author Chillfish8 has some thoughts about
| how to do it.
| alserio wrote:
| Thank you! I'll look into it
| dewey wrote:
| Another interesting alternative:
| https://github.com/meilisearch/meilisearch - I'm using it in one
| of my (small) projects and I had a good experience with it, also
| very helpful community.
| Thaxll wrote:
| So I can use that to inject millions of logs daily and it will do
| sharding and rebalancing automatically?
| sanxiyn wrote:
| No. Sonic is a single node server and not distributed.
| daitangio wrote:
| Nice. I have done some tests with SQLite, and I find its index
| module very interesting, also because it offers stemming, which
| seems to be missing here: am I wrong?
|
| SQLite has stemming only for English out of the box, but I find
| it quite necessary for a good ES drop-in replacement.
|
| My two cents
| rcarmo wrote:
| It is great and works, but sonic has broader applications (I
| found it because it was actually being used as a way to index
| an existing SQLite database that pointed to file storage).
| PedroBatista wrote:
| While I get the wants-and-needs since ElasticSearch has a
| voracious appetite for RAM, I get the feeling most people think
| search engines are a simple thing where you can just import some
| lib, fool around for a bit and call it a day.
|
| The truth is that ElasticSearch/Solr/Lucene is orders of
| magnitude more complex and powerful than these "alternatives".
| All this is mostly fine as long as everyone is on the same page
| regarding the expectations.
|
| Most people don't need ElasticSearch for their use cases on the
| surface, but I feel they expect top-notch mind-reading results
| and that requires something like ElasticSearch and someone who
| knows the field.
|
| Having said all of that, Meilisearch and this are quite fine.
| keyle wrote:
| Yeah there needs to be some kind of acid test that will compare
| these products on equal footing and show the pitfalls.
| DeathArrow wrote:
| Here is a performance benchmark:
| https://db-benchmarks.com/test-hn/#manticore-search-columnar...
| jasfi wrote:
| That would be great. However if you wanted to benchmark
| relevance ranking, how would you do that?
| sanxiyn wrote:
| You need a dataset and an evaluation metric. The usual
| evaluation metric is NDCG(Normalized Discounted Cumulative
| Gain): https://en.wikipedia.org/wiki/NDCG
|
| An example dataset is BEIR(BEnchmarking Information
| Retrieval), published in NeurIPS 2021:
| https://github.com/beir-cellar/beir
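|
| For readers unfamiliar with the metric, here is a minimal NDCG
| sketch. It assumes graded relevance labels listed in the order the
| engine returned the documents, and it computes the ideal ranking from
| those same labels only, which is a simplification. BEIR's own tooling
| is not used here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: later ranks are discounted logarithmically."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG of the returned order divided by the DCG of the best possible order."""
    ranked = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance labels (3 = best) in the order two hypothetical engines returned them.
print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([1, 2, 3]))  # best document ranked last, so the value is lower
```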
| sanxiyn wrote:
| This is very very difficult, but Tantivy tried: see
| https://github.com/quickwit-oss/search-benchmark-game
| ilyt wrote:
| Spivak wrote:
| I think the upshot is that if you have no idea what all the
| advanced features of ES even are then you probably don't need
| ES because it's not turnkey.
|
| If you utter the phrase "I just want search" then it really is
| a matter of just using one of these lightweight projects and
| libs because your needs are simple.
| alessmar wrote:
| I would like to suggest https://typesense.org/ It has some
| features that make it a better choice than Meilisearch
| paraboul wrote:
| Can you elaborate on said features?
|
| I migrated from typesense to Meilisearch on a project after I
| found it had much better search accuracy. I can't exactly
| explain why, but overall Meilisearch results feel more
| relevant by default.
| jabo wrote:
| I work on Typesense. Mind if I ask which version of
| Typesense and Meilisearch you tried this on? And if this
| was on some public dataset I can use?
|
| I'd love to take a closer look.
| paraboul wrote:
| Hey jabo,
|
| I migrated in April 2021 (latest version of typesense &
| meilisearch at that time).
|
| I don't have a public dataset as it was a fairly large
| ecommerce catalog with close to ~500k entries. And again,
| it was just my own perception which is hard to define. I
| just found that Typesense was a bit off compared to
| Meilisearch on search accuracy, and of course could
| totally be different today with a more recent release.
| jabo wrote:
| Got it, thank you for sharing that. Typesense was at
| v0.19.0 around that time. Two prominent issues we had in
| that version were how we handled matches across multiple
| fields and how we handled "keyword stuffing".
|
| We're now at v0.24.rc, and we've iterated quite a lot on
| improving relevancy since then, as more users shared
| their datasets with us and gave us feedback over the last
| 1.5 years.
|
| If you get a chance to try out Typesense again in the
| future, I'd love to hear how relevance feels with the
| latest version, out of the box for your dataset.
| snikolaev wrote:
| There are actually benchmarks that allow measuring search
| relevancy objectively, e.g. BEIR[1]. The Manticore Search team
| made an effort to submit a PR to include it in the list. The
| results are here [2]. Unfortunately the BEIR team seems to
| be too busy to review a whole pile of PRs, including one about
| Vespa. Nevertheless it would be nice to have both
| Meilisearch and Typesense there too since it's interesting
| what performance those non-tf-idf based search engines
| would show compared to BM25-based and vector search
| engines.
|
| [1] https://github.com/beir-cellar/beir
| [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...
| eric4smith wrote:
| What about relevancy?
|
| There's not much mention of that. I'm always on the lookout for
| something lightweight that improves on PostgreSQL full text.
| sanxiyn wrote:
| Sonic doesn't do any ranking other than latest first.
| eric4smith wrote:
| Ouch
| giancarlostoro wrote:
| Now if it were drop-in capable and still more efficient, that
| would be impressive and I would count the days until Elastic buys
| you out.
| cies wrote:
| Other:
|
| https://www.meilisearch.com/
|
| https://github.com/quickwit-oss/tantivy
|
| https://github.com/toshi-search/Toshi
|
| https://github.com/typesense/typesense
| didip wrote:
| Somewhat related, this guy: https://github.com/mosuka/ seems to
| be very passionate about search service.
|
| He built two distributed search services:
|
| - https://github.com/mosuka/phalanx, written in Go.
|
| - https://github.com/mosuka/bayard, written in Rust.
| erikcw wrote:
| One of the features I like in ES that I haven't seen in
| alternatives is "Percolate queries" (queries where you feed the
| service a document and it returns a list of queries that you've
| indexed that would match that document - basically inverting the
| whole process).
|
| Does anyone know of any alternatives that support this use case?
|
| https://www.elastic.co/guide/en/elasticsearch/reference/mast...
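|
| To make the inversion concrete, here is a toy sketch of the idea,
| assuming stored queries are plain "all of these words" conjunctions.
| The class is hypothetical and is not the Elasticsearch or Manticore
| API; real percolators support full query languages.

```python
class Percolator:
    """Stores queries, then reports which of them match an incoming document."""

    def __init__(self):
        self.queries = {}  # query_id -> set of required terms

    def register(self, query_id, query_text):
        self.queries[query_id] = set(query_text.lower().split())

    def percolate(self, document):
        words = set(document.lower().split())
        # A stored query matches when all of its terms occur in the document.
        return sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms <= words
        )


percolator = Percolator()
percolator.register("rust-search", "rust search")
percolator.register("es-alert", "elasticsearch outage")
print(percolator.percolate("Sonic is a search backend written in Rust"))
```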
| snikolaev wrote:
| Yes. Manticore Search does. Here's an interactive course[1]
| about it, it's a little bit outdated though. More info in the
| docs[2]
|
| [1] https://play.manticoresearch.com/pq [2]
| https://manual.manticoresearch.com/Creating_an_index/Local_i...
| thedougd wrote:
| Just another plug for Lucene or the library route. I had a simple
| use case to offer a search/autocomplete API for the employee
| directory of ~50,000 records. The source of truth was only
| updated once a day. We ran a job that reindexed daily and
| published the index as a file (< 15 megabytes) to where the
| service could access it.
|
| That service worked beautifully. Results were returned in 10-20ms
| and we only ever made software updates to handle the occasional
| CVE. It did, however, take quite a bit of fiddling initially to
| get the query results to match the user expectations. For
| example, weighting first vs last vs full name.
| codedokode wrote:
| I am not sure if it can be called an "alternative". ElasticSearch
| has thousands of features and settings while this library seems
| to be just a simple inverted index implementation only for text
| search.
|
| By the way if you are looking for a lightweight "alternative" to
| ElasticSearch you might look at sphinx search engine (although it
| doesn't have as many features as ES has and it has become
| closed-source since version 3.0).
| snikolaev wrote:
| > you might look at sphinx search engine
|
| Manticore Search [1] forked from the latest open source version
| and has been continually improved for more than 5 years.
|
| [1] https://manticoresearch.com/
|
| > although it doesn't have as many features as ES has
|
| Manticore, unlike Sphinx, is much closer to Elasticsearch in
| terms of feature set.
| 9dev wrote:
| Every time someone comes up with an alternative to a software
| behemoth like Elasticsearch, what they actually mean is: "An
| alternative to the 10% of functionality of $tool _that are
| interesting to me_ ".
|
| This is surely an impressive engineering feat, but hardly a
| replacement for the myriad of query possibilities Elasticsearch
| offers.
| coldtea wrote:
| "...an alternative to a software behemoth like Elasticsearch, what
| they actually mean is: "An alternative to the 10% of functionality
| of $tool that are interesting to me"
|
| Which is perfectly fine. A lot of tools become so general and
| bloated, that there are large groups that would be fine with
| many different 10% subsets of their features...
|
| Kind of like how I don't need MS Word or OpenOffice Write, any
| simple text editing program with a few basic features (like
| printing, bold/italics, and word count) will do for my needs...
| 9dev wrote:
| I'm not opposed to that, however, the chance of _their 10%_
| and _my 10%_ overlapping is rather slim. Just like you only
| need basic formatting, and I require footnotes in my
| documents. Nothing wrong with either, but I'd be upset if
| you tried to sell me GEdit as a replacement for OpenOffice
| Write.
| manigandham wrote:
| True, but most deployments are also just generic searching of
| records like Algolia rather than using all the low-level
| functionality.
|
| Typesense is probably the most complete competitor in that
| regard: https://typesense.org/
|
| Other alternatives here:
| https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
| jethro_tell wrote:
| well, and how do you solve excessive RAM usage for a search
| engine? Generally you write the indexes/search trees to disk,
| which may or may not be ideal.
| GrinningFool wrote:
| From the opening line of the README:
|
| "Sonic can be used as a simple alternative to super-heavy and
| full-featured search backends such as Elasticsearch in some
| use-cases."
|
| Seems pretty up-front about it, and doesn't claim to be a full-
| featured alternative.
| lolinder wrote:
| Agreed, they do a good job of hedging it. I think OP was
| probably pre-empting the usual comments along the lines of
| "yep, $tool is super bloated, $smallerTool proves that those
| other guys building $tool are bad engineers."
| marginalia_nu wrote:
| To be fair, there are often better reasons to replace only a
| portion of ES' functionality, since doing so can save a lot of
| computation and space, than to replace ES itself, since it
| already exists and does a good job if what you need is the full
| kit.
|
| I found myself last week reimplementing 10% of RoaringBitmap's
| functionality as a homebrew replacement, because doing so was
| 500% faster. Not that RB isn't great, but it's designed for a
| general problem space, and not my particular problem.
| tensor wrote:
| My guess is that the majority of people using ES could actually
| use something simpler like this.
| jamil7 wrote:
| Agreed, although to be fair to the actual author (assuming they
| didn't post this here) the readme is a lot more upfront about
| its capabilities.
|
| > Sonic can be used as a simple alternative to super-heavy and
| full-featured search backends such as Elasticsearch in some
| use-cases.
| rlex wrote:
| and almost all of them won't offer an ES-compatible API. Off the
| top of my head I can think of manticore
| (https://manticoresearch.com/), which offers at least a subset of
| the elasticsearch API
| sanxiyn wrote:
| Quickwit does too (a subset anyway):
| https://github.com/quickwit-oss/quickwit
| papruapap wrote:
| tbh "Text Search" is a vague description for this kind of
| software, so I guess everyone goes with elasticsearch-like.
| RicoElectrico wrote:
| Honestly most of the "alternative to" programs do not meet
| expectations they set by dropping a big known name. So much so,
| that I think people are doing FOSS disservice by comparing to
| those who they can't meaningfully overtake.
|
| The only exceptions could be small single feature utilities.
| graftak wrote:
| To me it seems the "alternative to" part is more damaging in
| that sense than dropping a big name. The name is used to put
| a complicated piece of software in a context many people are
| familiar with. The same thing happens with the
| "Tinder/Uber/Airbnb for <x>..." type of services.
|
| The friction is introduced where it's not made crystal clear
| how it's similar, and which concept are different or missing
| altogether. Then it will cause unmet expectations.
|
| Perhaps it's better to say "inspired by ..." or "similar to
| ..." to make a more precise statement.
| ianbutler wrote:
| I don't think your opinion is wrong, but I do think
| ElasticSearch has a lot of features that many people consider
| bloat depending on their work, and scaling and doing general
| dev ops for ES can be an absolute slog. Light weight
| alternatives that cut down to a set of core features for some
| niche seem like a good idea to me.
| 9dev wrote:
| It's totally fine that many people consider stuff bloat, but
| other people don't. I've built a highly specialised search
| engine for manufacturing companies on top of Elasticsearch,
| and I _decidedly_ need vector queries, TF-IDF queries,
| geospatial range queries, and heaps of other, niche features
| you probably never used before.
|
| Having a lightweight search engine is fine, but calling it an
| alternative to Elasticsearch is not doing either justice.
| ianbutler wrote:
| That's very assumptive of you. I have in fact used most of
| those features, and note I said their opinion was not
| wrong. In their readme they said it's a replacement for
| some use cases which is upfront and fine.
|
| Vector queries aren't niche, Elastic however only tacked on
| a proper (non HNSW) implementation in the last year and a
| half. Geospatial isn't niche, anyone working with location
| data will work with those queries. TF-IDF is a basic
| ranking algo / signal.
|
| Maybe Elasticsearch is good for you because they have all
| their features in aggregate. But I can name a tool that
| focuses specifically on each area and query type and is
| better for that specific subset of functionality.
|
| So my point still stands, if all you need are specific
| features Elastic is too much. You need all of it and that's
| fine too.
| sanxiyn wrote:
| I mean, Sonic doesn't store term frequency at all, so it
| can't do TF-IDF. It probably doesn't want to. If you need
| any ranking other than latest first, Sonic is not for
| you.
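|
| For context on what is being given up, here is a bare-bones TF-IDF
| scoring sketch. It assumes whitespace tokenization and the plain
| tf * log(N / df) weighting; Lucene and Elasticsearch use more refined
| formulas. Note that it needs per-document term counts, which is the
| data Sonic does not store.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score every document against the query with plain tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    total = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        scores.append(
            sum(
                tf[term] * math.log(total / df[term])
                for term in query.lower().split()
                if df[term]  # skip query terms that never occur
            )
        )
    return scores


docs = ["sonic sonic search", "elasticsearch search", "postgres"]
print(tf_idf_scores("sonic", docs))
```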
| _tom_ wrote:
| The problem with subsets is everyone wants a different
| subset. It's why popular software almost always bloats.
| Everyone wants some different features.
| pbowyer wrote:
| > But I can name a tool that focuses specifically on each
| area and query type and is better for that specific
| subset of functionality.
|
| Please do name them, because I for one would like to
| never run ElasticSearch again for faceted, full-text and
| specialised search.
| osigurdson wrote:
| I agree, but ES should re-write their core engine to be more
| lightweight, otherwise a viable competitor will emerge.
| snorremd wrote:
| Projects like Meili Search are already coming for Elastic
| Search's lunch: https://www.meilisearch.com. I think there is
| a market for fast, lightweight alternatives like Meili that
| offer up a fully featured open source experience.
|
| With Elastic Search many of the features, security being one,
| are locked away behind commercial licenses. With Meili it
| seems they are, for the time being anyway, going with a
| proper open source version. I understand Elastic needs to
| earn money, and I get their licensing model to accomplish
| this. But Meili will probably steal away a good portion of
| customers interested in self hosting their search solution.
| osigurdson wrote:
| I'm not sure what this competitor will be but > ES will
| have the following properties:
|
| - written in rust or maybe just C
|
| - extremely lightweight and high performance
|
| - single small binary that runs anywhere
|
| - designed to run in Kubernetes from the ground up
|
| - scales dynamically up/down
|
| - zero downtime upgrades
|
| - rigorous security built into the core offering
|
| - fully open source
|
| - wire compatibility with ES
|
| I hope that ES themselves do this. There are pretty
| significant barriers to creating a serious competitor to ES
| (unlike something like MongoDB for example which seems to
| have a very limited role in the future).
| felipellrocha wrote:
| That is exactly what they are, and I don't think they hide it?!
| So, I don't know what the issue is. This is the kind of
| innovation that keeps us moving forward.
| atesti wrote:
| >Also, Sonic only keeps the N most recently pushed results for a
| given word, in a sliding window way (the sliding window width can
| be configured)
|
| Does this mean that it only ever finds at most N documents per
| word? Even searches for "A and B" would probably not find
| everything, even if less than N documents contain A and B,
| because they might have been removed with the sliding window
| already for A or B alone. Is that correct?
| sanxiyn wrote:
| As far as I can tell, yes, this is correct.
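|
| A tiny sketch of that failure case, assuming each word simply keeps
| its N most recent document IDs and "A and B" is answered by
| intersecting the two retained lists. The window size and documents
| are made up for illustration.

```python
from collections import deque

WINDOW = 2  # hypothetical, tiny so the effect is easy to see

# Documents in push order, each listed with the words it contains.
docs = {
    1: ["a", "b"],
    2: ["a"],
    3: ["a"],
    4: ["b"],
}

postings = {"a": deque(maxlen=WINDOW), "b": deque(maxlen=WINDOW)}
for doc_id, words in docs.items():
    for word in words:
        postings[word].append(doc_id)  # the oldest ID falls out when full

# What the windowed index can still answer for "a AND b".
retained = set(postings["a"]) & set(postings["b"])

# What a complete index would answer.
truth = {d for d, words in docs.items() if "a" in words and "b" in words}

print(sorted(retained), sorted(truth))
```

| Only one document contains both words, which is fewer than the
| window size, yet it is lost because it slid out of the window for
| "a" alone.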
| Aeolun wrote:
| Huh? Yeah. I can keep my index size down by throwing results
| away as well.
|
| Every time you think it's somehow magic, someone has to dump a
| bucket of cold water over your head.
| marsven_422 wrote:
| eerikkivistik wrote:
| About 2 weeks ago, I was searching for an alternative to Elastic
| for this exact use case. Funny how the world works, now I have my
| answer: "someone has built it".
| habibur wrote:
| First thing I looked for is how long it takes to delete a
| document from the index.
|
| Looks like it rebuilds the whole index periodically and that's
| very processor intensive. The delete will be reflected after a
| rebuild.
| IYasha wrote:
| But does it scale?
| sanxiyn wrote:
| No, it doesn't.
| keroro wrote:
| There's also Meilisearch, which is another elasticsearch
| alternative written in rust.
|
| Comparison to elasticsearch:
| https://docs.meilisearch.com/learn/what_is_meilisearch/compa...
|
| Github: https://github.com/meilisearch/meilisearch
|
| Website: https://www.meilisearch.com/
| mhitza wrote:
| The readme doesn't offer enough information to accept that it can
| be an alternative to elasticsearch. From what I can gather by
| skimming the information, it can only do word level matching and
| that it isn't some form of TF-IDF type index (as is Lucene, which
| stands behind Solr/ElasticSearch).
| sanxiyn wrote:
| Yes, it doesn't do any ranking at all. Results are returned in
| the reverse order of indexing.
| vlovich123 wrote:
| Using a 32 bit ID is an interesting choice. It means you can only
| index about 4 billion (2^32) documents per bucket. I wonder if
| using a varint encoding
| would give you even more savings while handling > 4 billion
| documents at the cost of a bit more expensive
| serialization/deserialization cost (which should be negligible in
| the grand scheme of everything else being done).
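|
| A minimal sketch of the varint idea, assuming the common LEB128-style
| layout (7 payload bits per byte, high bit set on every byte except
| the last). This is not Sonic's on-disk format, just an illustration
| of the trade-off.

```python
def encode_varint(value):
    """Encode a non-negative integer using 7 payload bits per byte."""
    if value < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)


def decode_varint(data):
    """Return (value, number_of_bytes_consumed)."""
    result = 0
    shift = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated varint")


for doc_id in (1, 300, 2**28 - 1, 2**32):
    print(doc_id, len(encode_varint(doc_id)), "bytes")
```

| Small IDs shrink to one or two bytes and IDs beyond 2^32 become
| representable, but any ID of 2^28 or more takes at least five bytes,
| one more than a fixed 32-bit field.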
| speps wrote:
| Does anyone know of an alternative for the time series side of
| Elastic?
| gkorland wrote:
| You might want to check Redis-Stack -
| https://redis.io/docs/stack. It's a stack on top of Redis,
| which comes bundled with RedisTimeSeries, RediSearch, and
| RedisJSON (also includes RedisGraph and RedisBloom).
| snikolaev wrote:
| Manticore Search. Here's a blog post with detailed comparison
| [1]
|
| [1] https://manticoresearch.com/blog/manticore-alternative-to-el...
| pipeline_peak wrote:
| If they keep introducing hipster names like Deno and Sonic, no
| one will know what anything means anymore.
| endisneigh wrote:
| I wish someone would write a full text engine that supports
| pluggable storage engines.
| ilyt wrote:
| AndrewKemendo wrote:
| If anyone has been successful compiling this with VSCode on Win10
| please let me know how you get CLANG/LLVM to play nicely with
| VSCode.
|
| I'd like to avoid compiling LLVM from source if I can
| hardwaresofton wrote:
| Wow it's weird that this comes up, I'm actually running a site I
| am going to repost to HN today that I want to use as a testbed
| for search engines (kind of like an extension to my recent
| collaboration with supabase[0]).
|
| Right now I've got the site going on just Postgres FTS + trigram
| and it's pretty darn fast, looks like I need to test sonic too.
|
| Going to burn some midnight oil (in my timezone, anyway) and get
| it out -- though sonic isn't implemented yet!
|
| Anyway to make this comment useful to people, here's my short
| list of engines that I want to run in parallel:
|
| - MeiliSearch (https://github.com/meilisearch/MeiliSearch)
|
| - TypeSense (https://github.com/typesense/typesense)
|
| - Lyra (https://github.com/LyraSearch/lyra)
|
| - OpenSearch (https://github.com/opensearch-project/OpenSearch)
|
| - ZincSearch (https://github.com/prabhatsharma/zinc)
|
| - Sonic (https://github.com/valeriansaliou/sonic)
|
| There isn't enough out there comparing all these for the simple
| typical fuzzy search/search box usecase, so I'm adapting a little
| podcast search site I made to try and use all of these at the
| same time. So far only Postgres though, will try and add
| Meilisearch today and post it!
|
| Like other people are pointing out, most of these engines won't
| have all the features of ES (or more accurately Lucene) but I am
| pretty convinced that most of the time it doesn't _actually_
| matter and if someone is searching on your site excessively maybe
| there's a problem with your UX (unless you're a search engine or
| repository of information).
|
| [0]: https://supabase.com/blog/postgres-full-text-search-vs-the-r...
| Bilal_io wrote:
| Hey that's a great list of tools.
|
| Are you aware of any that can be used client side like Lyra and
| supports faceted search?
|
| I've been looking for a solution and cannot find it, even an
| algorithm and/or a data structure can be helpful. I attempted
| coming up with a solution myself but ended up with frustration
| when it came to making the facets dynamic and update as other
| filters are applied.
|
| I read a couple of papers and one stood out [0], which
| introduces category theory as a solution to faceted filtering.
| I understood it in theory, but it still does not seem
| straightforward to implement, and I haven't attempted it yet.
|
| 0.
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...
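|
| One common way to get facets that update as other filters are
| applied, sketched under simple assumptions: items are flat
| dictionaries, and the counts for each facet are computed over the
| items that match every *other* selected facet. This is a brute-force
| illustration with hypothetical names, not the category-theory
| approach from the paper.

```python
from collections import Counter


def facet_counts(items, selected):
    """Count facet values, ignoring each facet's own selection.

    `selected` maps a facet name to the set of values currently chosen.
    """
    facets = {name for item in items for name in item}
    counts = {}
    for facet in facets:
        # Apply every active filter except the one on this facet, so its
        # remaining options stay visible and show how many items each yields.
        others = {f: v for f, v in selected.items() if f != facet and v}
        matching = [
            item
            for item in items
            if all(item.get(f) in values for f, values in others.items())
        ]
        counts[facet] = Counter(item[facet] for item in matching if facet in item)
    return counts


items = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]
print(facet_counts(items, {"color": {"red"}}))
```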
| hardwaresofton wrote:
| So for client-side search, I generally know of Lunr.js:
|
| https://lunrjs.com/docs/index.html
|
| There are some others but I can't find them at this moment --
| a bunch of the other projects I find are somewhat abandoned,
| lunr is actually on my list of things to use (because it
| makes the most sense to just ship a pre-built index with the
| first like... 5 letters maybe of typeahead, no matter how
| fast the backend is)
| Bilal_io wrote:
| Thanks for the link. This unfortunately is not what I am
| looking for. Faceted filters are a different beast.
| hawski wrote:
| Thank you for this comparison. I would also like to know how
| Bleve Search (https://github.com/blevesearch/bleve) turns out.
|
| I have had a small search engine project in my free-time pipeline
| for many years now, but I haven't even gotten to crawling yet, and
| I intend to tackle the searching part after some of that.
| hardwaresofton wrote:
| You're right I should put bleve on there as well. This isn't
| even the whole list. Toshi
| (https://github.com/toshi-search/Toshi) is also out there...
| snikolaev wrote:
| If you decide to add Manticore Search to the list feel free
| to ping me at sergey@manticoresearch.com if you need help
| with preparing the ingestion scripts etc.
| hardwaresofton wrote:
| Oh! Damn it I forgot about manticore -- I had seen it
| before but forgot to include it.
|
| Eventually all of these projects will be highlighted on
| Awesome F/OSS (https://awsmfoss.com), but for now I'm
| just going to dump my bookmarks here for other people,
| since I'm leaving awesome projects out:
|
| Search Engines
|
| AWS OpenSearch https://github.com/opensearch-project/OpenSearch
|
| https://github.com/opensearch-project/OpenSearch-Dashboards
|
| https://github.com/opensearch-project/perftop
|
| https://github.com/go-ego/riot
|
| https://groonga.org/ https://github.com/groonga/groonga
|
| https://github.com/meilisearch/MeiliSearch
|
| https://github.com/mosuka/bayard
|
| https://github.com/nezaboodka/nevod
|
| https://github.com/searx/searx
|
| https://github.com/stryku/okon
|
| https://github.com/toshi-search/Toshi
|
| https://github.com/typesense/typesense
|
| https://github.com/valeriansaliou/sonic
|
| Algolia
|
| https://github.com/marconi1992/algolite
|
| https://quickwit.io/
|
| https://github.com/quickwit-inc/quickwit
|
| https://docs.meilisearch.com/
|
| https://github.com/prabhatsharma/zinc
|
| phalanx https://github.com/blugelabs/bluge
| https://github.com/mosuka/phalanx
| https://github.com/mosuka/blast
|
| ManticoreSearch
|
| https://github.com/manticoresoftware/manticoresearch
|
| https://github.com/manticoresoftware/docker
|
| https://manticoresearch.com/blog/manticore-alternative-to-el...
|
| https://manticoresearch.com/
|
| https://manual.manticoresearch.com/Introduction
|
| https://forum.manticoresearch.com/t/manticore-search-cheatsh...
|
| https://forum.manticoresearch.com/
|
| Whoosh https://whoosh.readthedocs.io/en/latest/
|
| https://pypi.org/project/Whoosh/
|
| lyra https://github.com/nearform/lyra
|
| https://nearform.github.io/lyra/
|
| https://github.com/LyraSearch/lyra
|
| https://lyrasearch.io/
|
| flexsearch
|
| https://github.com/nextapps-de/flexsearch#performance-benchm...
|
| https://pagefind.app/docs/
|
| Lucene
|
| https://github.com/apache/lucene
|
| https://lucene.apache.org/
|
| ZincSearch
|
| https://zincsearch.com/
|
| Solr
|
| https://solr.apache.org/
|
| https://solr.apache.org/operator/
|
| https://solr.apache.org/guide/solr/latest/getting-started/so...
|
| https://github.com/apache/solr
|
| https://solr.apache.org/guide/solr/latest/deployment-guide/s...
|
| Konnu https://gitlab.com/shadowislord/konnu
|
| Quickwit QuickWit + Clickhouse
|
| https://clickhouse.com/docs/en/guides/developer/full-text-se...
|
| https://clickhouse.com/docs/en/sql-reference/functions/strin...
|
| There is no way I can get to running _all_ of these (this
| project was supposed to be quick!!), but I will run the
| ones I noted earlier, and probably manticore too, since it
| was high on my list and looks quite polished.
| _tom_ wrote:
| I'd encourage you to maintain and publish your list of
| search engines. Even if you aren't supporting them.
|
| The list has value on its own, especially if you maintain
| it.
| kapilvt wrote:
| + xapian which has been around a while, and while gpl
| licensed, is quite capable https://xapian.org/
| donio wrote:
| Xapian is great, especially when you need a C/C++
| library rather than a separate service. Kinda like an
| sqlite for search. Some of my favorite tools like notmuch
| and recoll use it.
| fzliu wrote:
| + Milvus (https://github.com/milvus-io/milvus) for large
| scale similarity/semantic search.
| thirdtrigger wrote:
| + Weaviate for vector based search. Has a BSD-3 license.
| https://weaviate.io/developers/weaviate/current/
| nightpool wrote:
| > and if someone is searching on your site excessively maybe
| there's a problem with your UX (unless you're a search engine
| or repository of information).
|
| I don't understand this comment. Why would you search something
| that *isn't*, in some senses, a repository of information? I
| would say almost every website needs to have search in some
| sense, and it's *because* sites function as a repository of
| information that they need this search. Think about e.g.
| Stripe's documentation, or Github's repository / code search.
| HN is also another great example--I search for stories or
| comments all the time to try and remember something I read
| about recently or heard about last week, but couldn't quite
| remember. I'm hard-pressed to think of a web site I use
| regularly that *shouldn't* have full-text search, if I'm being
| honest.
| hardwaresofton wrote:
| I don't consider use cases like documentation a "repository"
| of information, but maybe this is just me phrasing it
| badly. In the literal sense sure it is, but when I think of a
| "repository of information" I think of wikipedia, amazon
| search items, etc.
|
| The scale of a documentation site is a very different problem
| -- you can brute force it in ways that you can't at larger
| scales.
|
| I agree that HN would be a case of the large repository, but
| even then what most people want out of HN search is pretty
| simple/basic keyword search. I think a decent non-frustrating
| HN search feature could be very basic and get by without most
| of the advanced features/rabbit holes available in search.
|
| Basically I think most apps fall into the lighter search use
| case -- command palettes, search inside of apps with a small
| scale of information, etc.
|
| My comment wasn't that apps _shouldn't_ have full text
| search -- it was that most that have full text search don't
| need _complex_ full text search with all the bells and
| whistles that lucene and other serious search engines
| provide. These up-and-comers might be enough for a bunch of
| apps for which search is not the main feature.
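To illustrate the "brute force" point above, here is a minimal sketch of a linear-scan keyword search over a small documentation set. The function name, the sample documents, and the all-terms-must-match rule are assumptions made for illustration; they are not taken from any engine mentioned in this thread.

```python
from typing import Dict, List


def brute_force_search(docs: Dict[str, str], query: str) -> List[str]:
    """Return IDs of documents that contain every query term (case-insensitive)."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = []
    for doc_id, text in docs.items():
        lowered = text.lower()
        # Linear scan over every document: fine for hundreds or thousands
        # of pages, impractical once the corpus reaches millions.
        if all(term in lowered for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical documentation pages.
docs = {
    "install": "How to install the command line client",
    "billing": "Billing and invoices for your account",
    "cli-usage": "Command reference for the client",
}
print(brute_force_search(docs, "command client"))
```

There is no index at all here, which is the trade-off being described: nothing to build or keep in sync, at the cost of work proportional to the corpus size on every query.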
| TylerE wrote:
| Most site searches are basically unusable. Either they aren't
| very good, are painfully slow, or both.
|
| Just Googling site:foo.com/baz <query> almost always produces
| better results.
| francoismassot wrote:
| You can also consider lnx, which is based on tantivy and
| performs quite well (https://lnx.rs/).
| hardwaresofton wrote:
| Meili is still ingesting documents but we're live:
|
| https://news.ycombinator.com/item?id=33321268
|
| Maybe I should have used their batch thing instead.
| MobiusHorizons wrote:
| Would it make sense to include Sqlite FTS5 in that mix?
| hardwaresofton wrote:
| It would, I did for the supabase post but... This is already
| way too much! I have no idea when I'll actually be able to
| get to all this as-is.
|
| Waiting for meilisearch to ingest documents right now and the
| Show HN is going up.
| blacklight wrote:
| While I really like their lightweight, SQL-like protocol instead
| of Elasticsearch's fat JSON, I really think that this project
| could have much more impact if it could be a drop-in replacement
| for ES.
|
| Even if it offers only a fraction of the features offered by ES,
| that may be fair enough for at least half of the use-cases out
| there.
|
| Sonic could have really had a strong selling point: "Use an ES-
| alternative that works fine in most of the real-world
| applications, but it's written in Rust and it only takes a
| fraction of the memory footprint required by ES, and it shouldn't
| require you to change your application code".
|
| Instead, they are proposing yet another search protocol, that
| developers have to learn and adopt. That definitely increases the
| adoption barriers.
| tensor wrote:
| It's probably fairly easy to write an adapter here.
| xvello wrote:
| Since Elastic spitefully patched all of their client libraries
| to fail if the server is not a "genuine" ES server, I don't see
| what good a drop-in replacement with protocol compatibility
| would do.
|
| Go client: https://github.com/elastic/go-
| elasticsearch/blob/3985f2a1554...
|
| Python client: https://github.com/elastic/elasticsearch-
| py/commit/e72aa3e24...
| snikolaev wrote:
| Is it prohibited to include `X-Elastic-Product:
| Elasticsearch` in the output of your server if the user
| instructs the server to do so? :)
| hangonhn wrote:
| I don't see how they can legally have any control over what
| a 3rd party's software outputs. And more importantly, how
| would they even enforce such restrictions?
| yvan wrote:
| I believe Elasticsearch is a trademark.
| jeltz wrote:
| A trademark does not forbid people from using a name, it
| only restricts how it can be used in marketing. I do not
| see how that would be applicable here.
| metadat wrote:
| Are HTTP headers important or even relevant at all for
| branding trademark purposes?
|
| Such a concern seems utterly ridiculous.
| mumblemumble wrote:
| If it really does work this way, then we're all doomed.
|
| https://stackoverflow.com/questions/1114254/why-do-all-
| brows...
| blowski wrote:
| I imagine AWS can't put it on the headers of their
| managed service, and that's what it's about.
| AbraKdabra wrote:
| Those libraries are open source, just nuke those
| restrictions and you're good to go. Is it the best way?
| Maybe not, but it's better than modifying your server
| responses (and in the worst 1984 case, allowing Elastic to
| sue you), if you develop such a tool you can always put
| that distinction in your README.
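The header check discussed in this subthread can be sketched roughly as follows. This is not the code from the Elastic client libraries; the function name is hypothetical, and the only detail taken from the thread is the `X-Elastic-Product: Elasticsearch` header itself.

```python
from typing import Mapping


def looks_like_elasticsearch(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers carry the product marker.

    Hypothetical helper: a simplified stand-in for the kind of check
    described above, not the real client implementation.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-elastic-product") == "Elasticsearch"


print(looks_like_elasticsearch({"X-Elastic-Product": "Elasticsearch"}))
print(looks_like_elasticsearch({"Content-Type": "application/json"}))
```

A check of this shape depends only on what the server chooses to send, which is why the thread debates whether a compatible server could simply emit the same header.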
| markandrewj wrote:
| Although not exactly the same, Elastic has an SQL query syntax
| which can be used now as well.
|
| https://www.elastic.co/what-is/elasticsearch-sql
| leros wrote:
| ElasticSearch is so much more than search. Sonic is very
| minimal in comparison, so a drop in replacement doesn't work
| here.
|
| But yes, Sonic could replace lots of use cases.
| nathell wrote:
| I've written a full-text search engine as well. I don't tout it
| as a replacement for Elasticsearch, but it does have a few
| advantages: it's fast; supports HTML documents; supports Polish
| inflection (via a full-blown morphological dictionary, not just a
| stemmer); and has a very compact on-disk format (pre-parsed HTML
| trees, Huffman-encoded over large alphabets). Oh, and it's 100%
| Clojure.
|
| It underlies a concordancer GUI called Smyrna:
| https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl
|
| I haven't touched it in six years, other than a few small
| changes. But I do plan on revisiting it when time permits.
| johnebgd wrote:
| That's very cool. I hope you consider open sourcing it so
| others can contribute.
| nathell wrote:
| It is open-source already (MIT)! I just need to make other
| languages more easily pluggable, and factor out the search
| engine so that it can be used on its own. :)
| _tom_ wrote:
| Could your stemmer be ported to Lucene? Might get more usage
| there.
| scottwick wrote:
| Does anyone have any recommendations of books or other resources
| that go over the theory behind full-text search? i.e. language
| processing, data encoding, on-disk storage and retrieval, etc.
| sanxiyn wrote:
| If you want a book, Managing Gigabytes is still pretty good.
| snikolaev wrote:
| https://nlp.stanford.edu/IR-book/information-retrieval-book....
| dang wrote:
| Related:
|
| _Sonic: Fast, lightweight and schemaless search back end in
| Rust_ - https://news.ycombinator.com/item?id=19471471 - March
| 2019 (39 comments)
| excsn wrote:
| This is not a direct alternative to ElasticSearch. Tantivy is
| closer to an alternative to ElasticSearch since ES is built on
| top of Lucene. An alternative could be achieved if built on top
| of Tantivy.
|
| Sonic here only returns document identifiers so you will never be
| able to get document information back. This is very useful though
| if all you want to do is index text data and then get the stored
| information from another data store.
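The "IDs only" pattern described above can be sketched as a two-step lookup: ask the search backend for matching identifiers, then fetch the records from the primary store. Both `INDEX` and `STORE` below are hypothetical in-memory stand-ins (for the search backend and for a database), and the function names are made up for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for the search backend: word -> document IDs.
INDEX: Dict[str, List[str]] = {
    "rust": ["doc:1", "doc:3"],
    "search": ["doc:1", "doc:2"],
}

# Hypothetical stand-in for the primary data store (e.g. a SQL table).
STORE: Dict[str, Dict[str, str]] = {
    "doc:1": {"title": "Search in Rust"},
    "doc:2": {"title": "Search basics"},
    "doc:3": {"title": "Rust for beginners"},
}


def search_ids(word: str) -> List[str]:
    """Step 1: the search backend returns identifiers only."""
    return list(INDEX.get(word.lower(), []))


def search_documents(word: str) -> List[Dict[str, str]]:
    """Step 2: resolve the identifiers against the primary store."""
    # Skip IDs that no longer exist in the store (index may lag behind).
    return [STORE[doc_id] for doc_id in search_ids(word) if doc_id in STORE]


print(search_documents("rust"))
```

The document text lives in one place only, so there is nothing to duplicate; the cost is a second round trip and no built-in highlighting of matched text.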
| codedokode wrote:
| > Sonic here only returns document identifiers
|
| In many cases that is what you want because you have the data
| in a database and don't want to duplicate it in Elasticsearch.
| counttheforks wrote:
| > Sonic here only returns document identifiers so you will
| never be able to get document information back
|
| Why would you want that anyway? Always thought it was silly to
| duplicate all your data which will be stored in a real database
| anyway
| excsn wrote:
| From a use case I am not experienced with. If you index
| books, you want the search engine to return highlighted data
| like google does.
|
| Also, now that I think of it, typically logs/structured data
| is stored only in ES.
| sanxiyn wrote:
| Quickwit is a search engine built on top of Tantivy (by the
| author of Tantivy): https://github.com/quickwit-oss/quickwit
|
| Quickwit supports an Elasticsearch-compatible bulk indexing API.
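For readers unfamiliar with the Elasticsearch bulk format, here is a minimal sketch of building a bulk request body: newline-delimited JSON in which each document is preceded by an action line. The function name and sample documents are hypothetical, and whether a given compatible server accepts exactly this action metadata is an assumption to verify against its documentation.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build a newline-delimited JSON body in the Elasticsearch bulk style."""
    lines = []
    for doc in docs:
        # Action line: tells the server to index the next line into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk format expects the body to end with a trailing newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("logs", [{"msg": "started"}, {"msg": "stopped"}])
print(body, end="")
```

The body would then be sent with a POST request to the server's bulk endpoint; the endpoint path and content type are left out here because they depend on the server.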
| croes wrote:
| Most of the time these ES replacements lack decent access
| control.
|
| One thing is to find what you search, but the other is not to
| find what you aren't allowed to see.
| sanxiyn wrote:
| Meilisearch supports ES-like document access control.
| DeathArrow wrote:
| >Also, Sonic only keeps the N most recently pushed results for a
| given word, in a sliding window way (the sliding window width can
| be configured)
|
| If you discard many potential hits, why not use /dev/null as the
| search engine?
| Someone1234 wrote:
| I believe you must have misread what you quoted, because
| whatever point you're trying to make doesn't really follow
| from it.
|
| They let you configure how many results to keep for a given
| query, based on your use case (e.g. if your website only lists
| 100 results, don't store beyond that).
|
| If more results than that for a given query are returned then
| they disregard additional results since you told it you won't
| make use of them. In essence, they're saving you from caching
| results that you'll never consume.
|
| How you got from this to "just use /dev/null" is a mystery to
| me. It has to be a misread or misunderstanding.
| nine_k wrote:
| This thing looks like a very generic cache. You can of course
| use /dev/null as a degenerate cache, without any performance
| benefit though.
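The quoted behaviour ("keeps the N most recently pushed results for a given word, in a sliding window way") can be sketched with a bounded queue per word. This is an illustration of the idea only, not Sonic's implementation; the class name, the whitespace tokenizer, and the window size used below are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class SlidingWindowIndex:
    """Toy inverted index that keeps only the newest `window` IDs per word."""

    def __init__(self, window: int = 1000) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._postings: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def push(self, doc_id: str, text: str) -> None:
        # Naive whitespace tokenizer; each distinct word is recorded once.
        for word in set(text.lower().split()):
            self._postings[word].append(doc_id)

    def query(self, word: str) -> List[str]:
        # .get() avoids creating an empty entry for unknown words.
        postings = self._postings.get(word.lower())
        return list(reversed(postings)) if postings else []


idx = SlidingWindowIndex(window=2)
idx.push("a", "the cat")
idx.push("b", "the dog")
idx.push("c", "the bird")
print(idx.query("the"))  # only the two most recent documents remain
print(idx.query("cat"))  # rare words are unaffected by the window
```

Only very common words ever hit the limit, so the index size stays bounded while rare, more selective terms keep all of their documents.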
| manigandham wrote:
| Lots of (elastic)search alternatives now, I keep track here:
| https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
|
| Sonic is good. Typesense is probably what most are looking for as
| more of an Algolia-like setup: https://typesense.org/
___________________________________________________________________
(page generated 2022-10-24 23:00 UTC)