[HN Gopher] Nixiesearch: Running Lucene over S3, and why we're b...
___________________________________________________________________
Nixiesearch: Running Lucene over S3, and why we're building a new
search engine
Author : shutty
Score : 99 points
Date : 2024-10-10 09:11 UTC (13 hours ago)
(HTM) web link (nixiesearch.substack.com)
(TXT) w3m dump (nixiesearch.substack.com)
| ko_pivot wrote:
| I'm a fan of all these projects that are leveraging S3 to
| implement high availability / high scalability for traditionally
| sensitive stateful workloads.
|
| Local caching is a key element of such architectures, otherwise
| S3 is too slow and expensive to query.
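[Editor's note] A minimal sketch of such a read-through chunk cache, assuming a backend that supports ranged reads (as S3 does via the `Range` header on GetObject). All names here are illustrative, not any real project's API:

```python
# Read-through chunk cache over a blob store with ranged reads.
# fetch_range(key, start, end) would wrap e.g. S3 GetObject with
# "Range: bytes=start-end" in a real deployment.
CHUNK = 1 << 20  # cache in 1 MiB chunks

class CachedReader:
    def __init__(self, fetch_range):
        self.fetch_range = fetch_range
        self.cache = {}  # (key, chunk_index) -> bytes

    def read(self, key, offset, length):
        out = bytearray()
        pos = offset
        while pos < offset + length:
            idx = pos // CHUNK
            if (key, idx) not in self.cache:
                # miss: pay one remote request for the whole chunk
                start = idx * CHUNK
                self.cache[(key, idx)] = self.fetch_range(
                    key, start, start + CHUNK - 1)
            chunk = self.cache[(key, idx)]
            lo = pos - idx * CHUNK
            take = min(CHUNK - lo, offset + length - pos)
            out += chunk[lo:lo + take]
            pos += take
        return bytes(out)
```

Repeated small reads of nearby offsets then cost one object-store request instead of one per read, which is the whole point of the caching layer.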
| candiddevmike wrote:
| The write speed is going to be horrendous IME, and how do you
| handle performant indexing...
| gyre007 wrote:
| It took us almost two decades, but truly cloud-native
| architectures are finally becoming a reality. Warp and
| Turbopuffer are two of the many other examples.
| candiddevmike wrote:
| Curious what your definition of cloud native is and why you
| think this is a new innovation. Storing your state in a bunch
| of files on a shared disk is a tale as old as time.
| cowsandmilk wrote:
| Not having to worry about the size of the disk, for one. So
| much time on on-premise systems was spent managing quotas for
| systems and users alongside the physical capacity.
| mdaniel wrote:
| I didn't recognize Turbopuffer but a quick search coughed up a
| previous discussion:
| https://news.ycombinator.com/item?id=40916786
|
| I'm guessing Warp is Warpstream which I have been chomping at
| the bit to try out: https://hn.algolia.com/?q=warpstream
| Sirupsen wrote:
| Ya, the world needed S3 to become fully consistent. This didn't
| happen until the end of 2020!
| manx wrote:
| I thought about creating a search engine using
| https://github.com/phiresky/sql.js-httpvfs, commoncrawl and
| cloudflare R2. But never found the time to start...
| mallets wrote:
| Many things seem feasible with competitive object storage
| pricing. Still needs a little bit of local caching to reduce
| read requests and origin abuse.
|
| I think rclone mount can do the same thing with its chunked
| reads + cache, wonder what's the memory overhead for the
| process.
| oersted wrote:
| You will like this then, that was the main demo from the
| Quickwit team.
|
| https://common-crawl.quickwit.io/
| mhitza wrote:
| I used offline indexing with Solr back in 2010-2012, because the
| latency between the Solr server and the MySQL db (indexing done
| via the dataimport handler) made the indexer take hours instead
| of under an hour (same server vs. servers in the same
| datacenter).
|
| In many ways Solr has come a long way since, and I'm curious to
| see how well they can make a similar system perform in the cloud
| environment.
| marginalia_nu wrote:
| This would have been a lot easier to read without all the memes
| and attempts to inject humor into the writing. It's frustrating
| because it's an otherwise interesting topic :-/
| prmoustache wrote:
| How hard is it to just jump past them?
|
| Answer: it is not.
| infecto wrote:
| It generally is a major distraction from the content and
| feels like a pattern from a decade+ ago when technical blog
| posts became the hot thing to do.
|
| You can certainly jump over it but I imagine a number of
| people like myself just skip the article entirely.
| Semaphor wrote:
| It is.
| vundercind wrote:
| I like the style, but this case felt forced. Like when
| corporate tries to do memes.
| mikeocool wrote:
| I love all of the software coming out recently backed by simple
| object storage.
|
| As someone who spent the last decade and half getting alerts from
| RDBMSes I'm basically to the point that if you think your system
| requires more than object storage for state management, I don't
| want to be involved.
|
| My last company looked at rolling out elastic/open search to
| alleviate certain loads from our db, but it became clear it was
| just going to be a second monstrously complicated system that was
| going to require a lot of care and feeding, and we were probably
| better off spending the time trying to squeeze some additional
| performance out of our DB.
| spaceribs wrote:
| This is very much the Unix philosophy, right? Everything is a
| file. [1]
|
| [1]https://en.wikipedia.org/wiki/Everything_is_a_file
| pjc50 wrote:
| Not quite - "everything is a blob" has very different
| concurrency semantics to "everything is a POSIX file". You
| can't write into the middle of a blob, for example. This
| makes certain use cases harder but the concurrency of blobs
| is _much_ easier to reason about and get right.
|
| Personally I think you might actually need a DB to do the
| work of a DB, and you can't as easily build one on top of a
| blob store as on a block device. But I do think most
| distributed systems should use blob and/or DB and _not_ the
| filesystem.
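[Editor's note] One common workaround for "you can't write into the middle of a blob" is to keep data objects immutable and swap a tiny mutable manifest with a compare-and-swap, which most blob stores offer in some form (Azure has ETag `If-Match` preconditions; S3 recently added conditional writes). A toy sketch against an in-memory stand-in store:

```python
# Immutable blobs + atomic manifest swap: segments are written once
# under fresh names; a small manifest object is updated with a
# compare-and-swap to point at the current segment set.
import json
import uuid

class FakeBlobStore:
    """In-memory stand-in for a blob API with conditional puts."""
    def __init__(self):
        self.objects, self.etags = {}, {}

    def put(self, key, body, if_match=None):
        if if_match is not None and self.etags.get(key) != if_match:
            raise RuntimeError("precondition failed")  # lost the race
        self.objects[key] = body
        self.etags[key] = uuid.uuid4().hex
        return self.etags[key]

    def get(self, key):
        return self.objects[key], self.etags.get(key)

def publish_segment(store, manifest_key, segment_bytes):
    # 1. write the immutable segment under a unique name
    seg_key = f"segments/{uuid.uuid4().hex}"
    store.put(seg_key, segment_bytes)
    # 2. CAS the manifest to include it; retry on conflict
    while True:
        body, etag = store.get(manifest_key)
        manifest = json.loads(body)
        manifest["segments"].append(seg_key)
        try:
            store.put(manifest_key, json.dumps(manifest), if_match=etag)
            return seg_key
        except RuntimeError:
            continue  # someone else updated the manifest; re-read
```

Usage assumes the manifest was created first, e.g. `store.put("manifest", json.dumps({"segments": []}))`. Concurrency is easy to reason about precisely because only the tiny manifest is ever contended.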
| candiddevmike wrote:
| Why would you prefer state management in object storage vs a
| relational (or document) database?
| mikeocool wrote:
| So many fewer moving parts to manage/break.
| orthecreedence wrote:
| Two main reasons I can see:
|
| Ops is easier, for the most part. Doing ops on an RDBMS
| _correctly_ can be a pain. Things like replication, failover,
| performance tuning, etc etc can be hard. This is much less of
| an issue because services like RDS solve this and have solved
| it for a long time. Not a huge issue there.
|
| Splitting compute from storage makes scaling a lot easier,
| _especially_ when storage is an object store system where you
| don't have to worry about RAID, disk backups, etc etc.
| Especially for clustered systems like elasticsearch, having
| object store backing would be incredible: if you need to spin
| up/down a new server, instead of starting it, convincing it
| to download the portions of the indexes it's supposed to and
| waiting for everything to transfer, you just start it and let
| it run immediately. You can also now run 80% spot instances
| for your compute nodes because if one gets recalled, the
| replacement doesn't have to sync all its state from the other
| servers, it can just go to business as usual, and a sudden
| loss of 60% of your nodes doesn't mean data loss like it does
| if your nodes are holding all the state.
|
| I think for something like an RDBMS, object-store backing is
| very likely completely overkill, unless you're hitting some
| scaling threshold that most of us don't deal with ever. For
| clustered DB systems (cassandra/scylla, ES, etc etc),
| splitting out storage makes cluster management, scalability,
| and resiliency worlds easier.
| remram wrote:
| On the other hand, the S3-compatible server options are quite
| limited. While you're not locking yourself to one cloud, you
| are locking yourself to the cloud.
| mikeocool wrote:
| At this point in my career, I've found that paying to make a
| hard problem someone else's is often well worth it.
| oersted wrote:
| Check out Quickwit, it is briefly mentioned but I think
| mistakenly dismissed. They have been working on a similar concept
| for a few years and the results are excellent. It's in no way
| mainly for logs as they claim, it is a general purpose cloud
| native search engine like the one they suggest, very well
| engineered.
|
| It is based on Tantivy, a Lucene alternative in Rust. I have
| extensive hands-on experience with both and I highly recommend
| Tantivy; it's just superior in every way now, such a pleasure to
| use, an ideal example of what Rust was designed for.
| victor106 wrote:
| Thanks for this info.
| bomewish wrote:
| The big issue with tantivy I've found is that it only deals
| with immutable data. So it can't be used for anything you want
| to do CRUD on. This rules out a LOT of use cases. It's a real
| shame imo.
| oersted wrote:
| It is indeed mostly designed for bulk indexing and static
| search, but that is not a strict limitation: frequent small
| inserts and updates are performant too. Deleting can be a bit
| awkward, since you can only delete every document with a given
| term in a field, but if you use it on a unique id field it's
| just like a normal delete.
|
| Tantivy is a low-level library to build your own search
| engine (Quickwit), like Lucene, it's not a search engine in
| itself. Kind of like how DBs are built on top of Key-Value
| Stores. But you can definitely build a CRUD abstraction on
| top of it.
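[Editor's note] That delete-by-id-term pattern can be sketched as a toy model: documents live in immutable segments, deletes are tombstones on an id, and an update is delete + add. This mirrors the Lucene/Tantivy approach in spirit only and is not either library's API:

```python
# Toy CRUD over append-only segments with tombstone deletes.
class SegmentIndex:
    def __init__(self):
        self.segments = []    # sealed, immutable {id: doc} dicts
        self.buffer = {}      # in-flight writes, not yet a segment
        self.tombstones = set()

    def add(self, doc_id, doc):
        self.tombstones.discard(doc_id)
        self.buffer[doc_id] = doc

    def delete(self, doc_id):
        self.buffer.pop(doc_id, None)
        self.tombstones.add(doc_id)

    def update(self, doc_id, doc):
        # "update" is just delete-by-id-term followed by add
        self.delete(doc_id)
        self.add(doc_id, doc)

    def commit(self):
        # seal the buffer into a new immutable segment
        if self.buffer:
            self.segments.append(dict(self.buffer))
            self.buffer = {}

    def get(self, doc_id):
        if doc_id in self.tombstones:
            return None
        if doc_id in self.buffer:
            return self.buffer[doc_id]
        for seg in reversed(self.segments):  # newest segment wins
            if doc_id in seg:
                return seg[doc_id]
        return None
```

The segments themselves are never modified; cleanup of tombstoned documents happens later, during segment merges.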
| pentlander wrote:
| I'm pretty sure that Lucene is exactly the same, the segments
| it creates are immutable and Elastic is what handles a
| "mutable" view of the data. Which makes sense because Tantivy
| is like Lucene, not ES.
|
| https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/.
| ..
| Semaphor wrote:
| > It's in no way mainly for logs as they claim
|
| Where can I find more information on using it for user-facing
| search? The repository [0] starts with "Cloud-native search
| engine for observability (logs, traces, and soon metrics!)" and
| keeps talking about those.
|
| [0]: https://github.com/quickwit-oss/quickwit
| hovering_nox wrote:
| I would say here: Features
|
| https://quickwit.io/docs/overview/introduction#key-features
| oersted wrote:
| That just seems to be the market where search engines have
| the most obvious business case, Elasticsearch positioned
| themselves in the same way. But both are general-purpose
| full-text search engines perfectly capable of any serious
| search use-case.
|
| Their original breakout demo was on Common Crawl:
| https://common-crawl.quickwit.io/
|
| But thanks for pointing it out, I hadn't looked at it in a
| few months, it looks like they significantly changed their
| pitch in the last year. I assume they got VC money and they
| need to deliver now.
| AsianOtter wrote:
| But the demo does not work.
|
| I tried "England is" and a few similar queries. It spends
| three seconds then shows that nothing is found.
| oersted wrote:
| I tried it once and it instantly showed no results, but
| then I tried it again and it returned results in <1s.
| Just try it with a bunch of queries, I think there's
| caching too so it's hard to gauge performance properly.
|
| The blog post about the demo is from 2021 and they
| haven't promoted it much since. I'm surprised that they
| even kept it online, according to the sidebar it was
| ~$810/month in AWS at the time.
| erk__ wrote:
| I have been using Tantivy for Garfield comic search for a few
| years now, it has been really nice to use in all that time.
| jprd wrote:
| I'm simultaneously intrigued and thinking this is a funny
| joke at the same time. If this isn't a joke, I would love an
| example.
| erk__ wrote:
| Luckily it is not a joke!
|
| It's been running in some capacity for some years now, through
| a couple of rewrites. At some point Discord added
| "auto-complete" for commands, which means I can do a live
| lookup and give users a list of comics containing a given piece
| of text.
|
| My index is a bit out of date, but comics before September
| last year can be searched up.
|
| The search index lives fully in memory, as it is not that big
| (only 17363 comics). This does mean that it is rebuilt on every
| startup, but that does not take long compared to the month-long
| uptime the bot usually has.
|
| Example of a search for "funny joke":
| https://imgur.com/a/J4sRhPJ
|
| Hosted bot: https://discord.com/application-
| directory/404364579645292564
|
| Source code: https://git.sr.ht/~erk/lasagna
| ZeroCool2u wrote:
| Meili search is also a great option.
|
| https://www.meilisearch.com/docs/learn/resources/comparison_...
| orthecreedence wrote:
| Does Meili support object store backends?
| notamy wrote:
| Meilisearch is great when it works, but when it breaks it's a
| total nightmare. I've hit multiple bugs that destroyed my
| search index, I've hit multiple undocumented limits, ... that
| all required rebuilding my index from scratch and doing a lot
| of work to find what was actually going on to report it. It
| doesn't help that some of the errors it gives are incredibly
| non-specific and make it quite difficult to find what's
| actually breaking it.
|
| All of that said, I still use it because it has sucked less
| than the other search engines to run.
| lsowen wrote:
| Has anyone tried openobserve
| (https://github.com/openobserve/openobserve)? How does it
| compare/contrast to Quickwit as an "Elasticsearch for logs"
| replacement?
| cynicalsecurity wrote:
| This is a great way to waste investors' money.
| stroupwaffle wrote:
| There's no such thing as stateless, and there's no such thing as
| serverless.
|
| The universe is a stateful organism in constant flux.
|
| Put another way: brushing-it-under-the-rug as a service.
| zdragnar wrote:
| There is no spoon.
|
| Put it another way: serverless and stateless don't mean what
| you think they mean.
| MeteorMarc wrote:
| I feel clueless
| stroupwaffle wrote:
| It's not the spoon that bends, it's the world around it.
| ctxcode wrote:
| serverless just means that a hosting company routes your
| domain to one or more servers that hosting company owns and
| where they put your code on. And that hosting company can
| spin up more or less servers based on traffic.. TL;DR;
| Serverless uses many many servers, just none that you own.
| zdragnar wrote:
| More specifically: no instances that you maintain or
| manage. You don't care which machine your code runs on,
| or even if all your code is even on the same machine.
|
| Compute availability is lumped into one gigantic pool, and
| all of the concerns below the execution of your code are
| managed for you.
| mdaniel wrote:
| > Nixiesearch uses an S3-compatible block storage (like AWS S3,
| Google GCS and Azure Blob Storage)
|
| Hair-splitting: I don't believe Blob Storage is S3 compatible, so
| one may want to consider rewording to distinguish between whether
| it really, no kidding, needs "S3 compatible" or it's a euphemism
| for "key value blob storage"
|
| I'm fully cognizant of the 2017 nature of this, but even they are
| all "use Minio"
| https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo...
| which I guess made a lot more sense before its license change.
| There's also a more recent question from 2023 _(by an alleged
| Microsoft Employee!)_ with a very similar "use this shim"
| answer: https://learn.microsoft.com/en-
| us/answers/questions/1183760/...
| ko_pivot wrote:
| Azure is the only major (or even minor) cloud provider refusing
| to build an S3 API. Strange to me, because Azure Cosmos DB
| supports Mongo and Cassandra at the API level, for example, so
| idk what is so offensive to them about S3 becoming the standard
| HTTP API for object storage.
| ignaloidas wrote:
| It's because the S3 API is a fair bit worse than what they
| offer. They define their guarantees for storage products much
| more clearly than the other clouds do, and for blob storage,
| from my understanding, their model is better than S3's.
| warangal wrote:
| I myself have been working on a personal search engine for
| sometime, and one problem i faced was to have an effective fuzzy-
| search for all the diverse filenames/directories. All approaches
| i could find were based on Levenshtein distance , which would
| have led to storing of original strings/text content in the
| index, and neither would be practical for larger strings'
| comparison nor would be generic enough to handle all knowledge
| domains. This led me to start looking at (Local sensitive hashes)
| LSH approaches to measure difference b/w any two strings in
| constant time. After some work i finally managed to complete an
| experimental fuzzy search engine (keyword search is a just a
| special case!).
|
| In my analysis of 1 million Hacker News stories, it worked much
| better than Algolia search while running on a single core! More
| details are in this post: https://eagledot.xyz/malhar.md.html .
| I tried to submit it here to gather more feedback, but it
| didn't get much traction, I guess!
| iudqnolq wrote:
| I'm super new to this so I'm probably missing something simple,
| but isn't a trigram index one of the canonical solutions for
| fuzzy search? Eg
| https://www.postgresql.org/docs/current/pgtrgm.html
|
| That often involves recording original trigram position, but I
| think that's necessary to weigh "I like happy cats" higher than
| "I like happy dogs but I don't like cats" in a search for
| "happy cats".
| warangal wrote:
| Yes, trigram mainly but also bigram and/or combination of
| both are used generally to implement fuzzy search, zoekt also
| uses trigram index. But such indices depend heavily on the
| content being indexed, for example if ever encounter a rare
| "trigram" during querying not indexed, they would fail to
| return relevant results! LSH implementations on the other
| hand employ a more diverse collection of stats depending upon
| the number of buckets and N(-gram)/window-size used, to
| compare better with unseen content/bytes during querying. But
| it is not cheap as each hash is around 30 bytes, even more
| than the string/text being indexed most of the time ! But its
| leads to fixed size hashes independent of size of content
| indexed and acts as an "auxiliary" index which can be queried
| independently of original index! Comparison of hashes can be
| optimized leading to a quite fast fuzzy search .
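[Editor's note] For contrast, a small MinHash sketch illustrates the fixed-size-signature idea: each string gets a signature of constant length, and the fraction of matching signature slots estimates the Jaccard similarity of the n-gram sets in time independent of string size. This is a generic LSH building block, not the commenter's implementation:

```python
# MinHash: fixed-size signatures whose slot-agreement rate estimates
# Jaccard similarity of the underlying n-gram sets.
import hashlib

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def minhash(s, num_perm=64):
    sig = []
    for seed in range(num_perm):
        # each seed acts as an independent hash function via the salt
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(),
                "big")
            for g in ngrams(s)))
    return sig

def estimate_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures touches only `num_perm` integers regardless of how long the original strings were, which is where the constant-time comparison claim comes from.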
| whalesalad wrote:
| I recently got back into search after not touching ES since like
| 2012-2013. I forgot how much of a fucking nightmare it is to work
| with and query. Love to see innovation in this space.
| staticautomatic wrote:
| I feel like it's not that bad to interact with if you do it
| regularly, but if I go a while without using it I forget how to
| do everything. I sure as hell wouldn't want to admin an
| instance.
| mannyv wrote:
| I had forgotten that a reindex on Solr/Lucene blows away the
| index. Now I remember how much of a nightmare that was, because
| you couldn't find anything until it was done, which was usually
| a few hours when things were HDD-based.
|
| Just started a search project, and this one will be on the list
| for sure.
| jillesvangurp wrote:
| Both Elastic and Opensearch also have S3 based stateless versions
| of their search engines in the works. The Elastic one is
| available in early access currently. It would be interesting to
| see how this one improves on both approaches.
|
| With all the licensing complexities around Elastic, more choice
| is not necessarily bad.
|
| The tradeoff with using S3 is indexing latency (the time between
| the write getting accepted and being visible via search) vs. easy
| scaling. The default refresh interval (the time the search engine
| waits before committing changes to an index) is 1 second. That
| means it takes up to 1 second before indices get updated with
| recently added data. A common performance tweak is to increase
| this to 5 or more seconds. That reduces the number of writes and
| can improve write throughput, which when you are writing lots of
| data is helpful.
|
| If you need low latency (anything where users might want to
| "read" their own writes), clustered approaches are more flexible.
| If you can afford to wait a few seconds, using S3 to store stuff
| becomes more feasible.
|
| Lucene internally stores documents in segments. Segments are
| append only and there tend to be cleanup activities related to
| rewriting and merging segments to e.g. get rid of deleted
| documents, or deal with fragmentation. Once written, having some
| jobs to merge segments in the background isn't that hard. My
| guess is that with S3, the trick is to gather up some amount of
| writes and then store them as one segment and put that in S3.
|
| S3 is not a proper file system and file operations are relatively
| expensive (compared to a file system) because they are
| essentially REST API calls. So, this favors use cases where you
| write segments in bulk and never/rarely update or delete
| individual things that you write. Because that would require
| updating a segment in S3, which means deleting and rewriting it
| and then notifying other nodes somehow that they need to re-read
| that segment.
|
| For both Elasticsearch and Opensearch, log data or other
| time-series data fits this very well because you typically
| don't have to deal with deletes/updates.
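[Editor's note] The "gather writes and store them as one segment" guess above can be sketched as a small write buffer that flushes one immutable segment object per size threshold or refresh interval. Names are hypothetical; `put_object` stands in for a real S3 call:

```python
# Buffered write path: many document writes become one segment PUT.
import json
import time

class SegmentWriter:
    def __init__(self, put_object, max_docs=1000, refresh_interval=5.0):
        self.put_object = put_object  # e.g. wraps s3.put_object
        self.max_docs = max_docs
        self.refresh_interval = refresh_interval
        self.buffer = []
        self.last_flush = time.monotonic()
        self.seq = 0

    def write(self, doc):
        self.buffer.append(doc)
        if (len(self.buffer) >= self.max_docs
                or time.monotonic() - self.last_flush
                >= self.refresh_interval):
            self.flush()

    def flush(self):
        if not self.buffer:
            return None
        # one PUT per segment, not one per document
        key = f"segments/{self.seq:08d}.json"
        self.put_object(key, json.dumps(self.buffer))
        self.seq += 1
        self.buffer = []
        self.last_flush = time.monotonic()
        return key
```

The refresh interval here is exactly the indexing-latency knob described above: a longer interval means fewer, larger segment PUTs but a longer wait before writes become searchable.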
| rakoo wrote:
| I'm wondering if it would be better to have a LevelDB-like
| approach here. Store the recent stuff in DynamoDB, and once it
| hits the threshold, store it in a segment in S3. This is also
| similar to SQLite and WAL.
|
| Really, nothing is ever new in computing.
| ctxcode wrote:
| Sounds like this is going to cost a lot of money (more than it
| should).
| huntaub wrote:
| This is a super cool project, and I think that we will continue
| to see more and more applications move towards an "on S3"
| stateless architecture. That's part of the reason why we are
| building Regatta [1]. We are trying to enable folks who are
| running software that needs file system semantics (like Lucene)
| to get the super-fast NVMe-like latencies on data that's really
| in S3. While this is awesome, I worry about all of the
| applications which _don't_ have someone rewrite a bunch of
| layers to work on S3. That's where we come in.
|
| [1] https://regattastorage.com
| hipadev23 wrote:
| I know object storage backends are all the rage, but this is
| about the most capital-intensive thing you can do on the major
| cloud providers. Storage and reads are cheap, but writes and
| list operations are insanely expensive.
|
| Once you hook these backends up to real-time streaming updates,
| transactions, heavy indexing, or immutable backends that cause
| constant churn (hive/hudi/iceberg/delta lake), you're in for a
| bad time financially.
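[Editor's note] A back-of-envelope makes the asymmetry concrete. The prices below are the commonly cited S3 Standard request rates (roughly $0.005 per 1,000 PUT/LIST and $0.0004 per 1,000 GET); treat them as assumptions and check current pricing:

```python
# Request-cost comparison: writes are ~12.5x dearer per call than
# reads at the assumed S3 Standard rates.
PUT_PER_1K, GET_PER_1K = 0.005, 0.0004

def monthly_request_cost(puts_per_sec, gets_per_sec):
    month = 30 * 24 * 3600  # seconds in a 30-day month
    return (puts_per_sec * month / 1000 * PUT_PER_1K
            + gets_per_sec * month / 1000 * GET_PER_1K)

# A streaming pipeline committing 50 small segments/sec out-spends a
# read-heavy workload doing 10x as many GETs:
write_heavy = monthly_request_cost(50, 0)    # $648.00/month
read_heavy = monthly_request_cost(0, 500)    # $518.40/month
```

This is why constant-churn table formats and real-time indexing hurt on object storage: the per-request write price dominates long before storage or read costs do.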
| parhamn wrote:
| Stateless S3 apps have much more appeal given the existence of
| Cloudflare R2 -- bandwidth is free and GetObject is $0.36 per
| million requests.
___________________________________________________________________
(page generated 2024-10-10 23:01 UTC)