[HN Gopher] Nixiesearch: Running Lucene over S3, and why we're b...
       ___________________________________________________________________
        
       Nixiesearch: Running Lucene over S3, and why we're building a new
       search engine
        
       Author : shutty
       Score  : 99 points
       Date   : 2024-10-10 09:11 UTC (13 hours ago)
        
 (HTM) web link (nixiesearch.substack.com)
 (TXT) w3m dump (nixiesearch.substack.com)
        
       | ko_pivot wrote:
       | I'm a fan of all these projects that are leveraging S3 to
       | implement high availability / high scalability for traditionally
       | sensitive stateful workloads.
       | 
       | Local caching is a key element of such architectures, otherwise
       | S3 is too slow and expensive to query.
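The local-caching point can be sketched as a minimal read-through LRU cache; `fetch` here stands in for a hypothetical wrapper around S3 GetObject, and the keys and capacity are illustrative:

```python
from collections import OrderedDict

class ReadThroughCache:
    """Minimal LRU read-through cache in front of a slow object store."""

    def __init__(self, fetch, capacity=128):
        self.fetch = fetch          # e.g. a function issuing S3 GETs
        self.capacity = capacity
        self.cache = OrderedDict()  # key -> bytes, in LRU order
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        self.misses += 1                    # only misses pay for a GET
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

# Simulated object store: every fetch would be a slow, billed request.
store = {"segment-0": b"...", "segment-1": b"..."}
cache = ReadThroughCache(store.__getitem__, capacity=2)
cache.get("segment-0")
cache.get("segment-0")     # served locally, no second GET
assert cache.misses == 1
```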
        
         | candiddevmike wrote:
         | The write speed is going to be horrendous IME, and how do you
         | handle performant indexing...
        
       | gyre007 wrote:
        | It took us almost two decades, but truly cloud-native
        | architectures are finally becoming a reality. Warp and
        | Turbopuffer are two of the many other examples.
        
         | candiddevmike wrote:
         | Curious what your definition of cloud native is and why you
         | think this is a new innovation. Storing your state in a bunch
         | of files on a shared disk is a tale as old as time.
        
           | cowsandmilk wrote:
            | Not having to worry about the size of the disk, for one. So
            | much time on on-premise systems was spent managing quotas
            | for systems and users alongside the physical capacity.
        
         | mdaniel wrote:
         | I didn't recognize Turbopuffer but a quick search coughed up a
         | previous discussion:
         | https://news.ycombinator.com/item?id=40916786
         | 
          | I'm guessing Warp is Warpstream, which I have been champing
          | at the bit to try out: https://hn.algolia.com/?q=warpstream
        
         | Sirupsen wrote:
          | Ya, the world needed S3 to become fully consistent. This
          | didn't happen until the end of 2020!
        
       | manx wrote:
       | I thought about creating a search engine using
       | https://github.com/phiresky/sql.js-httpvfs, commoncrawl and
       | cloudflare R2. But never found the time to start...
        
         | mallets wrote:
          | Many things seem feasible with competitive object storage
          | pricing. Still needs a little bit of local caching to reduce
          | read requests and origin abuse.
          | 
          | I think rclone mount can do the same thing with its chunked
          | reads + cache; I wonder what the memory overhead is for the
          | process.
        
         | oersted wrote:
         | You will like this then, that was the main demo from the
         | Quickwit team.
         | 
         | https://common-crawl.quickwit.io/
        
       | mhitza wrote:
        | I've used offline indexing with Solr back in 2010-2012, and this
        | was because the latency between the Solr server and the MySQL db
        | (indexing done via the dataimport handler) was causing the
        | indexer to take hours instead of under an hour (same server vs.
        | servers in the same datacenter).
       | 
       | In many ways Solr has come a long way since, and I'm curious to
       | see how well they can make a similar system perform in the cloud
       | environment.
        
       | marginalia_nu wrote:
        | This would have been a lot easier to read without all the memes
        | and attempts to inject humor into the writing. It's frustrating
        | because it's an otherwise interesting topic :-/
        
         | prmoustache wrote:
         | How hard is it to just jump past them?
         | 
          | Answer: it is not.
        
           | infecto wrote:
           | It generally is a major distraction from the content and
           | feels like a pattern from a decade+ ago when technical blog
           | posts became the hot thing to do.
           | 
           | You can certainly jump over it but I imagine a number of
           | people like myself just skip the article entirely.
        
           | Semaphor wrote:
           | It is.
        
           | vundercind wrote:
           | I like the style, but this case felt forced. Like when
           | corporate tries to do memes.
        
       | mikeocool wrote:
       | I love all of the software coming out recently backed by simple
       | object storage.
       | 
        | As someone who spent the last decade and a half getting alerts
        | from RDBMSes, I'm basically at the point that if you think your
        | system requires more than object storage for state management,
        | I don't want to be involved.
       | 
       | My last company looked at rolling out elastic/open search to
       | alleviate certain loads from our db, but it became clear it was
       | just going to be a second monstrously complicated system that was
       | going to require a lot of care and feeding, and we were probably
       | better off spending the time trying to squeeze some additional
       | performance out of our DB.
        
         | spaceribs wrote:
         | This is a very unix philosophy right? Everything is a file?[1]
         | 
         | [1]https://en.wikipedia.org/wiki/Everything_is_a_file
        
           | pjc50 wrote:
           | Not quite - "everything is a blob" has very different
           | concurrency semantics to "everything is a POSIX file". You
           | can't write into the middle of a blob, for example. This
           | makes certain use cases harder but the concurrency of blobs
           | is _much_ easier to reason about and get right.
           | 
           | Personally I think you might actually need a DB to do the
           | work of a DB, and you can't as easily build one on top of a
           | blob store as on a block device. But I do think most
           | distributed systems should use blob and/or DB and _not_ the
           | filesystem.
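A tiny illustration of the semantic difference the parent describes, with a plain dict standing in for a hypothetical blob store:

```python
import os
import tempfile

# POSIX file: you can seek into the middle and overwrite bytes in place.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "r+b") as f:
    f.write(b"hello world")
    f.seek(6)
    f.write(b"WO")          # in-place partial write at offset 6
with open(path, "rb") as f:
    assert f.read() == b"hello WOrld"
os.unlink(path)

# Blob store: no partial writes; the only "update" is replacing the
# whole object, which is what makes concurrency easy to reason about.
blobs = {}
blobs["greeting"] = b"hello world"   # PutObject
blobs["greeting"] = b"hello WOrld"   # full replace, never a seek+write
```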
        
         | candiddevmike wrote:
         | Why would you prefer state management in object storage vs a
         | relational (or document) database?
        
           | mikeocool wrote:
            | So many fewer moving parts to manage/break.
        
           | orthecreedence wrote:
           | Two main reasons I can see:
           | 
           | Ops is easier, for the most part. Doing ops on an RDBMS
           | _correctly_ can be a pain. Things like replication, failover,
           | performance tuning, etc etc can be hard. This is much less of
           | an issue because services like RDS solve this and have solved
           | it for a long time. Not a huge issue there.
           | 
           | Splitting compute from storage makes scaling a lot easier,
            | _especially_ when storage is an object store system where
            | you don't have to worry about RAID, disk backups, etc etc.
           | Especially for clustered systems like elasticsearch, having
           | object store backing would be incredible: if you need to spin
           | up/down a new server, instead of starting it, convincing it
           | to download the portions of the indexes it's supposed to and
           | waiting for everything to transfer, you just start it and let
           | it run immediately. You can also now run 80% spot instances
           | for your compute nodes because if one gets recalled, the
           | replacement doesn't have to sync all its state from the other
           | servers, it can just go to business as usual, and a sudden
           | loss of 60% of your nodes doesn't mean data loss like it does
           | if your nodes are holding all the state.
           | 
           | I think for something like an RDBMS, object-store backing is
           | very likely completely overkill, unless you're hitting some
           | scaling threshold that most of us don't deal with ever. For
           | clustered DB systems (cassandra/scylla, ES, etc etc),
           | splitting out storage makes cluster management, scalability,
           | and resiliency worlds easier.
        
         | remram wrote:
         | On the other hand, the S3-compatible server options are quite
         | limited. While you're not locking yourself to one cloud, you
         | are locking yourself to the cloud.
        
           | mikeocool wrote:
           | At this point my career, I've found that paying to make
           | something hard someone else's is often well worth it.
        
       | oersted wrote:
       | Check out Quickwit, it is briefly mentioned but I think
       | mistakenly dismissed. They have been working on a similar concept
       | for a few years and the results are excellent. It's in no way
       | mainly for logs as they claim, it is a general purpose cloud
       | native search engine like the one they suggest, very well
       | engineered.
       | 
        | It is based on Tantivy, a Lucene alternative in Rust. I have
        | extensive hands-on experience with both and I highly recommend
        | Tantivy; it's just superior in every way now, such a pleasure
        | to use, an ideal example of what Rust was designed for.
        
         | victor106 wrote:
         | Thanks for this info.
        
         | bomewish wrote:
         | The big issue with tantivy I've found is that it only deals
         | with immutable data. So it can't be used for anything you want
         | to do CRUD on. This rules out a LOT of use cases. It's a real
         | shame imo.
        
           | oersted wrote:
            | It is indeed mostly designed for bulk indexing and static
            | search, but that is not a strict limitation: frequent small
            | inserts and updates are performant too. Deleting can be a
            | bit awkward, since you can only delete every document with
            | a given term in a field, but if you use it on a unique id
            | field it's just like a normal delete.
           | 
           | Tantivy is a low-level library to build your own search
           | engine (Quickwit), like Lucene, it's not a search engine in
           | itself. Kind of like how DBs are built on top of Key-Value
           | Stores. But you can definitely build a CRUD abstraction on
           | top of it.
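A toy model (plain Python, not the actual Tantivy API) of why delete-by-term on a unique id field behaves like an ordinary single-document delete:

```python
# Each "document" is a dict of fields; deleting by a term removes every
# document whose field contains that term as a token.
docs = [
    {"id": "a1", "body": "hello world"},
    {"id": "a2", "body": "hello again"},
    {"id": "a3", "body": "goodbye world"},
]

def delete_term(docs, field, term):
    """Drop every document whose `field` contains `term` as a token."""
    return [d for d in docs if term not in d[field].split()]

# On a unique id field, exactly one document matches: a normal delete.
remaining = delete_term(docs, "id", "a2")
assert [d["id"] for d in remaining] == ["a1", "a3"]

# On a shared term, *all* matching documents go - the awkward part.
remaining = delete_term(docs, "body", "hello")
assert [d["id"] for d in remaining] == ["a3"]
```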
        
           | pentlander wrote:
           | I'm pretty sure that Lucene is exactly the same, the segments
           | it creates are immutable and Elastic is what handles a
           | "mutable" view of the data. Which makes sense because Tantivy
           | is like Lucene, not ES.
           | 
           | https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/.
           | ..
        
         | Semaphor wrote:
         | > It's in no way mainly for logs as they claim
         | 
         | Where can I find more information on using it for user-facing
         | search? The repository [0] starts with "Cloud-native search
         | engine for observability (logs, traces, and soon metrics!)" and
         | keeps talking about those.
         | 
         | [0]: https://github.com/quickwit-oss/quickwit
        
           | hovering_nox wrote:
           | I would say here: Features
           | 
           | https://quickwit.io/docs/overview/introduction#key-features
        
           | oersted wrote:
           | That just seems to be the market where search engines have
           | the most obvious business case, Elasticsearch positioned
           | themselves in the same way. But both are general-purpose
           | full-text search engines perfectly capable of any serious
           | search use-case.
           | 
           | Their original breakout demo was on Common Crawl:
           | https://common-crawl.quickwit.io/
           | 
           | But thanks for pointing it out, I hadn't looked at it in a
           | few months, it looks like they significantly changed their
           | pitch in the last year. I assume they got VC money and they
           | need to deliver now.
        
             | AsianOtter wrote:
             | But the demo does not work.
             | 
             | I tried "England is" and a few similar queries. It spends
             | three seconds then shows that nothing is found.
        
               | oersted wrote:
               | I tried it once and it instantly showed no results, but
               | then I tried it again and it returned results in <1s.
               | Just try it with a bunch of queries, I think there's
               | caching too so it's hard to gauge performance properly.
               | 
               | The blog post about the demo is from 2021 and they
               | haven't promoted it much since. I'm surprised that they
               | even kept it online, according to the sidebar it was
               | ~$810/month in AWS at the time.
        
         | erk__ wrote:
         | I have been using Tantivy for Garfield comic search for a few
         | years now, it has been really nice to use in all that time.
        
           | jprd wrote:
            | I'm simultaneously intrigued and suspecting this is a funny
            | joke. If this isn't a joke, I would love an example.
        
             | erk__ wrote:
             | Luckily it is not a joke!
             | 
              | It's something I have had running in some capacity for
              | some years now, through a couple of rewrites. At some
              | point Discord added "auto-complete" for commands, which
              | meant that I could do a live lookup and give users a list
              | of comics where some piece of text appears.
              | 
              | My index is a bit out of date, but comics before
              | September last year can be searched.
              | 
              | The search index lives fully in memory, as it is not that
              | big: it is only 17363 comics. This does mean that it is
              | rebuilt on every startup, but that does not take long
              | compared to the month-long uptime it usually has.
             | 
             | Example of a search for "funny joke":
             | https://imgur.com/a/J4sRhPJ
             | 
             | Hosted bot: https://discord.com/application-
             | directory/404364579645292564
             | 
             | Source code: https://git.sr.ht/~erk/lasagna
        
         | ZeroCool2u wrote:
         | Meili search is also a great option.
         | 
         | https://www.meilisearch.com/docs/learn/resources/comparison_...
        
           | orthecreedence wrote:
           | Does Meili support object store backends?
        
           | notamy wrote:
           | Meilisearch is great when it works, but when it breaks it's a
           | total nightmare. I've hit multiple bugs that destroyed my
           | search index, I've hit multiple undocumented limits, ... that
           | all required rebuilding my index from scratch and doing a lot
           | of work to find what was actually going on to report it. It
           | doesn't help that some of the errors it gives are incredibly
           | non-specific and make it quite difficult to find what's
           | actually breaking it.
           | 
           | All of that said, I still use it because it has sucked less
           | than the other search engines to run.
        
         | lsowen wrote:
         | Has anyone tried openobserve
         | (https://github.com/openobserve/openobserve)? How does it
         | compare/contrast to Quickwit as an "Elasticsearch for logs"
         | replacement?
        
       | cynicalsecurity wrote:
       | This is a great way to waste investors' money.
        
       | stroupwaffle wrote:
       | There's no such thing as stateless, and there's no such thing as
       | serverless.
       | 
       | The universe is a stateful organism in constant flux.
       | 
       | Put another way: brushing-it-under-the-rug as a service.
        
         | zdragnar wrote:
         | There is no spoon.
         | 
         | Put it another way: serverless and stateless don't mean what
         | you think they mean.
        
           | MeteorMarc wrote:
           | I feel clueless
        
             | stroupwaffle wrote:
             | It's not the spoon that bends, it's the world around it.
        
             | ctxcode wrote:
             | serverless just means that a hosting company routes your
             | domain to one or more servers that hosting company owns and
             | where they put your code on. And that hosting company can
             | spin up more or less servers based on traffic.. TL;DR;
             | Serverless uses many many servers, just none that you own.
        
               | zdragnar wrote:
               | More specifically: no instances that you maintain or
               | manage. You don't care which machine your code runs on,
               | or even if all your code is even on the same machine.
               | 
                | Compute availability is lumped into one gigantic pool
                | and all of the concerns below the execution of your
                | code are managed for you.
        
       | mdaniel wrote:
       | > Nixiesearch uses an S3-compatible block storage (like AWS S3,
       | Google GCS and Azure Blob Storage)
       | 
       | Hair-splitting: I don't believe Blob Storage is S3 compatible, so
       | one may want to consider rewording to distinguish between whether
       | it really, no kidding, needs "S3 compatible" or it's a euphemism
       | for "key value blob storage"
       | 
       | I'm fully cognizant of the 2017 nature of this, but even they are
       | all "use Minio"
       | https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo...
       | which I guess made a lot more sense before its license change.
       | There's also a more recent question from 2023 _(by an alleged
       | Microsoft Employee!)_ with a very similar  "use this shim"
       | answer: https://learn.microsoft.com/en-
       | us/answers/questions/1183760/...
        
         | ko_pivot wrote:
         | Azure is the only major (or even minor) cloud provider refusing
         | to build an S3 API. Strange to me, because Azure Cosmos DB
         | supports Mongo and Cassandra at the API level, for example, so
         | idk what is so offensive to them about S3 becoming the standard
         | HTTP API for object storage.
        
           | ignaloidas wrote:
            | It's because the S3 API is quite a bit worse than what they
            | offer. They define their guarantees for storage products way
            | more clearly than other clouds do, and for blob storage,
            | from my understanding, their model is better than S3's.
        
       | warangal wrote:
        | I myself have been working on a personal search engine for some
        | time, and one problem I faced was having an effective fuzzy
        | search for all the diverse filenames/directories. All approaches
        | I could find were based on Levenshtein distance, which would
        | have required storing the original strings/text content in the
        | index, and would neither be practical for comparing larger
        | strings nor generic enough to handle all knowledge domains.
        | This led me to start looking at locality-sensitive hashing
        | (LSH) approaches to measure the difference between any two
        | strings in constant time. After some work I finally managed to
        | complete an experimental fuzzy search engine (keyword search is
        | just a special case!).
        | 
        | In my analysis of 1 million Hacker News stories, it worked much
        | better than Algolia search while running on a single core! More
        | details are provided in this post:
        | https://eagledot.xyz/malhar.md.html . I tried to submit it here
        | to gather more feedback, but it didn't work, I guess!
        
         | iudqnolq wrote:
         | I'm super new to this so I'm probably missing something simple,
         | but isn't a trigram index one of the canonical solutions for
         | fuzzy search? Eg
         | https://www.postgresql.org/docs/current/pgtrgm.html
         | 
         | That often involves recording original trigram position, but I
         | think that's necessary to weigh "I like happy cats" higher than
         | "I like happy dogs but I don't like cats" in a search for
         | "happy cats".
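For reference, trigram similarity of the kind pg_trgm computes can be sketched in a few lines (the padding here is a simplification of pg_trgm's exact word-boundary rules):

```python
def trigrams(s: str) -> set[str]:
    """Character trigrams of a padded, lowercased string."""
    s = f"  {s.lower()} "          # pad so word edges form trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total trigrams (Jaccard-style ratio)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

assert similarity("happy cats", "happy cats") == 1.0
# A near-match scores higher than an unrelated string.
assert similarity("happy cats", "happy cat") > similarity("happy cats", "dogs")
```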
        
           | warangal wrote:
           | Yes, trigram mainly but also bigram and/or combination of
           | both are used generally to implement fuzzy search, zoekt also
           | uses trigram index. But such indices depend heavily on the
           | content being indexed, for example if ever encounter a rare
           | "trigram" during querying not indexed, they would fail to
           | return relevant results! LSH implementations on the other
           | hand employ a more diverse collection of stats depending upon
           | the number of buckets and N(-gram)/window-size used, to
           | compare better with unseen content/bytes during querying. But
           | it is not cheap as each hash is around 30 bytes, even more
           | than the string/text being indexed most of the time ! But its
           | leads to fixed size hashes independent of size of content
           | indexed and acts as an "auxiliary" index which can be queried
           | independently of original index! Comparison of hashes can be
           | optimized leading to a quite fast fuzzy search .
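A minimal MinHash-style sketch of the constant-time comparison idea described above; the signature size and hash function are illustrative choices, not the ~30-byte hashes from the comment:

```python
import hashlib

def minhash(text: str, num_hashes: int = 32, n: int = 3) -> list[int]:
    """Fixed-size MinHash signature over character n-grams.

    The signature length is independent of input length, so any two
    strings can be compared in constant time.
    """
    grams = {text[i:i + n] for i in range(len(text) - n + 1)} or {text}
    sig = []
    for seed in range(num_hashes):
        # One salted hash family per signature slot; keep the min.
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                g.encode(), digest_size=8, salt=seed.to_bytes(16, "big")
            ).digest(), "big")
            for g in grams
        ))
    return sig

def similarity(a: str, b: str) -> float:
    """Fraction of matching slots ~ Jaccard similarity of the n-grams."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

assert similarity("england is", "england is") == 1.0
assert similarity("england is", "zzzz qqqq") == 0.0
```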
        
       | whalesalad wrote:
       | I recently got back into search after not touching ES since like
       | 2012-2013. I forgot how much of a fucking nightmare it is to work
       | with and query. Love to see innovation in this space.
        
         | staticautomatic wrote:
         | I feel like it's not that bad to interact with if you do it
         | regularly, but if I go a while without using it I forget how to
         | do everything. I sure as hell wouldn't want to admin an
         | instance.
        
       | mannyv wrote:
        | I forgot that a reindex on Solr/Lucene blows away the index.
        | Now I remember how much of a nightmare that was, because you
        | couldn't find anything until it was done - which usually took a
        | few hours when things were HDD-based.
       | 
       | Just started a search project, and this one will be on the list
       | for sure.
        
       | jillesvangurp wrote:
       | Both Elastic and Opensearch also have S3 based stateless versions
       | of their search engines in the works. The Elastic one is
        | available in early access currently. It would be interesting to
        | see how this one improves on both approaches.
       | 
       | With all the licensing complexities around Elastic, more choice
       | is not necessarily bad.
       | 
       | The tradeoff with using S3 is indexing latency (the time between
       | the write getting accepted and being visible via search) vs. easy
       | scaling. The default refresh interval (the time the search engine
       | waits before committing changes to an index) is 1 second. That
        | means it takes up to 1 second before indices get updated with
       | recently added data. A common performance tweak is to increase
       | this to 5 or more seconds. That reduces the number of writes and
       | can improve write throughput, which when you are writing lots of
       | data is helpful.
       | 
       | If you need low latency (anything where users might want to
       | "read" their own writes), clustered approaches are more flexible.
       | If you can afford to wait a few seconds, using S3 to store stuff
       | becomes more feasible.
       | 
       | Lucene internally stores documents in segments. Segments are
       | append only and there tend to be cleanup activities related to
       | rewriting and merging segments to e.g. get rid of deleted
       | documents, or deal with fragmentation. Once written, having some
        | jobs to merge segments in the background isn't that hard. My
        | guess is that with S3, the trick is to batch up some amount of
        | writes and then store them as one segment and put that in S3.
       | 
       | S3 is not a proper file system and file operations are relatively
       | expensive (compared to a file system) because they are
       | essentially REST API calls. So, this favors use cases where you
       | write segments in bulk and never/rarely update or delete
       | individual things that you write. Because that would require
       | updating a segment in S3, which means deleting and rewriting it
       | and then notifying other nodes somehow that they need to re-read
       | that segment.
       | 
       | For both Elasticsearch and Opensearch log data or other time
       | series data fits very well to this because you don't have to deal
       | with deletes/updates typically.
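The batching idea in the last two paragraphs can be sketched as follows; the dict stands in for S3, and the interval and key layout are made-up illustrations:

```python
import time

class SegmentWriter:
    """Buffer writes and flush them as one immutable segment object.

    In reality each flush would be a single S3 PutObject, so a longer
    refresh interval means fewer, larger PUTs at the cost of indexing
    latency (writes are invisible to search until flushed).
    """

    def __init__(self, store, refresh_interval=5.0):
        self.store = store
        self.refresh_interval = refresh_interval
        self.buffer = []
        self.segment_id = 0
        self.last_flush = time.monotonic()

    def write(self, doc):
        self.buffer.append(doc)
        if time.monotonic() - self.last_flush >= self.refresh_interval:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One object per segment: append-only, never updated in place.
        self.store[f"segments/{self.segment_id:08d}"] = list(self.buffer)
        self.segment_id += 1
        self.buffer.clear()
        self.last_flush = time.monotonic()

store = {}
w = SegmentWriter(store, refresh_interval=60.0)
for i in range(1000):
    w.write({"id": i})
w.flush()                  # commit the tail explicitly
assert len(store) == 1     # 1000 writes became a single "PUT"
```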
        
         | rakoo wrote:
         | I'm wondering if it would be better to have a LevelDB-like
         | approach here. Store the recent stuff in DynamoDB, and once it
         | hits the threshold, store it in a segment in S3. This is also
          | similar to how SQLite uses a WAL.
         | 
         | Really, nothing is ever new in computing.
        
       | ctxcode wrote:
        | Sounds like this is going to cost a lot of money (more than it
        | should).
        
       | huntaub wrote:
       | This is a super cool project, and I think that we will continue
       | to see more and more applications move towards an "on S3"
       | stateless architecture. That's part of the reason why we are
       | building Regatta [1]. We are trying to enable folks who are
       | running software that needs file system semantics (like Lucene)
        | to get super-fast NVMe-like latencies on data that's really in
        | S3. While this is awesome, I worry about all of the
        | applications which _don't_ have someone rewrite a bunch of
        | layers to work on S3. That's where we come in.
       | 
       | [1] https://regattastorage.com
        
       | hipadev23 wrote:
        | I know object storage backends are all the rage, but this is
        | about the most capital-intensive thing you can do on the major
        | cloud providers. Storage and reads are cheap, but writes and
        | list operations are insanely expensive.
       | 
       | Once you hook these backends up to real-time streaming updates,
       | transactions, heavy indexing, or immutable backends that cause
       | constant churn (hive/hudi/iceberg/delta lake), you're in for a
       | bad time financially.
        
       | parhamn wrote:
       | Stateless S3 apps have much more appeal given the existence of
       | Cloudflare R2 -- bandwidth is free and GetObject is $0.36 per
       | million requests.
        
       ___________________________________________________________________
       (page generated 2024-10-10 23:01 UTC)