[HN Gopher] Show HN: SeekStorm - open-source sub-millisecond sea...
       ___________________________________________________________________
        
       Show HN: SeekStorm - open-source sub-millisecond search in Rust
        
       Author : wolfgarbe
       Score  : 102 points
       Date   : 2024-12-02 13:06 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Leoko wrote:
       | Sub-millisecond latency sounds impressive, but isn't network
       | latency going to overshadow these gains in most real-world
       | scenarios?
        
         | wolfgarbe wrote:
         | It depends on the application.
         | 
          | When using SeekStorm as a server, keeping the latency per query
          | low increases the throughput and the number of parallel queries
          | a server can handle on given hardware. An efficient search
          | server can reduce the required investment in server hardware.
         | 
         | In other cases, only the local search performance matters,
         | e.g., for data mining or RAG.
         | 
         | Also, it's not only about averages but also about tail
         | latencies. While network latencies dominate the average search
         | time, that is not the case for tail latencies, which in turn
         | heavily influence user satisfaction and revenue in online
         | shopping.
        
         | pornel wrote:
         | When search is cheap and quick, it's possible to improve search
         | by postprocessing search results and running more queries when
         | necessary.
         | 
         | I use Tantivy, and add refinements like: if the top result is
         | objectively a low-quality one, it's usually a query with a typo
         | finding a document with the same typo, so I run the query again
         | with fuzzy spelling. If all the top results have the same tag
         | (that isn't in the query), then I mix in results from another
         | search with the most common tag excluded. If the query is a
         | word that has multiple meanings, I can ensure that each meaning
         | is represented in the top results.
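          | 
          | A minimal sketch of the first refinement above (a toy scorer
          | and a hypothetical search_with_fallback helper, not Tantivy's
          | actual API): run the exact query, and if the best hit scores
          | below a threshold, retry with fuzzy matching.

```rust
// Sketch of the "retry with fuzzy spelling" fallback: a toy word-match
// scorer stands in for a real engine's exact and fuzzy queries.
fn edit_distance(a: &str, b: &str) -> usize {
    // Classic dynamic-programming Levenshtein distance.
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Count query words that appear in the doc, exactly or within edit distance 1.
fn score(query: &str, doc: &str, fuzzy: bool) -> usize {
    query
        .split_whitespace()
        .filter(|q| {
            doc.split_whitespace()
                .any(|w| if fuzzy { edit_distance(q, w) <= 1 } else { *q == w })
        })
        .count()
}

// Rank docs by score; if the top exact hit scores below `threshold`
// (e.g. a typo in the query), rerun the search with fuzzy matching.
fn search_with_fallback(query: &str, docs: &[&str], threshold: usize) -> Vec<usize> {
    let rank = |fuzzy: bool| -> Vec<(usize, usize)> {
        let mut hits: Vec<(usize, usize)> = docs
            .iter()
            .enumerate()
            .map(|(i, d)| (i, score(query, d, fuzzy)))
            .filter(|&(_, s)| s > 0)
            .collect();
        hits.sort_by(|x, y| y.1.cmp(&x.1));
        hits
    };
    let exact = rank(false);
    if exact.first().map_or(true, |&(_, s)| s < threshold) {
        rank(true).into_iter().map(|(i, _)| i).collect()
    } else {
        exact.into_iter().map(|(i, _)| i).collect()
    }
}
```

The same pattern generalizes to the other refinements: inspect the
cheap first result set, then decide whether to issue additional
queries.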
        
         | intelVISA wrote:
         | A typical server is serving more than one request at a time,
         | hopefully.
        
       | tlofreso wrote:
       | Demo = impressed.
       | 
        | How's SeekStorm's prowess in mid-cap enterprise? How hairy is the
        | ingest pipeline for sources like decade-old SharePoint sites,
        | PDFs with partial text layers, Excel, email .msg files, etc.?
        
         | wolfgarbe wrote:
         | Yes, integration in complex legacy systems is always
         | challenging. As a small startup, we are concentrating on core
         | search technology to make search faster and to make the most of
         | available server infrastructure. As SeekStorm is open-source,
         | system integrators can take it from there.
        
         | fiedzia wrote:
         | Same as any other full-text search solution - it's your job to
         | integrate it.
        
         | jazzyjackson wrote:
          | On that topic, can anybody chime in on the state of the art in
          | PDF OCR? Even if that's a multimodal LLM - I've used ChatGPT to
          | extract tabular data from images, but I need something I can
          | self-host for proprietary data.
        
       | DonnyV wrote:
        | The fact that the only way they could squeeze out more
        | performance was by switching from C# to Rust says a lot about
        | C#. .NET has come a long way with performance.
        
         | neonsunset wrote:
          | I find the note unfortunate. They state a 2-4x performance
          | improvement. I'm sure looking at the implementation with a
          | profiler and tactically optimizing critical paths would have
          | yielded them 2-3x as is. They could also have reached out to
          | the .NET JIT team via issues or discussions on GitHub for
          | guidance, especially since .NET has a rich set of SIMD APIs
          | very well suited for implementing SOTA text search algorithms
          | (and also comes with many out of the box; seriously, look at
          | e.g. https://devblogs.microsoft.com/dotnet/performance-
          | improvemen...)
         | 
         | The note also states "No framework dependencies (CLR or JVM
         | virtual machines)" which isn't true either - 'dotnet publish
         | /p:PublishSingleFile=true /p:PublishTrimmed=true' gives the
         | same "dependency-less" experience. "Ahead-of-time instead of
         | just-in-time compilation" is similarly wrong - replace previous
         | args with '/p:PublishAot=true' and you get a native binary.
        
           | wolfgarbe wrote:
            | The 2-4x speed ratio was not meant to denounce C#, which is
            | a great language I loved programming in for over two
            | decades, coming from Delphi. Unfortunately, C# does not have
            | complete SIMD support. See our request to support the SSE4.2
            | _mm_cmpistrm instruction
            | https://github.com/dotnet/runtime/discussions/63332, which
            | we required for a vectorized intersection between two sorted
            | 16-bit arrays. We did not make the switch from C# to Rust
            | lightly, as porting a fairly large codebase is
            | time-consuming. We just wanted to share our experience for
            | our specific task, not make a general statement.
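            | 
            | For context, the scalar baseline for that operation is a
            | two-pointer merge; _mm_cmpistrm lets you compare a block of
            | elements from each side per instruction instead. A sketch of
            | the scalar version:

```rust
use std::cmp::Ordering;

// Scalar two-pointer intersection of two sorted u16 slices -- the
// baseline that SSE4.2's _mm_cmpistrm vectorizes by comparing 8
// elements of each side at once.
fn intersect_sorted(a: &[u16], b: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(a.len().min(b.len()));
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```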
        
             | neonsunset wrote:
              | Thank you. It is indeed true that .NET has some gaps in
              | its SIMD API, which might require either writing a
              | specific routine in C and P/Invoking it or implementing
              | the algorithm differently.
             | 
             | Were there any other factors that contributed to the
             | decision?
             | 
             | FWIW I forwarded the issue the discussion links to
             | dotnetevolution discord server.
        
               | wolfgarbe wrote:
                | Yes. We waited a long time for AOT compilation to
                | mature, to remove the need for users to install the .NET
                | runtime. But two years ago, when we decided to switch,
                | we still couldn't get AOT compilation of our codebase to
                | work without changes (perhaps it was somehow possible,
                | but the available documentation was not very verbose
                | about this). Also, there is still a performance gap. Of
                | course, this doesn't matter for most applications, where
                | the completeness and consistency of the framework and
                | the number of programmers fluent in the language might
                | matter more. But for a search server, we needed to
                | squeeze out every bit of performance we could get. And
                | other benchmarks seemed to echo our experience:
                | https://programming-language-benchmarks.vercel.app/rust-
                | vs-c...
        
               | neonsunset wrote:
               | That specific suite is...not the best.
               | https://benchmarksgame-
               | team.pages.debian.net/benchmarksgame/... is more focused
               | on optimized implementations and showcases where the
               | performance of .NET places given submissions someone
               | cared to spend some time optimizing.
               | 
                | It is true that 2 years ago Native AOT was in its
                | infancy; it has improved substantially since then.
                | Self-contained trimmed binaries already worked back
                | then, however.
               | 
                | I guess it is more about unfortunate timing than
                | anything - even the compiler itself moves fast, and in
                | some areas the difference in codegen quality is very
                | significant between .NET 7, 8, and 9.
        
       | remram wrote:
        | What is the story for a multi-language corpus? Do I have to do
        | my own stop-word pruning, tokenizing, lemmatizing, etc.? This is
        | usually the case with full-text search solutions, and it is a
        | pain.
        
         | wolfgarbe wrote:
          | We started by making the core search technology faster. Then
          | we added a Unicode character folding/normalization tokenizer
          | (diacritics, accents, umlauts, bold, italic, full-width
          | chars...). Last week we added a tokenizer that supports
          | Chinese word segmentation. Currently, we are working on a
          | multi-language tokenizer that segments Chinese, Japanese, and
          | Korean without switching the tokenizer.
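          | 
          | As a toy illustration of character folding (not SeekStorm's
          | actual tokenizer), one can lowercase, map a few diacritics and
          | full-width Latin forms to ASCII, and split on
          | non-alphanumerics:

```rust
// Toy character-folding tokenizer. A real implementation would use the
// full Unicode normalization/folding tables, not a handful of cases.
fn fold_char(c: char) -> char {
    match c {
        'ä' | 'á' | 'à' | 'â' => 'a',
        'ö' | 'ó' | 'ò' | 'ô' => 'o',
        'ü' | 'ú' | 'ù' | 'û' => 'u',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        // Full-width ASCII block (U+FF01..U+FF5E) maps onto ASCII.
        '\u{ff01}'..='\u{ff5e}' => char::from_u32(c as u32 - 0xfee0).unwrap(),
        _ => c,
    }
}

// Lowercase, fold each character, then split into alphanumeric tokens.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .chars()
        .map(fold_char)
        .collect::<String>()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}
```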
        
         | jazzyjackson wrote:
          | Re: stemming and lemmatization, I just want to plug the most
          | impressive NLP stack I ever used, ChatScript. Really, it's for
          | building dialog trees, where it walks down a branch of
          | conversation using effectively switch statements, but with
          | really rich conceptual pattern matching and capturing - so
          | somewhere in the middle of the stack it has excellent
          | abstraction from word input to general concept (in WordNet),
          | performing all the spell correction (according to your
          | dictionary), stemming, lemmatization, and disambiguation.
          | 
          | I've had it in mind for a while to build a fuzzy search tool
          | based on parsing each phrase into concepts, parsing the search
          | query into concepts, and finding the nearest match based on
          | that. It's a C library and very fast.
          | 
          | https://github.com/ChatScript/ChatScript
          | 
          | Looks like it hasn't been committed to in some time; I'll have
          | to check out their blog and see what's up. I guess with the
          | advent of LLMs, dialog trees are passé.
        
       | athompsondog wrote:
        | I wonder how BurntSushi feels about this
        
       | throwaway888abc wrote:
       | Impressive, bookmarked, upvoted.
       | 
       | Appreciate the demo: https://deephn.org/?q=apple+silicon
        
       | justmarc wrote:
       | I really like your approach. Impressed by your care for
       | performance and your fast pace of adding what appears to be
       | pretty complex stuff, while making sure it stays performant.
       | 
       | Keep it up!
       | 
       | Bookmarked.
        
       | distracted_boy wrote:
       | How does this compare to PostgreSQL?
        
         | wolfgarbe wrote:
          | PostgreSQL is an SQL database that also offers full-text
          | search (FTS); with extensions like pg_search it also supports
          | BM25 scoring, which is essential for lexical search. SeekStorm
          | is centered on full-text search only; it doesn't offer SQL.
          | 
          | Performance-wise, it would indeed be interesting to run a
          | benchmark. The third-party open-source benchmark we are
          | currently using (search_benchmark_game) does not yet support
          | PostgreSQL. So yes, that comparison is still pending.
        
       | treefarmer wrote:
       | Is there distributed server support? I see it on the list of new
       | features with (currently PoC) next to it, but is the code for the
       | PoC available anywhere?
       | 
       | Also, would there be any potential issues if the index was
       | mounted on shared storage between multiple instances?
        
       | Thaxll wrote:
        | It feels like everyone re-implements the same application;
        | searching text in language x.y.z has been done a million times,
        | and search speed is not a problem. So what differentiates this
        | solution from the dozen+ mature ones?
        
       ___________________________________________________________________
       (page generated 2024-12-02 23:01 UTC)